After the data and ontology layers of the Semantic Web stack
have achieved a certain level of maturity in standard
recommendations such as RDF and OWL, the query and the rules
layers seem to be the next building blocks to be finalized. For the former, SPARQL [18], the W3C's proposed query language, seems to be close to recommendation, although the Data Access working group is still struggling with aspects such as a formal semantics and layering on top of OWL and RDFS. For the latter, the RIF working group2, which is responsible for the rules layer, is just producing its first concrete
results. Besides aspects like business rules exchange or reactive
rules, deductive rules languages on top of RDF and OWL are of
special interest to the RIF group. One such deductive rules
language is Datalog, which has been successfully applied in areas
such as deductive databases and thus might be viewed as a query
language itself. Let us briefly recap our starting
points:
Datalog and SQL. Analogies between Datalog and
relational query languages such as SQL are well-known and
-studied. Both formalisms cover UCQ (unions of conjunctive
queries), where Datalog adds recursion, particularly unrestricted
recursion involving nonmonotonic negation (aka unstratified
negation as failure). Still, SQL is often viewed as more powerful in several respects. On the one hand, SQL's lack of recursion has been partly remedied in the 1999 version of the standard [20]. On the other hand, aggregates and external function calls are missing in pure Datalog. However, Datalog is evolving as well: with recent extensions of Datalog towards Answer Set Programming (ASP) - a logic programming paradigm extending and building on top of Datalog - many of these issues have been addressed, for instance by declarative semantics for aggregates [9] and for external predicates [8].
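As a minimal illustration of the recursion that plain SQL lacked prior to SQL:1999 but which Datalog supports natively, consider the usual transitive-closure program (the predicate names edge and reach are ours, chosen purely for illustration):

  % reach is the transitive closure of the binary relation edge
  reach(X,Y) :- edge(X,Y).
  reach(X,Z) :- reach(X,Y), edge(Y,Z).

Such a query needs a recursive WITH clause in SQL:1999 and is not expressible in unions of conjunctive queries at all.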
The Semantic Web rules layer. Remarkably, logic programming
dialects such as Datalog with nonmonotonic negation which are
covered by Answer Set Programming are often viewed as a natural
basis for the Semantic Web rules layer [7]. Current ASP systems
offer extensions for retrieving RDF data and querying OWL
knowledge bases from the Web [8]. Particular
concerns exist in the Semantic Web community with respect to adding rules with nonmonotonic negation [3], since these involve a form of closed-world reasoning on top of RDF and OWL, both of which adopt an open-world assumption. Recent proposals for resolving this issue suggest a ``safe'' use of negation as failure over finite contexts only, also called scoped negation
[17].
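A hedged sketch of what such scoped negation might look like as a rule (all names are ours, chosen purely for illustration; triple(S,P,O,G) is meant to hold for the triples of the finite graph accessible at IRI G):

  % persons mentioned in graph g for which g lists no mbox
  noEmail(X) :- triple(X, "rdf:type", "foaf:Person", g), not hasEmail(X).
  hasEmail(X) :- triple(X, "foaf:mbox", M, g).

The negation as failure here is evaluated only over the fixed, finite graph g, not over the open Web.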
The
Semantic Web query layer - SPARQL. Since we base our
considerations in this paper on the assumption that similar
correspondences as between SQL and Datalog can be established for
SPARQL, we have to observe that SPARQL inherits a lot from SQL,
but there also remain substantial differences: On the one hand, SPARQL does not deal with nested queries or recursion, which is indeed surprising given that SPARQL is a graph query language on RDF, where typical recursive queries, such as the transitive closure of a property, might seem very useful. Likewise, aggregation (such as count, average, etc.) of object values in RDF triples, which might appear equally useful, has not yet been included in the current standard. On the other hand, subtleties like blank nodes (aka bNodes) or optional graph patterns, which are similar to but (as we will see) different from outer joins in SQL or relational algebra, are not straightforwardly translatable to Datalog. The goal of this paper is to shed light on the actual relation between declarative rules languages such as Datalog and SPARQL, and thereby also to provide valuable input for the ongoing discussions on the Semantic Web rules layer, in particular its integration with SPARQL, taking into account the likely direction that LP-style rules languages will play a significant role in this context.
Although the SPARQL specification does not yet seem 100% stable, having just taken a step back from Candidate Recommendation to Working Draft, we think that it is not too early for this exercise, as our investigation yields valuable insights and positive side effects. More precisely, the
contributions of the present work are:
- We refine and extend a recent proposal to formalize the
semantics of SPARQL from Pérez et al. [16], presenting three
variants, namely c-joining, s-joining and b-joining semantics
where the latter coincides with [16], and can thus be
considered normative. We further discuss how aspects such as compositionality or idempotency of joins are treated in these semantics.
- Based on the three semantic variants, we provide
translations from a large fragment of SPARQL queries to
Datalog, which give rise to implementations of SPARQL on top of
existing engines.
- We provide some straightforward extensions of SPARQL such
as a set difference operator MINUS, and nesting of
ASK queries in FILTER expressions.
- Finally, we discuss an extension towards recursion by
allowing bNode-free-CONSTRUCT queries as part of the
query dataset, which may be viewed as a light-weight, recursive
rule language on top of RDF.
The remainder of this paper is structured as follows: In
Sec. 2 we first overview SPARQL,
discuss some issues in the language (Sec. 2.1) and then define its formal semantics (Sec.
2.2). After introducing a general
form of Datalog with negation as failure under the answer set
semantics in Sec. 3, we proceed with
the translations of SPARQL to Datalog in Sec. 4. We finally discuss the above-mentioned
language extensions in Sec. 5,
before we conclude in Sec. 6.
2 RDF and SPARQL
In examples, we will subsequently refer to
the two RDF graphs in Fig. 1 which
give some information about Bob and Alice. Such information is common in FOAF files, which are gaining popularity as a means to describe personal data. Similarities with existing examples in [18] are on purpose. We assume the two RDF graphs are given in TURTLE [2] notation and accessible via the IRIs ex.org/bob and alice.org.3
We assume the pairwise disjoint, infinite sets I, B, L, and V, which denote IRIs, Blank nodes, RDF Literals, and Variables, respectively. In this paper, an RDF graph is then a finite set of triples from (I ∪ B ∪ L) × I × (I ∪ B ∪ L),4 dereferenceable by an IRI. A SPARQL query is a quadruple Q = (R, P, DS, SM), where R is a result form, P is a graph pattern, DS is a dataset, and SM is a set of solution
modifiers. We refer to [18] for syntactical details and
will explain these in the following as far as necessary. In this
paper, we will mostly ignore solution modifiers; thus we will usually write queries as triples Q = (R, P, DS), and will use the syntax for graph patterns introduced below.
Since we will,
to a large extent, restrict ourselves to SELECT queries,
it is sufficient for our purposes to describe result forms by sets of variables. Other result forms will be discussed in Sec.
5. For instance, let
denote the
query from Fig. 1, then
. Query
results in SPARQL are given by partial, i.e. possibly incomplete,
substitutions of variables in
by RDF terms. In traditional relational query languages, such
incompleteness is usually expressed using null values.
Using such null values we will write solutions as tuples
where the order of columns is determined by lexicographically
ordering the variables in
. Given a set of variables
, let
denote the
tuple obtained from lexicographically ordering .
The query from Fig. 1 with result
form
then
has solution tuples
,
,
. We write
substitutions in square brackets, so these tuples correspond to
the substitutions
,
,
,
, and
,
,
respectively.
We follow
the recursive definition of graph patterns from [16]:
- a tuple (s, p, o) is a graph pattern, where s, o ∈ I ∪ L ∪ V and p ∈ I ∪ V.5
- if P1 and P2 are graph patterns, then (P1 AND P2), (P1 OPT P2), (P1 UNION P2), and (P1 MINUS P2) are graph patterns.6
- if P is a graph pattern and i ∈ I ∪ V, then (GRAPH i P) is a graph pattern.
- if P is a graph pattern and R is a filter expression, then (P FILTER R) is a graph pattern.
For any pattern P, we denote by vars(P) the set of all variables occurring in P. As atomic filter expressions, SPARQL allows the unary predicates BOUND, isBLANK, isIRI, isLITERAL, a binary equality predicate '=' for literals, and other features such as comparison operators, data type conversion and string functions, which we omit here; see [18, Sec. 11.3] for details. Complex filter expressions can be built using the connectives '¬', '∧', and '∨'.
The dataset
of a SPARQL query
is defined by a default graph
plus a set of named graphs, i.e. pairs of IRIs and corresponding
graphs. Without loss of generality (there are other ways to
define the dataset such as in a SPARQL protocol query), we assume
given as the merge of
the graphs denoted by the IRIs given in a set of FROM and
FROM NAMED clauses. For instance, the query from Fig.
1 refers to the dataset which
consists of the default graph obtained from merging alice.org and ex.org/bob, plus an empty set of named graphs.
The relation between names and graphs in SPARQL is defined solely by stating that the IRI denotes a resource which is represented by the respective graph. In this paper, we assume that the IRIs indeed represent network-accessible resources from which the respective RDF graphs can be retrieved. This view has also been taken, e.g., in [17]. In particular,
this treatment is not to be confused with so-called named graphs
in the sense of [4].
We thus identify each IRI with the RDF graph available at this
IRI and each set of IRIs with the graph merge [13] of the respective IRIs.
This allows us to identify the dataset by a pair of sets of IRIs
with
and
denoting the (merged) default graph and the set of named graphs,
respectively. Hence, the following set of clauses
FROM <ex.org/bob>
FROM NAMED <alice.org>
defines the dataset DS = ({ex.org/bob}, {alice.org}).
2.1 Assumptions and Issues
In this section we will discuss
some important issues about the current specification, and how we
will deal with them here.
First, note that the default graph, even if specified by name in a FROM clause, is not automatically counted among the named graphs [18, Section 8, Definition 1]. An unbound variable in the GRAPH directive means any of the named graphs in DS, but does NOT necessarily include the default graph.
Example 1 This issue
becomes obvious in the following query with dataset
which has an
empty solution set.
SELECT ?N WHERE {?G foaf:maker ?M .
GRAPH ?G { ?X foaf:name ?N } }
We will sometimes find the following dataset closedness assumption (Definition 1, referred to again in Sec. 4) convenient to avoid such arguably unintuitive effects:
Under this assumption, the previous query has both
and
in its solution
set.
Some more remarks are in place concerning FILTER
expressions. According to the SPARQL specification ``Graph
pattern matching creates bindings of variables [where] it is
possible to further restrict solutions by constraining the
allowable bindings of variables to RDF Terms [with FILTER
expressions].'' However, it is not clearly specified how to
deal with filter constraints referring to variables which do not
appear in simple graph patterns. In this paper, for graph patterns of the form (P FILTER R) we tacitly assume safe filter expressions, i.e. that all variables used in the filter expression R also appear in the corresponding pattern P. This corresponds to the notion of safety in Datalog (see Sec. 3), where the built-in predicates (which obviously correspond to filter predicates) do not suffice to make unbound variables safe.
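A minimal Datalog illustration of this notion of safety (the predicates p, q, and r are ours, chosen only for illustration):

  % unsafe: Y occurs only in the built-in comparison, never in an ordinary body atom
  p(X,Y) :- q(X), Y > 3.
  % safe: Y is bound by the ordinary atom r(Y) before the comparison applies
  p(X,Y) :- q(X), r(Y), Y > 3.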
Moreover, the specification defines errors to avoid mistyped
comparisons, or evaluation of built-in functions over unbound
values, i.e. ``any potential solution that causes an error
condition in a constraint will not form part of the final
results, but does not cause the query to fail.'' These errors
propagate over the whole FILTER expression, also over
negation, as shown by the following example.
We will take special care for these errors, when defining the
semantics of FILTER expressions later on.
2.2 Formal Semantics of SPARQL
The semantics of SPARQL is
still not formally defined in its current version. This lack of
formal semantics has been tackled by a recent proposal of Pérez
et al. [16]. We will base our definitions on this proposal, but suggest three variants thereof, namely (a) bravely joining, (b) cautiously joining, and (c) strictly joining semantics. In particular, our
definitions vary from [16] in the way we
define joining unbound variables. Moreover, we will refine their
notion of FILTER satisfaction in order to deal with error
propagation properly.
We extend the set of RDF terms I ∪ B ∪ L by a dedicated constant null denoting the unknown value, which does not appear in any of I, B, or L, as it is commonly introduced when defining outer joins in relational algebra.
A substitution from V to I ∪ B ∪ L ∪ {null} is a partial function mapping variables to terms or null. We write substitutions in postfix notation: For a triple
pattern we denote
by the triple
obtained by applying the substitution to all variables in
. The domain of
,
, is the
subset of where is defined. For a
substitution and a set
of variables
we define
the substitution
with domain as follows:
Let and be
substitutions, then
is
the substitution obtained as follows:
Thus, in the union of two substitutions, defined values in one substitution take precedence over null values in the other. For instance, given the substitutions
and
we get:
Now, as opposed to [16], we define three
notions of compatibility between substitutions:
- Two substitutions are bravely compatible (b-compatible) when, for every variable in the domain of both, either one of them maps the variable to null or both map it to the same value; i.e., when the union of the two substitutions is again a substitution over the union of their domains.
- Two substitutions are cautiously compatible (c-compatible) when they are b-compatible and both map every variable in the domain of both to the same value, where null is treated as an ordinary value.
- Two substitutions are strictly compatible (s-compatible) when they are c-compatible and neither maps any variable in the domain of both to null.
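For illustration (the substitutions are ours, chosen only to exercise the three notions): let μ1 = [?Y/"Bob", ?Z/null], μ2 = [?Y/"Bob", ?Z/"Bobby"], and μ3 = [?Y/"Bob", ?Z/null]. Then μ1 and μ2 are b-compatible (the null may join with "Bobby") but neither c- nor s-compatible; μ1 and μ3 are b- and c-compatible (null joins null) but not s-compatible; μ2 with itself is b-, c-, and s-compatible.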
Analogously to [16] we define join,
union, difference, and outer join between two sets of
substitutions and
over domains
and , respectively, all
except union parameterized by
:
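In rough outline (a sketch only; for x ∈ {b, c, s} and Ω1, Ω2 sets of substitutions, with the notation ours), these operators take the following shape:

  Ω1 ⋈x Ω2 = { μ1 ∪ μ2 | μ1 ∈ Ω1, μ2 ∈ Ω2, and μ1, μ2 are x-compatible }
  Ω1 −x Ω2 = { μ1 ∈ Ω1 | no μ2 ∈ Ω2 is x-compatible with μ1 }
  Ω1 ⟕x Ω2 = (Ω1 ⋈x Ω2) ∪ (Ω1 −x Ω2)

with the union of two substitution sets being essentially ordinary set union (in this paper, after padding each substitution with null for the variables of the respective other domain, cf. the padding operation defined above).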
The semantics of a graph pattern over dataset
, can now be
defined recursively by the evaluation function returning sets of
substitutions.
Definition 2 (
Evaluation, extends [16, Def. 2])
Let
be a triple
pattern, graph
patterns,
a dataset,
and , and
, then the
-joining evaluation
is
defined as follows:
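In rough outline, and following the corresponding definition of [16] adapted to the x-joining operators above (a sketch; the GRAPH case, which switches the active graph, behaves analogously and is omitted here), the pattern cases are:

  [[t]]x_DS = { μ | dom(μ) = vars(t) and tμ is a triple of the default graph }   (t a triple pattern)
  [[(P1 AND P2)]]x_DS = [[P1]]x_DS ⋈x [[P2]]x_DS
  [[(P1 UNION P2)]]x_DS = [[P1]]x_DS ∪ [[P2]]x_DS
  [[(P1 OPT P2)]]x_DS = [[P1]]x_DS ⟕x [[P2]]x_DS
  [[(P1 MINUS P2)]]x_DS = [[P1]]x_DS −x [[P2]]x_DS
  [[(P FILTER R)]]x_DS = { μ ∈ [[P]]x_DS | the valuation of R on μ is true }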
Let be a FILTER
expression,
,
. The valuation of
on
substitution
,
written
takes one
of the three values (true, false, or error),7 and is defined as follows.
,
if:
(1)
;
(2)
;
(3)
;
(4)
;
(5)
;
(6)
;
(7)
;
(8)
;
(9)
.
,
if:
(1)
,
,
, or with
;
(2)
;
(3)
;
(4)
);
(5)
.
otherwise.
We will now exemplify the three different semantics defined
above, namely bravely joining (b-joining), cautiously joining
(c-joining), and strictly joining (s-joining) semantics. Taking a closer look at the AND and MINUS operators, one realizes that the three semantics differ only in how they join null values. Indeed, the AND operator behaves like the traditional natural join operator of relational algebra when no null values are involved.
Take for instance,
and
AND
. When viewing each solution
set as a relational table with variables denoting attribute
names, we can write:
[Tabular view of the join: solution sets with attributes ?X | ?Name (bindings for "Bob" and for alice.org#me / "Alice") are joined, yielding a result table with attributes ?X | ?Name | ?Friend.]
Differences between the three semantics appear when joining
over null-bound variables, as shown in the next
example.
Example 3 Let
be as before and
assume the following query, which might be considered a naive attempt to ask for pairs of persons who share the same name and nickname, where both name and nickname are optional:
Again, we consider the tabular view of the resulting
join:
[Two operand tables: one with attributes ?X1 | ?N, containing optional name bindings "Bob", null, "Bob", and alice.org#me | "Alice"; the other with attributes ?X2 | ?N, containing optional nickname bindings null, "Alice", "Bobby", and alice.org#me | null.]
Now, let us see what happens when we evaluate the join
with respect
to the different semantics. The following result table lists in
the last column which tuples belong to the result of b-, c- and
s-join, respectively.
[Result table with attributes ?X1 | ?N | ?X2; a final column marks for each tuple whether it belongs to the b-, c-, and/or s-join. Most tuples belong to the b-join only, the two tuples joining null with null belong to the b- and c-join, and exactly one tuple, with ?N bound to "Alice" and both ?X1 and ?X2 bound to non-null values, belongs to all three.]
Leaving aside the question of whether the query formulation was intuitively broken, we remark that only the s-join yields the expected result. At the very least we might argue that the liberal behavior of b-joins can be considered surprising in some cases. The c-joining semantics acts a bit more cautiously, in between the two, treating null values as normal values that are only unifiable with other null values.
Compared to how joins over incomplete relations are treated in
common relational database systems, the s-joining semantics might
be considered the intuitive behavior. Another interesting divergence (which would rather suggest adopting the c-joining semantics) shows up when we consider a simple idempotent join.
Example 4 Let us
consider the following single triple dataset
and the following simple query pattern:
Clearly, this pattern has the solution set
under all three semantics. Surprisingly,
has different solution
sets for the different semantics. First,
, but
, since null values are
not compatible under the s-joining semantics. Finally,
As shown by this example, under the reasonable assumption that the join operator is idempotent, i.e., that joining a solution set with itself yields the same solution set, only the c-joining semantics behaves correctly.
However, the
brave b-joining behavior is advocated by the current SPARQL
document, and we might also think of examples where this
obviously makes a lot of sense, especially when considering not explicit joins but the implicit joins within the OPT operator:
All three semantics may be considered as variations of the
original definitions in [16], for which the
authors proved complexity results and various desirable features,
such as semantics-preserving normal form transformations and
compositionality. The following proposition shows that all these
results carry over to the normative b-joining semantics:
Proposition 1 Given a dataset and a pattern
which does not
contain GRAPH patterns, the solutions of
as in
[
16] and
are in
1-to-1 correspondence.
Proof. Given
and
each substitution
obtained by
evaluation
can be
reduced to a substitution
obtained from the evaluation
in
[
16] by
dropping all mappings of the form
from
. Likewise, each
substitution
obtained from
can be
extended to a substitution
for
.
Following the definitions from the SPARQL specification and
[16], the
b-joining semantics is the only admissible definition. There are still advantages in gradually defining alternatives towards a more traditional treatment of joins involving nulls. On the one hand, as we have seen in the examples above, the brave view on joining unbound variables can have partly surprising results; on the other hand, as we will see, the c- and s-joining semantics allow for a more efficient implementation in terms of Datalog rules.
Let us now take a closer look at some properties of the three semantics defined above. As shown in [16], some implementations have a non-compositional semantics, leading to undesired effects such as non-commutativity of the join operator, etc. A semantics is called compositional if for each sub-pattern P' of P the result of evaluating P' can be used to evaluate P. Obviously, the c-, s-, and b-joining semantics defined here all retain this property, since all three are defined recursively and independently of the evaluation order of the sub-patterns.
The following proposition summarizes equivalences which hold
for all three semantics, showing some interesting additions to
the results of Pérez et al.
Proof. [Sketch.] (1-5) for the b-joining semantics are
proven in [
16], (1): for
c-joining and s-joining follows straight from the definitions.
(2)-(4): the substitution sets
,
,
provide counterexamples for c-joining and s-joining semantics
for all three equivalences (2)-(4). (5): The semantics of
FILTER expressions and
UNION is exactly the same
for all three semantics, thus, the result for the b-joining
semantics carries over to all three semantics. (6): follows
from the observations in Example
4.
Ideally, we would like to identify a subclass of queries where the three semantics coincide. Obviously, this is the case for any query involving neither UNION nor OPT operators. Pérez et al. [16] define a larger class of queries, including ``well-behaving'' optional patterns:
Definition 3 (
[16, Def. 4])
A UNION-free graph pattern P is well-designed if for every occurrence of a sub-pattern P' = (P1 OPT P2) of P and for every variable ?X occurring in P', the following condition holds: if ?X occurs both inside P2 and outside P', then it also occurs in P1.
As may easily be verified by the reader, neither Example 3 nor Example 5, which are both UNION-free, satisfies the well-designedness condition. Since in the general case the equivalences of Prop. 2 do not hold, we also need to consider nested UNION patterns as a potential source of null bindings which might affect join results. We extend the notion of well-designedness, which directly leads us to another correspondence in the subsequent proposition.
Definition 4
A graph pattern P is well-designed if the condition from Def. 3 holds and, for every occurrence of a sub-pattern P' = (P1 UNION P2) of P and for every variable ?X occurring in P', the following condition holds: if ?X occurs outside P', then it occurs in both P1 and P2.
Proposition 3 On
well-designed graph patterns the c-, s-, and b-joining
semantics coincide.
Proof. [Sketch.] Follows directly from the observation
that all variables which are re-used outside
must be bound to a
value unequal to
null in
due to well-designedness, and thus cannot generate
null bindings which might carry over to joins.
Likewise, we can identify ``dangerous'' variables in graph
patterns, which might cause semantic differences:
Definition 5 Let P' be a sub-pattern of P of either the form (P1 OPT P2) or (P1 UNION P2). Any variable in P' which violates the well-designedness condition is called possibly-null-binding in P.
Note that so far we have only defined the semantics in terms of a pattern P and a dataset DS, but have not yet taken the result form R of the query into account.
We now define solution tuples that were informally
introduced in Sec. 2. Recall that by
we denote
the tuple obtained from lexicographically ordering a set of
variables in . The notion
means that,
after ordering all variables
from a subset
are
replaced by null.
Let us remark at this point that, as for the discussion of the intuitiveness of the different join semantics in Examples 3-5, we have not yet considered combinations of different join semantics, e.g. using b-joins for OPT and c-joins for AND patterns. We leave this for future work.
3 Datalog and Answer Sets
In this paper we will use a very general form of Datalog
commonly referred to as Answer Set Programming (ASP), i.e.
function-free logic programming (LP) under the answer set
semantics [1,
11]. ASP is widely proposed as a useful tool for various
problem solving tasks in e.g. Knowledge Representation and
Deductive databases. ASP extends Datalog with useful features
such as negation as failure, disjunction in rule heads,
aggregates [9],
external predicates [8], etc.8
Let , , , be sets of
predicate, constant, variable symbols, and external predicate
names, respectively. Note that we assume all these sets except
and (which may
overlap), to be disjoint. In accordance with common notation in
LP and the notation for external predicates from [7] we will in the
following assume that and
comprise sets of
numeric constants, string constants beginning with a lower case
letter, or '"' quoted strings, and strings of the form
quoted-string
^
^
IRI ,
quoted-string
valid-lang-tag ,
is the set of string
constants beginning with an upper case letter. Given
an
atom is defined as
,
where is called the arity of
and
.
Moreover, we define a fixed set of external predicates
, , , ,
All external predicates
have a fixed semantics and fixed arities, distinguishing
input and output terms. The atoms
,
,
test
the input term
(in
square brackets) for being valid string representations of Blank
nodes, IRI References or RDF literals, returning an output value
,
representing truth, falsity or an error, following the semantics
defined in [18, Sec.
11.3]. For the predicate we
write atoms as
to denote
that
is
an input term, whereas
are output terms which may be bound by the external predicate.
The external atom
is true if
is an RDF
triple entailed by the RDF graph which is accessible at the respective IRI. For the moment, we
consider simple RDF entailment [13] only. Finally, we write
comparison atoms ' '
and '
' in infix notation
with
and the obvious semantics of (lexicographic or numeric)
(in)equality. Here, for
either or is an output term,
but at least one is an input term, and for
both and are
input terms.
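As a hedged illustration of how such external atoms are used in rules (the name rdf for the RDF-import predicate, the example IRI, and the rule itself are our assumptions, chosen only for illustration; concrete systems such as dlvhex write the atom as &rdf[...](...)):

  % collect all triples of the graph accessible at ex.org/bob into a local predicate
  triple(S, P, O, "ex.org/bob") :- rdf["ex.org/bob"](S, P, O).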
Definition 7 Finally, a rule is of the form

  h :- b1, ..., bm, not bm+1, ..., not bn.     (1)

where h is an atom, the bi are either atoms or external atoms, and not is the symbol for negation as failure.
We use H(r) to denote the head atom and B(r) to denote the set of all body literals B+(r) ∪ B-(r) of r, where B+(r) = {b1, ..., bm} and B-(r) = {bm+1, ..., bn}.
The notion of input and output terms in external atoms
described above denotes the binding pattern. More precisely, we
assume the following condition which extends the standard notion
of safety (cf. [21])
in Datalog with negation: Each variable appearing in a rule must appear in the positive body, either in an ordinary atom or as an output term of an external atom.
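For instance (a hedged sketch; the predicate name and the IRI are ours), the following rule is safe although X occurs in no ordinary body atom, since X is an output term of the external rdf atom:

  allSubjects(X) :- rdf["ex.org/bob"](X, P, O).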
Definition 8 A (logic) program is defined as a
set of safe rules of
the form (1).
The Herbrand base of a program , denoted
, is the set of
all possible ground versions of atoms and external atoms
occurring in obtained by
replacing variables with constants from , where we define for our purposes by
the union of the
set of all constants appearing in as well as the literals, IRIs, and distinct constants
for each blank node occurring in each RDF graph
identified9 by one of the IRIs in the
(recursively defined) set
, where is defined by the
recursive closure of all IRIs appearing in and all RDF graphs
identified by IRIs in this set.10 As long as we assume that the Web is finite, the grounding of a rule
,
, is
defined by replacing each variable with the possible elements of
, and the
grounding of program is
.
An interpretation relative to is any subset
containing only atoms. We
say that
is a
model of atom
, denoted
, if
. With
every external predicate name
with arity
we associate an
-ary Boolean
function (called
oracle function) assigning each tuple either 0 or 1.11 We say that
is a model of a
ground external atom
, denoted
,
iff
.
The semantics we use here generalizes the answer-set semantics
[11]12,
and is defined using the FLP-reduct [9], which is more
elegant than the traditional GL-reduct [11] of stable model
semantics and ensures minimality of answer sets also in the presence of external atoms.
Let be a ground rule. We
define (i)
iff
for all
and
for all
, and (ii)
iff
whenever
. We say that
is a
model of a program , denoted
, iff
for all
.
The FLP-reduct [9] of with respect to
, denoted
, is
the set of all
such that
.
is an
answer set of
iff
is a minimal
model of
.
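As a minimal illustration of answer sets for a non-recursive program with negation as failure (the program is ours, purely illustrative):

  person(bob).
  nick(bob, "Bobby").
  hasNick(X) :- nick(X, N).
  anonymous(X) :- person(X), not hasNick(X).

This program has the single answer set {person(bob), nick(bob,"Bobby"), hasNick(bob)}; anonymous(bob) is not derived, since hasNick(bob) holds.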
We did not consider further extensions common to many ASP dialects here, namely disjunctive rule heads and strong negation [11]. We note that for non-recursive programs, i.e. where the predicate dependency graph is acyclic, the answer set is unique. For the pure translation which we will give in Sec. 4, where we produce such non-recursive programs from SPARQL queries, we could equally use other semantics, such as the well-founded semantics [10], which coincides with ASP on non-recursive programs.
4 From SPARQL to Datalog
We are now ready to define a
translation from SPARQL to Datalog which can serve
straightforwardly to implement SPARQL within existing rules
engines. We start with a translation for c-joining semantics,
which we will extend thereafter towards s-joining and b-joining
semantics.
Figure 2: Translation from SPARQL queries to Datalog.
Let
, where
as defined
above. We translate this query to a logic program
defined as follows.
: -
: -
The first two rules serve to import the relevant RDF triples from
the dataset into a 4-ary predicate
. Under
the dataset closedness assumption (see Def. 1) we may replace the second rule set, which
imports the named graphs, by:
: -
Here, the auxiliary predicate stands for ``Herbrand universe'', where we use this name a bit sloppily, with the intention of covering all the relevant parts of the Herbrand universe, recursively importing all possible IRIs in order to emulate the dataset closedness assumption. It can be computed recursively over the input triples, i.e.
: -
: -
: -
: -
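To give a flavour of these import rules, here is a hedged sketch for the example dataset of Fig. 1 (the predicate names triple and hu, the constant default, and the concrete form of the external rdf atom are our assumptions, chosen only for illustration):

  % import the default graph, here the merge of ex.org/bob and alice.org
  triple(S, P, O, default) :- rdf["ex.org/bob"](S, P, O).
  triple(S, P, O, default) :- rdf["alice.org"](S, P, O).
  % under the dataset closedness assumption, import any graph whose IRI occurs in the data
  triple(S, P, O, G) :- hu(G), rdf[G](S, P, O).
  % hu collects the terms occurring in imported triples, one rule per position
  hu(X) :- triple(X, P, O, G).
  hu(X) :- triple(S, X, O, G).
  hu(X) :- triple(S, P, X, G).
  hu(X) :- triple(S, P, O, X).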
The remaining program
represents the actual query
translation, where is
defined recursively as shown in Fig. 2.
By we mean the
set of rules resulting from disassembling complex FILTER expressions (involving '¬', '∧', '∨') according to the rewriting defined by Lloyd and Topor [15], where we have to obey the semantics for errors, following Definition 2. In a nutshell, the rewriting
proceeds as follows: Complex filters involving negation '¬' are transformed into negation normal form. Conjunctions of filter expressions are
simply disassembled to conjunctions of body literals,
disjunctions are handled by splitting the respective rule for
both alternatives in the standard way. The resulting rules
involve possibly negated atomic filter expressions in the bodies.
Here, is translated
to
,
to
.
, ,
and their
negated forms are replaced by their corresponding external atoms
(see Sec. 3)
or
, etc.,
respectively.
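As a small, hedged illustration of the disjunction splitting (the predicate names, the truth constant true, and the exact rule shape are ours, purely for illustration): a sub-pattern (P FILTER (isIRI(?X) ∨ isLITERAL(?Y))) gives rise to two rules, one per disjunct, roughly of the form

  answer1(X, Y, default) :- answerP(X, Y, default), isIRI[X](true).
  answer1(X, Y, default) :- answerP(X, Y, default), isLITERAL[Y](true).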
The resulting program
implements the c-joining
semantics in the following sense:
Without giving a proof, we remark that the result follows
if we convince ourselves that
emulates
exactly the recursive definition of
. Moreover,
together with Proposition 3, we obtain soundness and completeness of the translation for the b-joining and s-joining semantics as well, for well-designed query patterns.
Corollary 1 For
, if
is well-designed,
then the extension of predicate
in
the unique answer set of
represents all and only
the solution tuples for with respect to the -joining semantics, for
.
Now, in order to obtain a proper translation for arbitrary
patterns, we obviously need to focus our attention on the
possibly-null-binding variables within the query pattern
. Let denote the
possibly-null-binding variables in a (sub)pattern
. We need to consider
all rules in Fig. 2 which involve
-joins, i.e. the rules
of the forms (2),(5) and (6). Since rules (5) and (6) do not make
this join explicit, we will replace them by the equivalent rules
(5') and (6') for
and
. The ``extensions'' to
s-joining and b-joining semantics can be achieved by rewriting
the rules (2) and (6'). The idea is to rename variables and add
proper FILTER expressions to these rules in order to
realize the b-joining and s-joining behavior for the variables in
.
The s-joining
behavior can be achieved by adding FILTER expressions
to the rule bodies of (2) and (6'). The resulting rules are
again subject to the
-rewriting as discussed above for the rules of the form (7). This
is sufficient to filter out any joins involving null
values, thus achieving s-joining semantics, and we denote the
program rewritten that way as
.
Obviously, the b-joining semantics is trickier to achieve, since we now have to relax the allowed joins in order to allow null bindings to join
with any other value. We will again achieve this result by
modifying rules (2) and (6') where we first do some variable
renaming and then add respective FILTER expressions to
these rules.
Step 1. We rename each variable in the respective rule bodies to
or , respectively, in
order to disambiguate the occurrences originally from sub-pattern
or , respectively. That
is, for each rule (2) or (6'), we rewrite the body to:
Step 2. We now add the following FILTER
expressions and
,
respectively, to the resulting rule bodies which ``emulate'' the
relaxed b-compatibility:
The rewritten rules are again subject to the rewriting. Note that, strictly speaking, the filter expression introduced here does not fulfill the assumption of safe filter expressions, since it creates new bindings for the joined variable. However, these can
safely be allowed here, since the translation only creates
valid input/output term bindings for the external Datalog
predicate ' '. The subtle
difference between and
lies in the
fact that preferably
``carries over'' bound values from or to
whereas
always takes
the value of . The effect
of this becomes obvious in the translation of Example 5 which we leave as an exercise to the reader. We
note that the potential exponential (with respect to ) blowup
of the program size by unfolding the filter expressions into
negation normal form during the rewriting13is not surprising, given the
negative complexity results in [16]. In total, we
obtain a program which
which reflects the normative
b-joining semantics. Consequently, we get sound and complete
query translations for all three semantics:
In the following, we will drop the superscript and implicitly refer to the normative b-joining translation/semantics.
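To convey the flavour of the overall translation, here is a hedged end-to-end sketch for the simple query SELECT ?X ?N FROM <ex.org/bob> WHERE { ?X foaf:knows ?Y . OPTIONAL { ?Y foaf:name ?N } } (all predicate names, the constants default and null, and the exact rule shapes are our assumptions; the rules actually generated by Fig. 2 differ in details such as naming and the handling of possibly-null-binding variables):

  % import the default graph
  triple(S, P, O, default) :- rdf["ex.org/bob"](S, P, O).
  % the two triple patterns
  answer2(X, Y, default) :- triple(X, "foaf:knows", Y, default).
  answer3(Y, N, default) :- triple(Y, "foaf:name", N, default).
  % OPTIONAL: join where possible, otherwise keep the left side with N unbound (null)
  answer1(X, Y, N, default)    :- answer2(X, Y, default), answer3(Y, N, default).
  answer1(X, Y, null, default) :- answer2(X, Y, default), not answer3x(Y, default).
  answer3x(Y, default)         :- answer3(Y, N, default).
  % project to the result form (?X, ?N)
  answer(X, N) :- answer1(X, Y, N, default).

The two rules defining answer1 mirror the outer-join character of OPTIONAL: the first joins where a name exists, the second falls back to null via negation as failure.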
5 Possible Extensions
As it turns out, the embedding of SPARQL in the rules world
opens a wide range of possibilities for combinations. In this
section, we will first discuss some straightforward extensions of
SPARQL which come practically for free with the translation to
Datalog provided before. We will then discuss the use of SPARQL
itself as a simple RDF rules language14 which allows one to combine RDF fact bases with implicitly specified further facts, and briefly discuss the semantics thereof. We conclude this
section with revisiting the open issue of entailment regimes
covering RDFS or OWL semantics in SPARQL.
As mentioned before, set difference is syntactically not present in the current SPARQL specification, though implicitly hidden there, and would need to be emulated via a combination of OPTIONAL and FILTER
constructs. As we defined the MINUS operator here in a
completely modular fashion, it could be added straightforwardly
without affecting the semantics definition.
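At the level of the Datalog translation, MINUS essentially amounts to negation as failure. A hedged sketch for a pattern asking for resources that have a foaf:name but no foaf:nick (the predicate names mirror the sketch of Sec. 4 and are again ours):

  answer1(X, N, default) :- answer2(X, N, default), not answer3x(X, default).
  answer2(X, N, default) :- triple(X, "foaf:name", N, default).
  answer3x(X, default)   :- triple(X, "foaf:nick", M, default).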
Nested
queries are a distinct feature of SQL not present in SPARQL. We suggest a simple but useful form of nested queries to be added: Boolean queries
with an empty result form (denoted by the keyword ASK)
can be safely allowed within FILTER expressions as an easy
extension fully compatible with our translation. Given query
, with
sub-pattern
we can modularly translate such subqueries by extending
with where
. Moreover, we have to rename predicate names
to
in . Some
additional considerations are necessary in order to combine this with arbitrarily complex filter expressions, and we probably need
to impose well-designedness for variables shared between
and
similar to Def. 4. We leave more
details as future work.
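A hedged sketch of the idea (predicate and rule names ours, purely illustrative): a filter expression ASK { ?X foaf:knows ?Y } occurring in the FILTER of an outer pattern would be compiled into an extra body literal ask1(X) (or not ask1(X) for a negated ASK) in the rule of the enclosing pattern, with the subquery translated into its own renamed rule set, e.g.

  % holds if ?X knows somebody in the default graph
  ask1(X) :- triple(X, "foaf:knows", Y, default).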
5.2 Result Forms and Solution Modifiers
We have covered only
SELECT queries so far. As shown in the previous section,
we can consider ASK queries equally. A limited form of the
CONSTRUCT result form, which allows one to construct new triples, could be emulated in our approach as well. Namely, we can
allow queries of the form
where
is a
graph pattern consisting only of bNode-free triple patterns. We
can model these by adding one rule (2) per triple pattern in the CONSTRUCT pattern - with that triple pattern as the rule head and the translation of the query pattern as the body - to the program. The
result graph is then naturally represented in the answer set of
the program extended that way in the extension of the predicate
.
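A hedged sketch (predicate names ours, mirroring the translation sketch of Sec. 4): the query CONSTRUCT { ?X foaf:knows ?Y } WHERE { ?Y foaf:knows ?X } over the default graph could be captured by the single rule

  constructed(X, "foaf:knows", Y) :- triple(Y, "foaf:knows", X, default).

with the result graph read off from the extension of the dedicated predicate (here constructed) in the answer set.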
As it turns out with the extensions defined in
the previous subsections, SPARQL itself may be viewed as an
expressive rules language on top of RDF. CONSTRUCT
statements have an obvious similarity with view definitions in
SQL, and thus may be seen as rules themselves.
Intuitively, in the translation of CONSTRUCT we ``stored'' the new triples in a new predicate outside the dataset. We can imagine a
similar construction in order to define the semantics of queries
over datasets mixing such CONSTRUCT statements with RDF
data in the same Turtle file.
Let us assume such a mixed file containing CONSTRUCT
rules and RDF triples web-accessible at IRI , and a query
, with
. The
semantics of a query over a dataset containing may then be defined by
recursively adding
to
for any CONSTRUCT query
in
plus the rules
(2) above with their head changed to
. We further need to add a rule
:
-
for each , in
order not to omit any of the implicit triples defined by such
``CONSTRUCT rules''. Analogously to the considerations for
nested ASK queries, we need to rename the
predicates and
constants in every subprogram
defined this way.
Naturally, the resulting programs possibly involve recursion,
and, even worse, recursion over negation as failure. Fortunately,
the general answer set semantics, which we use, can cope with
this. For some important aspects on the semantics of such
distributed rules and facts bases, we refer to [17], where we also
outline an alternative semantics based on the well-founded
semantics. A more in-depth investigation of the complexity and
other semantic features of such a combination is on our
agenda.
The current SPARQL specification does not treat
entailment regimes beyond RDF simple entailment. Strictly
speaking, even RDF entailment is already problematic as a basis
for SPARQL query evaluation; a simple query pattern like
would have infinitely many solutions even on the empty (sic!)
dataset by matching the infinitely many axiomatic triples in the
RDF(S) semantics.
Finite rule sets which approximate the RDF(S) semantics in
terms of positive Datalog rules [17] have been
implemented in systems like TRIPLE15 or JENA16.
Similarly, fragments and extensions of OWL [12,3,14] definable in terms of
Datalog rule bases have been proposed in the literature. Such
rule bases can be parametrically combined with our translations,
implementing what one might call (at least an approximation of) RDFS or OWL entailment.
working group will define such reduced entailment regimes.
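For illustration, two of the well-known Datalog-style RDFS inference rules, written here over the triple predicate of our translation (a hedged sketch; complete rule sets are given in the cited literature):

  % subclass transitivity (rdfs11) and type propagation (rdfs9)
  triple(C, "rdfs:subClassOf", E, default) :- triple(C, "rdfs:subClassOf", D, default), triple(D, "rdfs:subClassOf", E, default).
  triple(X, "rdf:type", D, default)        :- triple(X, "rdf:type", C, default), triple(C, "rdfs:subClassOf", D, default).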
More complex issues arise when combining a nonmonotonic query
language like SPARQL with ontologies in OWL. An embedding of
SPARQL into a nonmonotonic rules language might provide valuable
insights here, since it opens up a whole body of work done on
combinations of such languages with ontologies [7,19].
6 Conclusions & Outlook
In this paper, we presented three
possible semantics for SPARQL based on [16] which differ mainly
in their treatment of joins and their translations to Datalog
rules. We discussed intuitive behavior of these different joins
in several examples. As it turned out, the s-joining semantics, which is close to the traditional treatment of joins over incomplete relations, and the c-joining semantics are nicely embeddable into Datalog. The b-joining semantics, which reflects the normative behavior described by the current SPARQL specification, is the most difficult to translate. We also suggested some extensions of SPARQL based on this translation. Further, we hope to have
contributed to clarifying the relationships between the Query,
Rules and Ontology layers of the Semantic Web architecture with
the present work.
A prototype of the presented translation has been implemented
on top of the dlvhex system, a flexible framework for
developing extensions for the declarative Logic Programming
Engine DLV17. The prototype is available as a
plugin at http://con.fusion.at/dlvhex/.
The web-page also provides an online interface for evaluation,
where the reader can check translation results for various
example queries, which we had to omit here for space reasons. We have currently implemented the c-joining and b-joining semantics, and we plan to gradually extend the prototype towards the features mentioned in Sec. 5, in order to query mixed RDF+SPARQL rule and fact bases. The implementation of further extensions, such as the integration of aggregates typical for database query languages, which have recently been defined for recursive Datalog programs in a declarative way compatible with the answer set semantics [9], is on our agenda.
We are currently not aware of any other engine implementing the
full semantics defined in [16].
Special
thanks go to Jos de Bruijn and Reto Krummenacher for discussions
on earlier versions of this document, to Bijan Parsia, Jorge
Pérez, and Andy Seaborne for valuable email-discussions, to Roman
Schindlauer for his help on prototype implementation on top of
dlvhex, and to the anonymous reviewers for various useful
comments. This work is partially supported by the Spanish MEC
under the project TIC-2003-9001 and by the EC funded projects
TripCom (FP6-027324) and KnowledgeWeb (IST 507482).
- 1
- C. Baral.
Knowledge Representation, Reasoning and Declarative Problem
Solving.
Cambridge Univ. Press, 2003.
- 2
- D. Beckett.
Turtle - Terse RDF Triple Language. Tech. Report, 4 Apr.
2006.
- 3
- J. de Bruijn, A. Polleres, R. Lara, D. Fensel.
OWL DL vs. OWL Flight: Conceptual modeling and reasoning for
the semantic web.
In Proc. WWW-2005, 2005.
- 4
- J. Carroll, C. Bizer, P. Hayes, P. Stickler.
Named graphs.
Journal of Web Semantics, 3(4), 2005.
- 5
- R. Cyganiak.
A relational algebra for SPARQL.
Tech. Report HPL-2005-170, HP Labs, Sept. 2005.
- 6
- J. de Bruijn, E. Franconi, S. Tessaris.
Logical reconstruction of normative RDF.
OWL: Experiences and Directions Workshop (OWLED-2005),
2005.
- 7
- T. Eiter, G. Ianni, A. Polleres, R. Schindlauer, H.
Tompits.
Reasoning with rules and ontologies.
Reasoning Web 2006, 2006. Springer
- 8
- T. Eiter, G. Ianni, R. Schindlauer, H. Tompits.
A Uniform Integration of Higher-Order Reasoning and External
Evaluations in Answer Set Programming.
Int.l Joint Conf. on Art. Intelligence (IJCAI),
2005.
- 9
- W. Faber, N. Leone, G. Pfeifer.
Recursive aggregates in disjunctive logic programs: Semantics
and complexity.
Proc. of the 9th European Conf. on Art. Intelligence (JELIA
2004), 2004. Springer.
- 10
- A. V. Gelder, K. Ross, J. Schlipf.
Unfounded sets and well-founded semantics for general logic
programs.
7th ACM Symp. on Principles of Database Systems, 1988.
- 11
- M. Gelfond, V. Lifschitz.
Classical Negation in Logic Programs and Disjunctive
Databases.
New Generation Computing, 9:365-385, 1991.
- 12
- B. N. Grosof, I. Horrocks, R. Volz, S. Decker.
Description logic programs: Combining logic programs with
description logics.
Proc. WWW-2003, 2003.
- 13
- P. Hayes.
RDF semantics.
W3C Recommendation, 10 Feb. 2004. http://www.w3.org/TR/rdf-mt/
- 14
- H. J. ter Horst.
Completeness, decidability and complexity of entailment for RDF
Schema and a semantic extension involving the OWL
vocabulary.
Journal of Web Semantics, 3(2), July 2005.
- 15
- J. W. Lloyd, R. W. Topor.
Making prolog more expressive.
Journal of Logic Programming, 1(3):225-240, 1984.
- 16
- J. Pérez, M. Arenas, C. Gutierrez.
Semantics and complexity of SPARQL.
The Semantic Web - ISWC 2006, 2006. Springer.
- 17
- A. Polleres, C. Feier, A. Harth.
Rules with contextually scoped negation.
Proc. 3rd European Semantic Web Conf. (ESWC2006), 2006. Springer.
- 18
- E. Prud'hommeaux, A. Seaborne (eds.).
SPARQL Query Language for RDF,
W3C Working Draft, 4 Oct. 2006. http://www.w3.org/TR/rdf-sparql-query/
- 19
- R. Rosati.
Reasoning with Rules and Ontologies.
Reasoning Web 2006, 2006. Springer.
- 20
- SQL-99.
Information Technology - Database Language SQL- Part 3: Call
Level Interface (SQL/CLI).
Technical Report INCITS/ISO/IEC 9075-3, INCITS/ISO/IEC, Oct.
1999.
Standard specification.
- 21
- J. D. Ullman.
Principles of Database and Knowledge Base
Systems.
Computer Science Press, 1989.
Footnotes
- 1
- An extended technical report of this article is available
at http://www.polleres.net/publications/GIA-TR-2006-11-28.pdf. This work was mainly conducted under a Spanish MEC grant at Universidad Rey Juan Carlos, Móstoles, Spain.
- 2
- http://www.w3.org/2005/rules/wg
- 3
- For reasons of legibility and conciseness, we omit the
leading 'http://' or other schema identifiers in IRIs.
- 4
- Following SPARQL, we are slightly more general than the
original RDF specification in that we allow literals in subject
positions.
- 5
- We do not consider bNodes in patterns as these can be
semantically equivalently replaced by variables in graph
patterns [6].
- 6
- Note that AND and MINUS are not designated
keywords in SPARQL, but we use them here for reasons of
readability and in order to keep with the operator style
definition of [16]. MINUS is
syntactically not present at all, but we will suggest a syntax
extension for this particular keyword in Sec. 5.
- 7
- The three values stand for ``true'', ``false'', and errors, respectively; see [18, Sec. 11.3] and Example 2 for details.
- 8
- We consider ASP, more precisely a simplified version of ASP with so-called HEX-programs [8], here, since it is to date the most general extension of Datalog.
- 9
- By ``identified'' we mean here that IRIs denote network
accessible resources which correspond to RDF graphs.
- 10
- We assume the number of accessible IRIs to be finite.
- 11
- The notion of an oracle function reflects the intuition
that external predicates compute (sets of) outputs for a
particular input, depending on the interpretation. The
dependence on the interpretation is necessary for instance for
defining the semantics of external predicates querying OWL
[8] or
computing aggregate functions.
- 12
- In fact, we use slightly simplified definitions from
[7] for
HEX-programs, with the sole difference that we restrict
ourselves to a fixed set of external predicates.
- 13
- Lloyd and Topor can avoid this potential exponential blowup
by introducing new auxiliary predicates. However, we cannot do
the same trick, mainly for reasons of preserving safety of
external predicates as defined in Sec. 3.
- 14
- Thus, the ``...(and back)'' in the title of this
paper!
- 15
- http://triple.semanticweb.org/
- 16
- http://jena.sourceforge.net/
- 17
- http://www.dlvsystem.com/