rdf4j icon indicating copy to clipboard operation
rdf4j copied to clipboard

Issues with FILTER NOT IN clause

Open jetztgradnet opened this issue 2 years ago • 10 comments

Current Behavior

I'm porting an application originally written for Apache Jena to use RDF4J instead.

One unit test runs a query like this (combined example with data plus query):

prefix ex:      <http://example.com/> 
prefix xsd:        <http://www.w3.org/2001/XMLSchema#> 

SELECT * WHERE {
  # Sample correct data for testing
  VALUES (?sub ?obj) {
  	(ex:wrong1 "value1-en"@de) # 1 error
  	(ex:wrong2 "value1-de"@en)  # 1 error
  	(ex:wrong3 "value NOT")  # 1 error
  	(ex:wrong4 "111"^^xsd:integer)  # 1 error
  }
  FILTER( ?obj NOT IN ( 
        "1"^^xsd:integer ,
        "value1-de"@de ,
        "value1-en"@en , 		
        "value"^^xsd:string ))
 }

I would expect that four solutions are returned, but RDF4J only returns two (the two language strings). Interestingly/strangely, commenting the value with the xsd:integer makes the third value (plain xsd:string) also part of the result.

So it looks like there is some issue when comparing values of different data types in the NOT IN filter clause.

As a slightly related issue, running this as NOT IN (two or more whitespaces between NOT and IN produces an error (the query is generated so there is little control over whitespace). Is RDF4J rather strict here or does the grammar really not allow multiple whitespaces here?

prefix ex:      <http://example.com/> 
prefix xsd:        <http://www.w3.org/2001/XMLSchema#> 

SELECT * WHERE {
  # Sample correct data for testing
  VALUES (?sub ?obj) {
  	(ex:wrong1 "value1-en"@de) # 1 error
  }

  # two whitespace between NOT and IN
  FILTER( ?obj NOT   IN ( 
        "value1-en"@en , 
        "value"^^xsd:string ))
 }

Expected Behavior

  • I would expect that all four solutions are returned by the provided query
  • I would expect that I can also use NOT IN with multiple white space instead of just NOT IN

Steps To Reproduce

No response

Version

4.0.0

Are you interested in contributing a solution yourself?

Perhaps?

Anything else?

No response

jetztgradnet avatar May 31 '22 21:05 jetztgradnet

I wonder if this could be related to https://github.com/eclipse/rdf4j/issues/3696 ?

The specification for NOT IN says that it should be equivalent to != so that ?a NOT IN (1,2,3) should be the same as writing (?a != 1) && (?a != 2) && (?a != 3). Does this produce the same result for you?

As for the whitespace I'm not sure how strict we should be.

Edit: According to the grammar specification the amount of white space between NOT and IN should not matter.

hmottestad avatar Jun 01 '22 06:06 hmottestad

Regarding using !=: this is a generated query, so I didn't want to change the original templates producing this query... But I added a version of this to the unit test below (testFilter_NotEquals_withMixedDatatypes()) and to my surprise it didn't change the behavior, it still only returns two solutions.

A unit test illustrating the two issues:

import static org.junit.Assert.assertEquals;

import org.eclipse.rdf4j.query.TupleQuery;
import org.eclipse.rdf4j.query.TupleQueryResult;
import org.eclipse.rdf4j.repository.Repository;
import org.eclipse.rdf4j.repository.RepositoryConnection;
import org.eclipse.rdf4j.repository.sail.SailRepository;
import org.eclipse.rdf4j.sail.memory.MemoryStore;
import org.junit.Test;

public class FilterTest {

    @Test
    public void testFilter_NotIn_withMixedDatatypes() {
        Repository repo = new SailRepository(new MemoryStore());
        try (RepositoryConnection conn = repo.getConnection()) {
            String queryString = "prefix ex:      <http://example.com/> \n"
                    + "prefix xsd:        <http://www.w3.org/2001/XMLSchema#> \n"
                    + "\n"
                    + "SELECT * WHERE {\n"
                    + "  # Sample correct data for testing\n"
                    + "  VALUES (?sub ?obj) {\n"
                    + "    (ex:wrong1 \"value1-en\"@de) # 1 error\n"
                    + "    (ex:wrong2 \"value1-de\"@en)  # 1 error\n"
                    + "    (ex:wrong3 \"value NOT\")  # 1 error\n"
                    + "    (ex:wrong4 \"111\"^^xsd:integer)  # 1 error\n"
                    + "  }\n"
                    + "  FILTER( ?obj NOT IN ( \n"
                    + "        \"1\"^^xsd:integer ,\n"
                    + "        \"value1-de\"@de ,\n"
                    + "        \"value1-en\"@en ,        \n"
                    + "        \"value\"^^xsd:string ))\n"
                    + " }";
            TupleQuery tupleQuery = conn.prepareTupleQuery(queryString);
            try (TupleQueryResult result = tupleQuery.evaluate()) {
                int count = 0;
                while (result.hasNext()) { // iterate over the result
                    result.next();
                    count++;
                }

                assertEquals("Should have four solutions", 4, count);
            }
        }
    }
    
    @Test
    public void testFilter_NotEquals_withMixedDatatypes() {
        Repository repo = new SailRepository(new MemoryStore());
        try (RepositoryConnection conn = repo.getConnection()) {
            String queryString = "prefix ex:      <http://example.com/> \n"
                    + "prefix xsd:        <http://www.w3.org/2001/XMLSchema#> \n"
                    + "\n"
                    + "SELECT * WHERE {\n"
                    + "  # Sample correct data for testing\n"
                    + "  VALUES (?sub ?obj) {\n"
                    + "    (ex:wrong1 \"value1-en\"@de) # 1 error\n"
                    + "    (ex:wrong2 \"value1-de\"@en)  # 1 error\n"
                    + "    (ex:wrong3 \"value NOT\")  # 1 error\n"
                    + "    (ex:wrong4 \"111\"^^xsd:integer)  # 1 error\n"
                    + "  }\n"
                    + "  FILTER( (\n"
                    + "        (?obj != \"1\"^^xsd:integer)\n"
                    + "     && (?obj != \"value1-de\"@de)\n"
                    + "     && (?obj != \"value1-en\"@en)        \n"
                    + "     && (?obj != \"value\"^^xsd:string )))\n"
                    + " }";
            TupleQuery tupleQuery = conn.prepareTupleQuery(queryString);
            try (TupleQueryResult result = tupleQuery.evaluate()) {
                int count = 0;
                while (result.hasNext()) { // iterate over the result
                    result.next();
                    count++;
                }

                assertEquals("Should have four solutions", 4, count);
            }
        }
    }
    
    @Test
    public void testFilter_NotIn_withSingleWhitespace() {
        Repository repo = new SailRepository(new MemoryStore());
        try (RepositoryConnection conn = repo.getConnection()) {
            String queryString = ""
                    + "prefix ex:  <http://example.com/> \n"
                    + "prefix xsd: <http://www.w3.org/2001/XMLSchema#> \n"
                    + "\n"
                    + "SELECT * WHERE {\n"
                    + "  # Sample correct data for testing\n"
                    + "  VALUES (?sub ?obj) {\n"
                    + "    (ex:wrong1 \"value1-en\"@de) # 1 error\n"
                    + "  }\n"
                    + "\n"
                    + "  # two whitespace between NOT and IN\n"
                    + "  FILTER( ?obj NOT IN ( \n"
                    + "        \"value1-en\"@en , \n"
                    + "        \"value\"^^xsd:string ))\n"
                    + "}";
            TupleQuery tupleQuery = conn.prepareTupleQuery(queryString);
            try (TupleQueryResult result = tupleQuery.evaluate()) {
                int count = 0;
                while (result.hasNext()) { // iterate over the result
                    result.next();
                    count++;
                }

                assertEquals("Should have one solutions", 1, count);
            }
        }
    }
    
    @Test
    public void testFilter_NotIn_withMultipleWhitespace() {
        Repository repo = new SailRepository(new MemoryStore());
        try (RepositoryConnection conn = repo.getConnection()) {
            // the only differences to the previous test case
            // are additional whitespace between NOT and IN
            String queryString = ""
                    + "prefix ex:  <http://example.com/> \n"
                    + "prefix xsd: <http://www.w3.org/2001/XMLSchema#> \n"
                    + "\n"
                    + "SELECT * WHERE {\n"
                    + "  # Sample correct data for testing\n"
                    + "  VALUES (?sub ?obj) {\n"
                    + "    (ex:wrong1 \"value1-en\"@de) # 1 error\n"
                    + "  }\n"
                    + "\n"
                    + "  # two whitespace between NOT and IN\n"
                    + "  FILTER( ?obj NOT   IN ( \n"
                    + "        \"value1-en\"@en , \n"
                    + "        \"value\"^^xsd:string ))\n"
                    + "}";
            TupleQuery tupleQuery = conn.prepareTupleQuery(queryString);
            try (TupleQueryResult result = tupleQuery.evaluate()) {
                int count = 0;
                while (result.hasNext()) { // iterate over the result
                    result.next();
                    count++;
                }

                assertEquals("Should have one solutions", 1, count);
            }
        }
    }

}

jetztgradnet avatar Jun 01 '22 07:06 jetztgradnet

BTW, this behaves the same in GraphDB (which is based on RDF4J) and Blazegraph, but not in Stardog where all four solutions are returned.

jetztgradnet avatar Jun 01 '22 07:06 jetztgradnet

Thanks for the thorough report @jetztgradnet . Unless someone else picks it up first, I'll try and take a closer look over the weekend.

abrokenjester avatar Jun 01 '22 20:06 abrokenjester

Query execution plan for the original query:

Projection (resultSizeActual=2)
╠══ProjectionElemList
║     ProjectionElem "sub"
║     ProjectionElem "obj"
╚══Filter (resultSizeActual=2)
   ├──And
   │  ╠══Compare (!=)
   │  ║     Var (name=obj)
   │  ║     ValueConstant (value="value")
   │  ╚══Compare (!=)
   │        Var (name=obj)
   │        ValueConstant (value="value1-en"@en)
   └──Filter (resultSizeActual=3)
      ╠══And
      ║  ├──Compare (!=)
      ║  │     Var (name=obj)
      ║  │     ValueConstant (value="value1-de"@de)
      ║  └──Compare (!=)
      ║        Var (name=obj)
      ║        ValueConstant (value="1"^^<http://www.w3.org/2001/XMLSchema#integer>)
      ╚══BindingSetAssignment ([[sub=http://example.com/wrong1;obj="value1-en"@de], [sub=http://example.com/wrong2;obj="value1-de"@en], [sub=http://example.com/wrong3;obj="value NOT"], [sub=http://example.com/wrong4;obj="111"^^<http://www.w3.org/2001/XMLSchema#integer>]]) (resultSizeActual=4)

Looks fairly clear that the filter evaluation is somehow tripping up. I have also doublechecked that it is not the VALUES clause causing the problem: the same result occurs when adding triples to the store and using a basic graph pattern to query.

abrokenjester avatar Jun 04 '22 22:06 abrokenjester

Stepping through it, it fails on comparisons between datatyped and untyped literals:

"org.eclipse.rdf4j.query.algebra.evaluation.ValueExprEvaluationException: Unable to compare strings with other supported types" This happens when comparing "value NOT" with "1"^^xsd:integer.

abrokenjester avatar Jun 04 '22 23:06 abrokenjester

Oh here we go. It's one of those cases where the SPARQL spec become near-impenetrable.

The case we have here is that we are comparing two literals: "value NOT" with "1"^^xsd:integer, using the != comparison. Both are literals, but with incompatible datatypes (string, and integer). In the operator mapping table (see https://www.w3.org/TR/sparql11-query/#OperatorMapping), the only mapping that applies is therefore comparison using RDF term-equality (the last row in the table):

A != B RDF term RDF term fn:not(RDFterm-equal(A, B)) xsd:boolean

RDFterm-equal (see https://www.w3.org/TR/sparql11-query/#func-RDFterm-equal) is defined as follows:

Returns TRUE if term1 and term2 are the same RDF term as defined in Resource Description Framework (RDF): Concepts and Abstract Syntax [CONCEPTS]; produces a type error if the arguments are both literal but are not the same RDF term *; returns FALSE otherwise. term1 and term2 are the same if any of the following is true:

(emphasis mine)

Note that in our case, when applying the specs strictly, RDFterm-equal is supposed to return a type error: we have two literals, but they are not the same RDF term (as they are not equivalent literals as defined in the linked section in RDF Concepts).

So, term-equality under a strict interpretation results in a type error. You'd think that because we negate this that maybe then gets coerced to an actual true value (as in : they are not equal because we get a type error), but you'd be wrong. Going back to section 17.3 (operator mapping), we find this:

[...] Some of the operators are associated with nested function expressions, e.g. fn:not(op:numeric-equal(A, B)). Note that per the XPath definitions, fn:not and op:numeric-equal produce an error if their argument is an error. [...]

In other words the negation of a type error is also a type error.

So, in the strict interpretation of the SPARQL 1.1 spec, RDF4J is actually giving you the correct answer here.

Does that mean Jena and Stardog are wrong? Well no. As per section 17.3.1 (operator extensibility):

SPARQL language extensions may provide additional associations between operators and operator functions; this amounts to adding rows to the table above. No additional operator may yield a result that replaces any result other than a type error in the semantics defined above. The consequence of this rule is that SPARQL FILTERs will produce at least the same intermediate bindings after applying a FILTER as an unextended implementation.

So what Jena and Stardog do is extend the minimal compliance in a way that the spec allows for. We can do a similar thing in RDF4J (in fact we already do for some other cases). It's fine to extend operator behavior as long as the only cases you touch were type errors, previously), but I need to take a look at how to best fit this in: if we should tweak the strict evaluation strategy itself, or if we should consider this another case for the extended evaluation strategy to handle.

abrokenjester avatar Jun 05 '22 00:06 abrokenjester

Thanks, Jeen, for the thorough investigation! My naive expectation here would be that I want to exclude a set of things, but the thing we are looking at is not that thing, so it may be returned after all. Whether the checked term is valid based in its datatype shouldn't matter here IMHO.

But anyway, whatever you consider the right path to follow for RDF4J is ok for me, I will then adjust the unit tests of the application ported from Jena accordingly.

Any suggestion about the error for multiple whitespaces between NOT and IN (i.e. "NOT IN" as opposed to "NOT IN" which comes from a generated query where the NOT is inserted conditionally and is wrapped in whitespace)?

jetztgradnet avatar Jun 05 '22 00:06 jetztgradnet

Thanks, Jeen, for the thorough investigation! My naive expectation here would be that I want to exclude a set of things, but the thing we are looking at is not that thing, so it may be returned after all. Whether the checked term is valid based in its datatype shouldn't matter here IMHO.

Oh I fully agree that that is how it should work. It's just that a minimally-conforming implementation of the SPARQL spec doesn't :) . But minimally-conforming is kind of useless to stick to, beyond corner cases like strict validation and/or ensuring that you're 100% sure your query will work on any compliant SPARQL engine.

But anyway, whatever you consider the right path to follow for RDF4J is ok for me, I will then adjust the unit tests of the application ported from Jena accordingly.

Any suggestion about the error for multiple whitespaces between NOT and IN (i.e. "NOT IN" as opposed to "NOT IN" which comes from a generated query where the NOT is inserted conditionally and is wrapped in whitespace)?

I'll take a look at that. It's a bug (additional whitespace should be ignored), but I'll split it out from this ticket, since it's really a separate issue.

abrokenjester avatar Jun 05 '22 21:06 abrokenjester

I think we may need to reclassify this as an improvement / feature request rather than a bug. I've picked up a related refactoring issue (GH-635 ) that aims to make choosing the mode in which the query engine runs a little easier. Haven't yet decided if I want to finish that first or in parallel just add this feature into the current (somewhat flawed) setup (probably in the ExtendedEvaluationStrategy).

abrokenjester avatar Jun 07 '22 02:06 abrokenjester