
Not all data is correctly indexed/returned

Open • adamretter opened this issue 5 years ago • 1 comment

Using the following RDF/XML data file: http://static.adamretter.org.uk/HHS_Provider_Relief_Fund.rdf.gz

I can't seem to get more than 10 results back when querying it with SPARQL in eXist-db:

xquery version "3.1";

import module namespace sparql = "http://exist-db.org/xquery/sparql";

let $query1 := '
    PREFIX ds:  <https://data.cdc.gov/resource/kh8y-3es6/>
    PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

    SELECT (count(DISTINCT ?state) as ?count)
    WHERE {
        ?provider ds:state ?state
    }
'
return
    sparql:query($query1)

This returns a count of 10:

<sparql xmlns="http://www.w3.org/2005/sparql-results#">
    <head>
        <variable name="count"/>
    </head>
    <results>
        <result>
            <binding name="count">
                <literal datatype="http://www.w3.org/2001/XMLSchema#integer">10</literal>
            </binding>
        </result>
    </results>
</sparql>

However, an XQuery over the same RDF/XML document shows that the result should actually be 55:

count(distinct-values(doc("/db/hhs-provider/hhs-provider.rdf")//*:state/string(.)))

The result from the SPARQL query (10) is wrong; the XQuery result of 55 is correct.
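
One way to narrow this down might be to list which state values the SPARQL side is missing. The following is only a sketch: it assumes sparql:query returns the SPARQL results XML shown above, and that the document path is the same one used in the XQuery check. The `missing` and `state` wrapper elements are made up for illustration.

```xquery
xquery version "3.1";

import module namespace sparql = "http://exist-db.org/xquery/sparql";

declare namespace sr = "http://www.w3.org/2005/sparql-results#";

(: Distinct states as seen through the SPARQL index.
   Assumption: each binding holds a single literal or uri child. :)
let $from-sparql := sparql:query('
    PREFIX ds: <https://data.cdc.gov/resource/kh8y-3es6/>

    SELECT DISTINCT ?state
    WHERE {
        ?provider ds:state ?state
    }
')//sr:binding[@name = "state"]/*/string(.)

(: Distinct states read straight from the stored RDF/XML :)
let $from-xml := distinct-values(
    doc("/db/hhs-provider/hhs-provider.rdf")//*:state/string(.)
)

(: Values present in the XML but absent from the SPARQL results :)
let $missing := $from-xml[not(. = $from-sparql)]
return
    <missing count="{count($missing)}">{
        for $s in $missing
        order by $s
        return <state>{$s}</state>
    }</missing>
```

If the missing values share a pattern (for example, they all come from a later part of the file), that could hint at whether indexing stops early or only some triples are dropped.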

adamretter • Jun 24 '20 22:06

I also tested this directly with TDB from Apache Jena 3.15.0.

I loaded the data:

$ bin/tdbloader --loc=/tmp/tdb /tmp/HHS_Provider_Relief_Fund.rd

...

** Completed: 1,471,085 triples loaded in 18.07 seconds [Rate: 81,396.84 per second]

I created the SPARQL file /tmp/states.sparql:

PREFIX ds:  <https://data.cdc.gov/resource/kh8y-3es6/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

SELECT (count(DISTINCT ?state) as ?count)
WHERE {
   ?provider ds:state ?state
}

I then executed the SPARQL query:

$ bin/tdbquery --loc=/tmp/tdb --file /tmp/states.sparql
---------
| count |
=========
| 55    |
---------

Using TDB directly returns the correct result, so I have to suspect a bug somewhere in the exist-sparql module.

adamretter • Jun 24 '20 22:06