jena Lucene index is updated incorrectly during some dataset changes

Version

4.9.0

What happened?

Hi folks!

I'd like to offer text search on one of my Fuseki read-write datasets and noticed some irregularities when changing triples in the underlying dataset.

Setup:

Fuseki Standalone Server
config: config-text-tdb2.ttl
start Fuseki: ./fuseki-server --config=config-text-tdb2.ttl
data: dfgfo.ttl

Query:

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX text: <http://jena.apache.org/text#>

SELECT ?uri ?score
WHERE {
  (?uri ?score) text:query (rdfs:label 'ancient history') .
}
ORDER BY DESC(?score)

1. Add dfgfo.ttl to the dataset via "add data" in the UI (POST http://127.0.0.1:3030/dataset/data); we have 1126 triples
2. Execute query: 14 results
3. Repeat 1.
4. Execute query: 14 results (OK!)
5. Edit the dataset in the UI and save it without changes (PUT http://127.0.0.1:3030/dataset/data?graph=default); we still have 1126 triples
6. Execute query: 28 results (14 duplicates)
7. Send DROP ALL to /dataset/update; 0 triples now
8. Execute query: 28 results
9. Repeat 1.; we have 1126 triples again
10. Execute query: 42 results (14+14 duplicates)
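
For anyone who wants to replay the steps above without clicking through the UI, here is a minimal sketch using the JDK's built-in HTTP client. It is not taken from Jena or Fuseki: the class `TextIndexRepro` and its methods are hypothetical helpers, the query endpoint name `/dataset/query` is an assumption (it depends on the service configuration), and counting rows of a CSV response is just one convenient way to get a result count.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper class, not part of Jena or Fuseki.
class TextIndexRepro {
    // Dataset URL as used in the report.
    static final String BASE = "http://127.0.0.1:3030/dataset";

    static final String QUERY = String.join("\n",
        "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>",
        "PREFIX text: <http://jena.apache.org/text#>",
        "SELECT ?uri ?score",
        "WHERE { (?uri ?score) text:query (rdfs:label 'ancient history') . }",
        "ORDER BY DESC(?score)");

    // GSP POST: add triples (the "add data" action).
    static HttpRequest postData(String turtle) {
        return HttpRequest.newBuilder(URI.create(BASE + "/data"))
            .header("Content-Type", "text/turtle")
            .POST(HttpRequest.BodyPublishers.ofString(turtle))
            .build();
    }

    // GSP PUT: replace the default graph (the "edit and save" action).
    static HttpRequest putData(String turtle) {
        return HttpRequest.newBuilder(URI.create(BASE + "/data?graph=default"))
            .header("Content-Type", "text/turtle")
            .PUT(HttpRequest.BodyPublishers.ofString(turtle))
            .build();
    }

    // SPARQL Update, e.g. DROP ALL.
    static HttpRequest update(String sparqlUpdate) {
        return HttpRequest.newBuilder(URI.create(BASE + "/update"))
            .header("Content-Type", "application/sparql-update")
            .POST(HttpRequest.BodyPublishers.ofString(sparqlUpdate))
            .build();
    }

    // SPARQL query; the "/query" endpoint name is an assumption.
    static HttpRequest query(String sparql) {
        return HttpRequest.newBuilder(URI.create(BASE + "/query"))
            .header("Content-Type", "application/sparql-query")
            .header("Accept", "text/csv")
            .POST(HttpRequest.BodyPublishers.ofString(sparql))
            .build();
    }

    // Number of result rows in a CSV response: non-blank lines minus the header.
    static int countCsvRows(String csv) {
        long lines = csv.lines().filter(l -> !l.isBlank()).count();
        return (int) Math.max(0, lines - 1);
    }

    static int countResults(HttpClient client) throws Exception {
        HttpResponse<String> r =
            client.send(query(QUERY), HttpResponse.BodyHandlers.ofString());
        return countCsvRows(r.body());
    }

    static void run(HttpClient client, HttpRequest req) throws Exception {
        client.send(req, HttpResponse.BodyHandlers.discarding());
    }

    // Needs a running Fuseki and dfgfo.ttl in the working directory.
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        String turtle = Files.readString(Path.of("dfgfo.ttl"));

        run(client, postData(turtle));
        System.out.println("after POST:        " + countResults(client));
        run(client, postData(turtle));
        System.out.println("after second POST: " + countResults(client));
        run(client, putData(turtle));
        System.out.println("after PUT:         " + countResults(client));
        run(client, update("DROP ALL"));
        System.out.println("after DROP ALL:    " + countResults(client));
        run(client, postData(turtle));
        System.out.println("after final POST:  " + countResults(client));
    }
}
```

If the text index tracked the dataset correctly, the count after DROP ALL should presumably be 0, and the counts after PUT and after the final POST should match the first one.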

Relevant output and stacktrace

No response

Are you interested in making a pull request?

None

Aug 04 '23 17:08 flange-ipb

Hi @flange-ipb,

Thank you for the clear description and the details.

Is it only operations that completely clear the default graph? If you delete a few triples that are indexed, does the number of text index results remain the same?

Aug 05 '23 08:08 afs

Hello @afs,

the problem seems to be only with operations that clear the graph. If selected triples are deleted, then the index is updated.

A more careful analysis: (Note: Between each of the tests I stopped Fuseki and deleted the TDB2 dataset, the Lucene index and the run directories.)

Graph Store HTTP Protocol

POST http://127.0.0.1:3030/dataset/data: Looks good.
- POST data to an empty dataset
- POST the same data to an existing dataset
- POST new triples to an existing dataset
- POST new and existing triples to an existing dataset
- POST empty turtle file to an existing dataset
PUT http://127.0.0.1:3030/dataset/data:
- PUT data on an empty dataset: Looks good.
- PUT the same data to an existing dataset: Not good. Adds the "new" triples to the text index and doesn't remove the "old" ones.
- PUT new triples to an existing dataset: Not good. Adds the new triples to the text index, but doesn't remove the old ones.
- PUT new and existing triples to an existing dataset: Not good. Adds the new and the overwritten triples to the text index.
- PUT empty turtle file to an existing dataset: Not good. Doesn't remove the triples from the text index.
DELETE http://127.0.0.1:3030/dataset/data?default: Not good. Doesn't remove the triples from the text index.

SPARQL Update (POST http://127.0.0.1:3030/dataset/update)

INSERT: Looks good.
- INSERT data into an empty dataset (PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#> PREFIX dfgfo: <https://github.com/tibonto/dfgfo/> INSERT DATA { dfgfo:101-03 rdfs:label "Ancient History"@en . })
- INSERT the same data into an existing dataset (same SPARQL Update query as before)
- INSERT new triples into an existing dataset (PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#> PREFIX dfgfo: <https://github.com/tibonto/dfgfo/> INSERT DATA { dfgfo:102-01 rdfs:label "Medieval History"@en . })
- INSERT new and existing triples into an existing dataset (PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#> PREFIX dfgfo: <https://github.com/tibonto/dfgfo/> INSERT DATA { dfgfo:101-03 rdfs:label "Ancient History"@en . dfgfo:102-01 rdfs:label "Medieval History"@en . })
DELETE:
- I prepared the dataset by loading dfgfo.ttl, then executed PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#> PREFIX dfgfo: <https://github.com/tibonto/dfgfo/> DELETE DATA { dfgfo:101-03 rdfs:label "Ancient History"@en . }: Looks good. The triple is removed from the text index.
LOAD (LOAD <file:///path/to/file.ttl>): This repeats what I did with GSP POST and PUT. Looks good.
- LOAD data into an empty dataset
- LOAD the same data into an existing dataset
- LOAD new triples into an existing dataset
- LOAD new and existing triples into an existing dataset
- LOAD empty turtle file into an existing dataset
CLEAR:
- CLEAR default: Not good. Same behaviour as GSP DELETE.
DROP:
- DROP default: Not good. Same behaviour as GSP DELETE.
COPY: I added <#entMap> text:graphField "graph" . to config-text-tdb2.ttl, loaded some data into a named graph and then copied it into the default graph (COPY <http://example.org/dfgfo> TO DEFAULT). This emulates what I did with GSP POST and PUT.
- COPY into empty default graph: Looks good.
- COPY the same data into non-empty default graph: Not good. Adds the "new" triples to the text index and doesn't remove the "old" ones (like GSP PUT).
- COPY new triples into non-empty default graph: Not good. Adds the "new" triples to the text index and doesn't remove the "old" ones.
- COPY new and existing triples into non-empty default graph: Not good. Adds the new and the overwritten triples to the text index and doesn't remove the "old" ones.
- COPY data from empty named graph into non-empty default graph: Impossible, because Jena doesn't record empty named graphs.
- COPY data from empty default graph into non-empty named graph (COPY DEFAULT TO <http://example.org/dfgfo>), then run SPARQL SELECT to check the text index on named graph: Looks good. Empty named graph does not exist and Lucene index is empty (I checked the Lucene directory.)
MOVE: ...
ADD: ...

Alright, I'm about to lose my mind with manual testing. If necessary, I can contribute some unit tests. I just need some help to get started: let's say a TDB2 dataset (can you run GSP operations on it programmatically?) with a Lucene index, or an embedded Fuseki with the right dataset configuration.

Aug 07 '23 11:08 flange-ipb

Hi @flange-ipb - that's very useful. I have an idea of which code path is involved.

By the way: COPY does leave the original triples behind.

Aug 07 '23 15:08 afs

Looks like clear from DatasetGraphWrapper isn't overridden in its extending class DatasetGraphTextMonitor, right?

Aug 10 '23 07:08 LorenzBuehmann

Yes - it doesn't loop back to DatasetGraphTextMonitor.deleteAny (this is a general consequence of wrappers).
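
To illustrate the wrapper point in isolation, here is a toy model in plain Java. It is only a sketch and not Jena's code: `TripleStore`, `BaseStore`, `StoreWrapper`, `IndexingMonitor` and `FixedIndexingMonitor` are hypothetical names, triples are plain strings, and a `Set` stands in for the Lucene index. The monitor overrides `delete`, but the inherited `clear` is forwarded straight to the wrapped store, so the monitor's `delete` is never called and the index keeps its entries.

```java
import java.util.HashSet;
import java.util.Set;

// Toy model only; none of these types exist in Jena.
interface TripleStore {
    void add(String triple);
    void delete(String triple);
    void clear();
    int size();
}

class BaseStore implements TripleStore {
    private final Set<String> triples = new HashSet<>();
    public void add(String t) { triples.add(t); }
    public void delete(String t) { triples.remove(t); }
    public void clear() { triples.clear(); }
    public int size() { return triples.size(); }
}

// Generic wrapper: every call is forwarded to the wrapped store.
class StoreWrapper implements TripleStore {
    protected final TripleStore inner;
    StoreWrapper(TripleStore inner) { this.inner = inner; }
    public void add(String t) { inner.add(t); }
    public void delete(String t) { inner.delete(t); }
    public void clear() { inner.clear(); }
    public int size() { return inner.size(); }
}

// Keeps a side index in step with add/delete, but does not override clear().
class IndexingMonitor extends StoreWrapper {
    final Set<String> index = new HashSet<>();
    IndexingMonitor(TripleStore inner) { super(inner); }
    @Override public void add(String t) { super.add(t); index.add(t); }
    @Override public void delete(String t) { super.delete(t); index.remove(t); }
    // clear() is inherited: it calls inner.clear() and never touches the index.
}

// One possible fix in this toy: intercept clear() as well.
class FixedIndexingMonitor extends IndexingMonitor {
    FixedIndexingMonitor(TripleStore inner) { super(inner); }
    @Override public void clear() { super.clear(); index.clear(); }
}

class WrapperDemo {
    public static void main(String[] args) {
        IndexingMonitor broken = new IndexingMonitor(new BaseStore());
        broken.add("s p o");
        broken.clear();
        // Store is empty, index still holds the entry.
        System.out.println("broken: store=" + broken.size()
            + " index=" + broken.index.size());

        IndexingMonitor fixed = new FixedIndexingMonitor(new BaseStore());
        fixed.add("s p o");
        fixed.clear();
        System.out.println("fixed:  store=" + fixed.size()
            + " index=" + fixed.index.size());
    }
}
```

This only shows the general pattern. Which operations need intercepting in Jena depends on the actual call paths, which may not all go through the dataset-level clear.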

Aug 10 '23 11:08 afs

Yes

Yes, but.

DROP and CLEAR default operate on graphs, not the whole dataset. They use GraphView.clear (default graph) or DatasetGraph.removeGraph (drop named graph).

They don't go via DatasetGraphText.clear although that does need intercepting.

Aug 10 '23 17:08 afs

Not as simple!

It seems that the graph management operations (e.g. CLEAR) don't work but DELETE DATA and DELETE WHERE do.

A CLEAR does affect the dataset. It's the index that isn't changing.

Aug 17 '23 12:08 afs