SPARQL-star queries perform poorly
Version
4.3.2
What happened?
I evaluated Jena TDB2 4.3.2 with an RDF-star dataset of 9,411,041 triples (4.9 GB). I loaded the dataset into on-disk storage with the tdb2 loader. I then ran a set of fairly complex SPARQL-star queries, and none of them finished. To pin down the issue, I tried a very simple SPARQL-star query that contains only one nested quoted triple pattern:
select * { <<<<?s ?p ?o>> ?a ?b >> ?x ?y. }
However, even this query took about 10 minutes (639 s):
15:47:30 INFO Server :: Apache Jena Fuseki 4.3.2
15:47:30 INFO Config :: FUSEKI_HOME=/opt/apache_jena_fuseki/apache-jena-fuseki-4.3.2
15:47:30 INFO Config :: FUSEKI_BASE=/home/fkovacev/.jena/fuseki-4.3.2
15:47:30 INFO Config :: Shiro file: file:///home/fkovacev/.jena/fuseki-4.3.2/shiro.ini
15:47:31 INFO Config :: Load configuration: file:///home/fkovacev/.jena/fuseki-4.3.2/configuration/bearc_tb_sr_rs.ttl
15:47:31 INFO Server :: Configuration file: /home/fkovacev/.jena/fuseki-4.3.2/config.ttl
15:47:31 INFO Server :: Path = /bearc_tb_sr_rs
15:47:31 INFO Server :: System
15:47:31 INFO Server :: Memory: 4.0 GiB
15:47:31 INFO Server :: Java: 17.0.5
15:47:31 INFO Server :: OS: Linux 5.15.0-58-generic amd64
15:47:31 INFO Server :: PID: 106127
15:47:31 INFO Server :: Started 2023/02/03 15:47:31 CET on port 3030
15:47:52 INFO Fuseki :: [5] POST http://localhost:3030/bearc_tb_sr_rs/sparql
15:47:52 INFO Fuseki :: [5] Query = select * { <<<<?s ?p ?o>> ?a ?b >> ?x ?y. }
15:58:31 INFO Fuseki :: [5] 200 OK (638.810 s)
The memory didn't seem to be the problem.
I tried the same set of queries on GraphDB and they all needed only a few seconds. Is it possible that Jena generally performs poorly with SPARQL-star and even worse if there are multiple nesting levels?
Relevant output and stacktrace
No response
Are you interested in making a pull request?
None
Yes, it is possible.
Jena's current RDF-star support has not changed the on-disk data structures, apart from adding the new RDF term type. This lets people try RDF-star without disrupting their other databases or needing multiple versions of the code on their systems.
The RDF-star Working Group has started, so I'd like to understand: what is the use case for nested quoted triples?
The use case is timestamp-based versioning of RDF datasets using RDF-star and SPARQL-star. As part of my research, I built an API that lets you update RDF triples and issue SPARQL queries by automatically transforming them into RDF-star triples and SPARQL-star queries with timestamps attached. The RDF-star triples look like this:
<< << <http://example.com/s> <http://example.com/p> "o" >> :valid_from "2023-02-06T12:00:00"^^xsd:dateTime >> :valid_until "9999-12-31T12:00:00"^^xsd:dateTime .
Using two nesting levels, I can attach a creation and deletion timestamp.
I also tried the more intuitive and semantically correct approach:
<< <http://example.com/s> <http://example.com/p> "o" >> :valid_from "2023-02-06T12:00:00"^^xsd:dateTime .
<< <http://example.com/s> <http://example.com/p> "o" >> :valid_until "9999-12-31T12:00:00"^^xsd:dateTime .
However, with this approach the datasets are larger because the data triple is repeated, and query performance is worse in both GraphDB 9.3 and Jena TDB2 4.3.2. So I decided to go with nested quoted triples.
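The difference between the two encodings can be sketched with plain string building. This is an illustration only and not the starvers API: the function names are hypothetical, and the `:valid_from` / `:valid_until` prefixed names are taken from the examples above.

```python
def quoted(s: str, p: str, o: str) -> str:
    """Wrap subject, predicate and object as an RDF-star quoted triple."""
    return f"<< {s} {p} {o} >>"


def date_time(timestamp: str) -> str:
    """Format a timestamp as an xsd:dateTime typed literal."""
    return f'"{timestamp}"^^xsd:dateTime'


def nested_annotation(triple, valid_from, valid_until):
    """One statement: the :valid_from statement is itself quoted."""
    inner = quoted(*triple)
    middle = quoted(inner, ":valid_from", date_time(valid_from))
    return [f"{middle} :valid_until {date_time(valid_until)} ."]


def flat_annotation(triple, valid_from, valid_until):
    """Two statements: the quoted data triple is repeated in each."""
    q = quoted(*triple)
    return [
        f"{q} :valid_from {date_time(valid_from)} .",
        f"{q} :valid_until {date_time(valid_until)} .",
    ]


triple = ("<http://example.com/s>", "<http://example.com/p>", '"o"')
nested = nested_annotation(triple, "2023-02-06T12:00:00", "9999-12-31T12:00:00")
flat = flat_annotation(triple, "2023-02-06T12:00:00", "9999-12-31T12:00:00")

for line in nested + flat:
    print(line)
```

The nested form serializes the data triple once per version, while the flat form serializes it twice, which is the redundancy described above.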
More information:
API: https://github.com/GreenfishK/starvers
Evaluation of the RDF-star and timestamp-based versioning approach: https://github.com/GreenfishK/starvers_eval (still ongoing)
Paper submitted to SWJ: http://semantic-web-journal.org/content/starvers-versioning-and-timestamping-rdf-data-means-rdf-approach-based-annotated-triples (major revision to be submitted soon)