Slow performance on `GRAPH_GROUP`
-
I created this graph group:
DB.DBA.RDF_GRAPH_GROUP_DROP('http://www.batch0.fr/', 0);
DB.DBA.RDF_GRAPH_GROUP_CREATE('http://www.batch0.fr/', 0);
DB.DBA.RDF_GRAPH_USER_PERMS_SET('http://www.batch0.fr/', 'nobody', 9);
DB.DBA.RDF_GRAPH_GROUP_INS('http://www.batch0.fr/', 'http://www.vendor0.fr/');
DB.DBA.RDF_GRAPH_GROUP_INS('http://www.batch0.fr/', 'http://www.vendor1.fr/');
DB.DBA.RDF_GRAPH_GROUP_INS('http://www.batch0.fr/', 'http://www.vendor2.fr/');
DB.DBA.RDF_GRAPH_GROUP_INS('http://www.batch0.fr/', 'http://www.vendor3.fr/');
DB.DBA.RDF_GRAPH_GROUP_INS('http://www.batch0.fr/', 'http://www.vendor4.fr/');
I then executed this query. It takes forever when it should be instantaneous:
SELECT COUNT(*) FROM <http://www.batch0.fr/> WHERE { ?s ?p ?o }
Which Virtuoso version are you using? This works for me when querying from the SPARQL endpoint or isql with the latest develop/7 build.
I use v7.2.12
The output of `virtuoso-t` is:
Version 7.2.12.3239-pthreads as of Feb 13 2024 (d698f21712)
Compiled for Linux (x86_64-alpine-linux-gnu)
Copyright (C) 1998-2024 OpenLink Software
A dump of the database I use (virtuoso.db + virtuoso.ini) is available here: https://drive.google.com/file/d/1lAlzAkr6Vy3BZZGjf59padrTXaffDoNj/view?usp=sharing
In your test case, you had only 4 graphs in the graph group, with no data inserted in any of them, whereas the database provided has 20 graphs in the graph group, with a total of 3M+ triples across all of them.
Graph groups do not scale in Virtuoso Open Source. A query across a graph group is compiled as `SELECT ... G IN ()`, which results in multiple join condition tests. Performing these tests serially on every row is very time-consuming, so the query does not scale. The Virtuoso 8.x Commercial Edition implements a new invisible hash join algorithm, which compiles such queries as a hash IN join that runs in parallel and is thus more performant and scalable.
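To make the `G IN ()` condition concrete, here is a rough sketch of what the graph-group query amounts to when written out by hand. It assumes only the five member graphs from the setup statements above, and it only illustrates the membership test applied to each row. It is not the plan Virtuoso actually generates, and it should not be expected to run any faster than the original query.

```sparql
# Illustrative only: every matched quad has its graph ?g tested
# against the list of graph-group members.
SELECT COUNT(*)
WHERE {
  GRAPH ?g { ?s ?p ?o }
  FILTER (?g IN (<http://www.vendor0.fr/>,
                 <http://www.vendor1.fr/>,
                 <http://www.vendor2.fr/>,
                 <http://www.vendor3.fr/>,
                 <http://www.vendor4.fr/>))
}
```

With 20 member graphs and 3M+ triples, that list comparison is repeated for every row, which is where the time goes.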
Thank you for your insight!
A workaround is to ingest the graph data of each group into its own Virtuoso database and execute the queries accordingly. Will the implementation be ported to Virtuoso Open Source at some point?
There are no plans to port the invisible hash join feature to open source; it is a commercial-only feature.