Deletion of large number of graphs
In my Virtuoso database instance I have around 30k graphs that I want to delete. I've search for an efficient solution but didn't find any so far. Trying to delete so many graphs leads to big performance problems. It would take less time to restart the database from scratch and re-ingest all the 30k graphs.
I tried with different logics like:
DELETE FROM rdf_quad WHERE g IN (iri_to_id('graph1', 'graph2')
SPARQL CLEAR GRAPH <graph1>;
SPARQL CLEAR GRAPH <graph2>;
In my opinion, the main issue is that when some graphs are removed from the rdf_quad then a re-indexing is triggered and this takes the majority of the computational time. I tried to remove the indexes associated with the table rdf_quad but this was creating inconsistencies.
The best approach would be to first remove all the graphs from the rdf_quad table (even in batches it should work), then at the end of the removal we trigger the re-indexing. By doing so we can reduce the overhead be the re-indexing and only execute it the least amount of times possible. It's fine if the database is not accessible during the deletion.
Do you have any suggestion on how this could be achieved? Or other any recommendations on how to remove a large number of graphs from the database? Thank you.
How many triples are in these 30K graphs ?
Are you deleting the graphs with transaction log mode to set to autocommit on, i.e., with log_enable(3,1) as detailed in this document on How can I delete graphs containing large numbers of triples from the Virtuoso Quad Store?
Are you ultimately seeking to delete and reload these 30K graphs, as you indicate above It would take less time to restart the database from scratch and re-ingest all the 30k graphs? If so, note that Virtuoso supports a Delta-aware bulk loading of datasets into Virtuoso option, if the datasets are in nquad format for optimal loading of such data.
That was an example, the number and size of the graphs can vary, but I would say that it is possible to have in total up to billions of triples.
Currently, we are completely disabling it with log_enable(2) and then reactivating it after the load, log_enable(1). Setting log_enable(3,1) will not slow down the process?
Yes, the goal is that there might be some graphs that needs to be removed and then reloaded into the database. The type of files loaded might be turtle, trig, or nquads.
Thank you.
As detailed in the log_enable documentation:
log_enable(2) — turns off transaction logging and turns on auto-commit mode
log_enable(3) — turns on transaction logging and turns on auto-commit mode
So, log_enable(2) would be faster as it does not write to the transaction log file, but then in the event of failure for some reason, the operations performed are lost, which is not the case with log_enable(3) as the transaction log is will replay the operations on database startup.
The key is to not be in manual commit mode which can quickly deplete/consume memory when deleting large graphs.