Caching system
The cache is wiped at every query execution, as it is stored as a hash map within the execution context. Therefore, the cache is effective only when the same resource is queried multiple times within the same query (e.g. multiple SERVICE clauses having the same resource as location, the same properties, and the same sub-operation). This makes me question the usefulness of the cache.
An alternative could be decoupling the caching from the execution context, and initialising the cache once at the startup of the system.
> This makes me question the usefulness of the cache.
In the case of nested SERVICE clauses, this is sufficient to avoid re-engineering the same file multiple times: the RDF is kept in memory and subsequent queries are performed on the cache.
The cache is indeed per-execution; this makes it less useful in a server setting (the Fuseki runnable).
In the case of the CLI, there is one execution only (except when queries are parameterised).
We need to verify what happens with parameterised queries: in that case, there are multiple executions within the same runtime, and we should check whether the cache is carried over or wiped.
I forgot to mention the PySPARQL-Anything setting, where multiple executions are performed within the same runtime (and probably the same execution context -- but this should be verified).
Each query has its own execution context, so I think this makes caching useful only in the case of nested SERVICE clauses. However, the nested clauses must have the same sub-operation and the same configuration options, which greatly reduces its applicability.
My fear is that the system pays the cost of caching a lot of DatasetGraphs that will be used very rarely.
> My fear is that the system pays the cost of caching a lot of DatasetGraphs that will be used very rarely.
Indeed.
We may disable caching by default.
> We may disable caching by default.
Let's first add the option to disable it so that we can experiment with the effects.
The nested query use case is quite common for me -- I use it to speed up joins between large sources (e.g. two large CSVs); without the cache, a large CSV will be re-read and re-triplified from the file system for each evaluation of the nested SERVICE clause (see the sketch below).
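For instance, a hedged sketch of such a join (the file names and column positions are hypothetical, not from an actual example):

```sparql
PREFIX fx:  <http://sparql.xyz/facade-x/ns/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
SELECT ?orderId ?customerName WHERE {
  SERVICE <x-sparql-anything:location=./orders.csv> {
    [] rdf:type fx:root ;
       fx:anySlot ?order .
    ?order rdf:_1 ?orderId ;
           rdf:_2 ?customerId .
    # The nested SERVICE is re-evaluated for each solution of the
    # enclosing clause; without the cache, customers.csv would be
    # re-read and re-triplified from the file system every time.
    SERVICE <x-sparql-anything:location=./customers.csv> {
      [] fx:anySlot ?customer .
      ?customer rdf:_1 ?customerId ;
                rdf:_2 ?customerName .
    }
  }
}
```

With the cache, customers.csv is triplified once and each subsequent evaluation of the inner clause hits the in-memory DatasetGraph.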
Agreed.
I'm struggling to find a good example of cache usage.
I've created a spreadsheet using the =NOW() formula, which returns the "serial number" (as it is called in the Office documentation) of the datetime at which it is evaluated.
If the triplification of the spreadsheet is cached, then the result of the formula is always the same even if the file is transformed multiple times.
Then, I drafted this query:
```sparql
PREFIX fx: <http://sparql.xyz/facade-x/ns/>
PREFIX xyz: <http://sparql.xyz/facade-x/data/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT * WHERE {
  SERVICE <x-sparql-anything:spreadsheet.evaluate-formulas=true> {
    fx:properties fx:location "%%%LOCATION%%%" .
    [] rdf:type fx:root ;
       fx:anySlot ?row .
    ?row rdf:_1 ?n .
    ?row rdf:_2 ?now .
    SERVICE <x-sparql-anything:> {
      fx:properties fx:content "[1.0,2.0,3.0]" .
      fx:properties fx:media-type "application/json" .
      ?s fx:anySlot ?n .
    }
  }
}
```
(%%%LOCATION%%% is substituted with the file path of the spreadsheet at runtime.)
However, this query transforms the spreadsheet just once (damn nested queries!).
After a discussion with @enridaga, we agreed on rethinking the caching system. In particular, the key used for storing and retrieving cached data must be redesigned. At the moment, the key is the concatenation of the options used for the triplification with a string representation of the operation (e.g. the algebra of the SERVICE clause).
While the cache key must depend on the properties, using the whole operation seems too restrictive. An idea could be to extract and verbalise (turn into strings) only the triple patterns within the operation, as they affect the triplification when triple filtering is enabled.
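To make the restriction concrete, here is a hedged sketch (the file ./data.csv and the patterns are hypothetical). With triple filtering disabled (strategy=0), both SERVICE clauses below produce exactly the same DatasetGraph from the same source with the same options; yet, because the key serialises the whole sub-operation, the two clauses get different keys and the CSV is triplified twice:

```sparql
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
SELECT * WHERE {
  # Key = options + serialisation of this sub-operation.
  SERVICE <x-sparql-anything:location=./data.csv,strategy=0> {
    ?row rdf:_1 ?id .
  }
  # Same location, same options, but a different sub-operation:
  # the key differs, so the CSV is read and triplified again,
  # although the resulting DatasetGraph is identical.
  SERVICE <x-sparql-anything:location=./data.csv,strategy=0> {
    ?row2 rdf:_2 ?name .
  }
}
```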
Quick analysis of the general options and their influence on caching:
Property | Note | Part of cache key? |
---|---|---|
location* | The URL of the data source. | Yes |
content* | The content to be transformed. | Yes |
command* | An external command line to be executed. The output is handled according to the option 'media-type' | Yes |
from-archive | The filename of the resource to be triplified within an archive. | Yes |
root | The IRI of the generated root resource. | Yes |
media-type | The media-type of the data source. | Yes (different formats, different triples) |
namespace | The namespace prefix for the properties that will be generated. | Yes |
blank-nodes | It tells SPARQL Anything to generate blank nodes or not. | Yes |
trim-strings | Trim all string literals. | Yes |
null-string | Do not produce triples where the specified string would be in the object position of the triple. | Yes |
http.* | A set of options for customising the HTTP request method, headers, query string, and others. | Yes? |
triplifier | It forces SPARQL Anything to use a specific triplifier for transforming the data source. | Yes? |
charset | The charset of the data source. | Yes? |
metadata | It tells SPARQL Anything to extract metadata from the data source and to store it in the named graph with URI http://sparql.xyz/facade-x/data/metadata | Yes |
ondisk | It tells SPARQL Anything to use an on disk graph (instead of the default in memory graph). The string should be a path to a directory where the on disk graph will be stored. Using an on disk graph is almost always slower (than using the default in memory graph) but with it you can triplify large files without running out of memory. | I don't know |
ondisk.reuse | When using an on disk graph, it tells SPARQL Anything to reuse the previous on disk graph. | I don't know |
strategy | The execution strategy. 0 = in memory, all triples; 1 = in memory, only triples matching any of the triple patterns in the where clause | Yes |
slice | The resource is sliced and the SPARQL query is executed on each of the parts. Supported by: CSV (row by row); JSON (arrays are sliced by item; objects require json.path); XML (requires xml.path) | Yes (maybe incompatible with caching?) |
use-rdfs-member | It tells SPARQL Anything to use the (super)property rdfs:member instead of container membership properties (rdf:_1, rdf:_2 ...) | Yes |
So probably all the options should be considered, including the format-specific ones.
> Including the format-specific ones
I don't know; maybe we look at each of them and decide. I think the main issue at the moment is that BGPs are bringing the outer context in. Also, the cache should remain valid when the queried BGP is more restrictive than the cached one (considering the triple filtering); see the sketch below.
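For example, a hedged sketch of that case (hypothetical file and patterns): with triple filtering enabled (strategy=1), suppose the first clause below is executed and its filtered graph cached. The second clause uses a subset of the cached patterns, so every triple it needs is already in the cached graph, and it could be answered from the cache even though its operation string differs:

```sparql
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
SELECT * WHERE {
  # With triple filtering, only triples matching these two patterns
  # are materialised; suppose this graph is cached first.
  SERVICE <x-sparql-anything:location=./data.csv,strategy=1> {
    ?row rdf:_1 ?id .
    ?row rdf:_2 ?name .
  }
  # This BGP is more restrictive: its single pattern is one of the
  # two above, so the cached graph already contains every triple it
  # needs -- yet the current whole-operation key would miss it.
  SERVICE <x-sparql-anything:location=./data.csv,strategy=1> {
    ?row2 rdf:_1 ?id2 .
  }
}
```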