Caching system
The cache is wiped at every query execution, as it is stored as a hash map within the execution context. Therefore, the cache is effective only when the same resource is queried multiple times within the same query (e.g. multiple SERVICE clauses having the same resource as location, the same properties, and the same sub-operation). This makes me question the usefulness of the cache.
An alternative could be decoupling the caching from the execution context, and initialising the cache once at the startup of the system.
> This makes me question the usefulness of the cache.
In the case of nested SERVICE clauses, this is sufficient to avoid re-engineering the same file multiple times: the RDF is kept in memory and subsequent queries are performed on the cache.
The cache is indeed per-execution; this makes it less useful in a server setting (the Fuseki runnable).
In the case of the CLI, there is one execution only (except when queries are parameterised).
We need to verify what happens with parameterised queries: in that case, there are multiple executions within the same runtime, and we should check whether the cache is carried over or wiped.
I forgot to mention the PySPARQL-Anything setting, where multiple executions are performed within the same runtime (and probably the same execution context -- but this should be verified).
Each query has its own execution context, so I think this makes caching useful only in the case of nested SERVICE clauses. However, the nested clauses must have the same sub-operation and the same configuration options, which greatly reduces its applicability.
My fear is that the system pays the cost of caching a lot of DatasetGraphs that will be used very rarely.
> My fear is that the system pays the cost of caching a lot of DatasetGraphs that will be used very rarely.
Indeed.
We may disable caching by default.
> We may disable caching by default.
Let's first add the option to disable it so that we can experiment with the effects.
The nested query use case is quite common for me -- I use it to speed up joins between large sources (e.g. two large CSVs); without the cache, a large CSV will be re-read and re-triplified from the file system for each evaluation of the nested SERVICE clause (see the sketch below).
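For instance, a hedged sketch of such a join (the file names and column positions are hypothetical, not from an actual example):

```sparql
PREFIX fx:  <http://sparql.xyz/facade-x/ns/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
SELECT ?orderId ?customerName WHERE {
  SERVICE <x-sparql-anything:location=./orders.csv> {
    [] rdf:type fx:root ;
       fx:anySlot ?order .
    ?order rdf:_1 ?orderId ;
           rdf:_2 ?customerId .
    # The nested SERVICE is re-evaluated for each solution of the
    # enclosing clause; without the cache, customers.csv would be
    # re-read and re-triplified from the file system every time.
    SERVICE <x-sparql-anything:location=./customers.csv> {
      [] fx:anySlot ?customer .
      ?customer rdf:_1 ?customerId ;
                rdf:_2 ?customerName .
    }
  }
}
```

With the cache, customers.csv is triplified once and each subsequent evaluation of the inner clause hits the in-memory DatasetGraph.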
Agreed.
I'm struggling to find a good example of cache usage.
I've created a spreadsheet using the =NOW() formula, which returns the "serial number" (as it is called in the Office documentation) of the datetime at which it is evaluated.
If the triplification of the spreadsheet is cached, then the result of the formula is always the same even if the file is transformed multiple times.
Then, I drafted this query:
```sparql
PREFIX fx: <http://sparql.xyz/facade-x/ns/>
PREFIX xyz: <http://sparql.xyz/facade-x/data/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT * WHERE {
  SERVICE <x-sparql-anything:spreadsheet.evaluate-formulas=true> {
    fx:properties fx:location "%%%LOCATION%%%" .
    [] rdf:type fx:root ;
       fx:anySlot ?row .
    ?row rdf:_1 ?n .
    ?row rdf:_2 ?now .
    SERVICE <x-sparql-anything:> {
      fx:properties fx:content "[1.0,2.0,3.0]" .
      fx:properties fx:media-type "application/json" .
      ?s fx:anySlot ?n .
    }
  }
}
```
(%%%LOCATION%%% is substituted with the file path of the spreadsheet at runtime.)
However, this query transforms the spreadsheet just once (damn nested queries!).
After a discussion with @enridaga, we agreed on rethinking the caching system. In particular, the key used for storing and retrieving cached data must be redesigned. At the moment, the key is the concatenation of the options used for the triplification with a string representation of the operation (e.g. the algebra of the SERVICE clause).
While the cache key must depend on the properties, using the whole operation seems too restrictive. An idea could be to extract and verbalise (turn into strings) only the triple patterns within the operation, as they affect the triplification when triple filtering is enabled.
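To make the restriction concrete, here is a hedged sketch (the file ./data.csv and the patterns are hypothetical). With triple filtering disabled (strategy=0), both SERVICE clauses below produce exactly the same DatasetGraph from the same source with the same options; yet, because the key serialises the whole sub-operation, the two clauses get different keys and the CSV is triplified twice:

```sparql
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
SELECT * WHERE {
  # Key = options + serialisation of this sub-operation.
  SERVICE <x-sparql-anything:location=./data.csv,strategy=0> {
    ?row rdf:_1 ?id .
  }
  # Same location, same options, but a different sub-operation:
  # the key differs, so the CSV is read and triplified again,
  # although the resulting DatasetGraph is identical.
  SERVICE <x-sparql-anything:location=./data.csv,strategy=0> {
    ?row2 rdf:_2 ?name .
  }
}
```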
Quick analysis of the general options and their influence on caching:
Property | Note | Part of cache key? |
---|---|---|
location* | The URL of the data source. | Yes |
content* | The content to be transformed. | Yes |
command* | An external command line to be executed. The output is handled according to the option 'media-type' | Yes |
from-archive | The filename of the resource to be triplified within an archive. | Yes |
root | The IRI of the generated root resource. | Yes |
media-type | The media-type of the data source. | Yes (different formats, different triples) |
namespace | The namespace prefix for the properties that will be generated. | Yes |
blank-nodes | It tells SPARQL Anything to generate blank nodes or not. | Yes |
trim-strings | Trim all string literals. | Yes |
null-string | Do not produce triples where the specified string would be in the object position of the triple. | Yes |
http.* | A set of options for customising the HTTP request method, headers, query string, and others. | Yes? |
triplifier | It forces SPARQL Anything to use a specific triplifier for transforming the data source. | Yes? |
charset | The charset of the data source. | Yes? |
metadata | It tells SPARQL Anything to extract metadata from the data source and to store it in the named graph with URI http://sparql.xyz/facade-x/data/metadata | Yes |
ondisk | It tells SPARQL Anything to use an on disk graph (instead of the default in memory graph). The string should be a path to a directory where the on disk graph will be stored. Using an on disk graph is almost always slower (than using the default in memory graph) but with it you can triplify large files without running out of memory. | I don't know |
ondisk.reuse | When using an on disk graph, it tells SPARQL Anything to reuse the previous on disk graph. | I don't know |
strategy | The execution strategy. 0 = in memory, all triples; 1 = in memory, only triples matching any of the triple patterns in the where clause | Yes |
slice | The resource is sliced and the SPARQL query is executed on each of the parts. Supported by: CSV (row by row); JSON (arrays are sliced by item; objects require json.path); XML (requires xml.path) | Yes (maybe incompatible with caching?) |
use-rdfs-member | It tells SPARQL Anything to use the (super)property rdfs:member instead of container membership properties (rdf:_1, rdf:_2 ...) | Yes |
So probably all the options should be considered, including the format-specific ones.
> Including the format-specific ones
I don't know; maybe we look at each of them and decide. I think the main issue at the moment is that BGPs are bringing the outer context in. Also, the cache should remain valid when the queried BGP is more restrictive than the cached one (considering the triple filtering); see the sketch below.
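For example, a hedged sketch of that case (hypothetical file and patterns): with triple filtering enabled (strategy=1), suppose the first clause below is executed and its filtered graph cached. The second clause uses a subset of the cached patterns, so every triple it needs is already in the cached graph, and it could be answered from the cache even though its operation string differs:

```sparql
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
SELECT * WHERE {
  # With triple filtering, only triples matching these two patterns
  # are materialised; suppose this graph is cached first.
  SERVICE <x-sparql-anything:location=./data.csv,strategy=1> {
    ?row rdf:_1 ?id .
    ?row rdf:_2 ?name .
  }
  # This BGP is more restrictive: its single pattern is one of the
  # two above, so the cached graph already contains every triple it
  # needs -- yet the current whole-operation key would miss it.
  SERVICE <x-sparql-anything:location=./data.csv,strategy=1> {
    ?row2 rdf:_1 ?id2 .
  }
}
```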