List of files to parse instead of single input file
Datasets like UniProt come in hundreds of files. It would be nice if these could be read in parallel instead of having to be concatenated into one single file.
I propose a simple new input format for IndexBuilderMain: -F lof (List Of Files).
Each of the listed files can then be read in parallel, in whatever way is best for its format.
I suggest that the List Of Files format is a simple CSV with the following columns.
| Format | Graph IRI | File name |
|---|---|---|
| ntriples, turtle, mmapable turtle | not yet used, but interesting for quad support | a path to the file to parse |
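To make the proposal a bit more concrete, here is a minimal sketch of how such a list could be read line by line. Everything in it is an assumption for illustration: the column order (format, graph IRI, path), the `#` comment lines, and the function name `parse_lof` are hypothetical and not an existing QLever feature. It also assumes that neither the format nor the graph IRI contains a comma.

```shell
# Hypothetical list-of-files reader: one "format,graph_iri,path" entry per line.
# Prints the three fields tab-separated; an empty graph IRI is shown as "-".
parse_lof() {
  local lof="$1" format graph path
  # The "|| [ -n ... ]" part keeps a last line that lacks a trailing newline.
  while IFS=, read -r format graph path || [ -n "$format" ]; do
    # Skip blank lines and comment lines.
    case "$format" in ''|'#'*) continue ;; esac
    printf '%s\t%s\t%s\n' "$format" "${graph:--}" "$path"
  done < "$lof"
}
```

A caller could then loop over the output and start one reader per entry, for example with `parse_lof files.lof | while IFS="$(printf '\t')" read -r format graph path; do ...; done`.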
This would be very helpful!
Yes, we understand and we will do that. In the meantime, let me clarify the following:
1. One can simply use cat, xzcat, or bzcat to pipe several files into the index builder. For example:
   bzcat *.bz2 | IndexBuilderMain -F ttl -f - ...
   That's the beauty of RDF: you can merge different datasets simply by concatenating their lists of triples.
2. For TTL, QLever currently assumes that all the prefix declarations come at the beginning. With several TTL files, each with its own prefix declarations at the beginning, one therefore needs a little more work. For example, assuming that the prefix definitions appear among the first 50 lines of each file:
   ( ls *.ttl.xz | while read TTL; do xzcat ${TTL} | head -50 | grep ^@prefix; done | sort -u && ls *.ttl.xz | while read TTL; do xzcat ${TTL} | grep -v ^@prefix; done ) | IndexBuilderMain -F ttl -f - ...
3. Additional care is needed when the prefix declarations of the different input datasets are incompatible with each other. For the UniProt data this happens for a few datasets with the empty prefix. This is easy to fix by preprocessing the respective datasets. I will post my full workflow for indexing UniProt on a page in the GitHub Wiki for QLever.
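Incompatible prefix declarations can be spotted before building the index. The following is only a sketch: the function name `find_prefix_conflicts` is made up for illustration, and it assumes that each declaration sits on its own line in the form `@prefix name: <iri> .`, with whitespace between the parts. It reads Turtle text on standard input and prints every prefix that is bound to more than one IRI.

```shell
# Print each prefix that is declared with two or more different IRIs.
# Assumes one "@prefix name: <iri> ." declaration per line, whitespace-separated.
find_prefix_conflicts() {
  grep '^@prefix' \
    | awk '{ print $2, $3 }' \
    | sort -u \
    | awk '{ count[$1]++ } END { for (p in count) if (count[p] > 1) print p }' \
    | sort
}
```

With the same 50-line assumption as above, it could be fed like this: `for f in *.ttl.xz; do xzcat "$f" | head -50; done | find_prefix_conflicts`. An empty output means that no conflicting declarations were found in the inspected lines.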
@JervenBolleman Here is my log of a recent build for the complete UniProt data: https://github.com/ad-freiburg/qlever/wiki/Using-QLever-for-UniProt
@hannahbast A minor comment regarding point 1: this only works if no blank node label occurs in more than one file. The scope of a blank node is by definition local to its document, so simply concatenating N-Triples or Turtle files can merge blank nodes that were meant to be distinct, which could indeed lead to unintended data being loaded into QLever.
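One possible workaround, sketched here under stated assumptions, is to give the blank node labels of each file their own tag before concatenating. The function name `rename_bnodes` is hypothetical. The regular expression only rewrites a `_:` that stands at the start of a line or after whitespace, so it leaves prefixed names such as `ex_:b0` alone, but it would still rewrite a matching sequence inside a string literal. It is therefore safest on N-Triples-like input.

```shell
# Add a per-file tag to every blank node label read from stdin, so that
# "_:b0" from one file and "_:b0" from another stay distinct after merging.
# Caveat: a "_:" after whitespace inside a string literal is rewritten too.
rename_bnodes() {
  local tag="$1"
  sed -E "s/(^|[[:space:]])_:/\1_:${tag}_/g"
}
```

In the pipeline from point 1 this could be used as `n=0; for f in *.bz2; do n=$((n+1)); bzcat "$f" | rename_bnodes "f$n"; done | IndexBuilderMain -F ttl -f - ...`, so that the labels of the first file become `_:f1_...`, those of the second `_:f2_...`, and so on.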