
List of files to parse instead of single input file

Open JervenBolleman opened this issue 2 years ago • 4 comments

Datasets like UniProt come in hundreds of files. It would be nice if these could be opened in parallel instead of needing to be catted into one single file.

I propose a simple new input format for IndexBuilderMain: -F lof (List Of Files). Each of the listed files can then be read in parallel in the best way possible. I suggest that the List Of Files format be a simple CSV.

| Format | graph iri | file name |
| --- | --- | --- |
| ntriples,turtle,mmapable turtle | not yet used but interesting for quad support | a path to the file to parse |
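To make the proposal concrete, here is a minimal sketch of how such a list could be consumed with today's tooling. It is not an existing QLever feature: the function name `emit_listed_files`, the column order `format,graph,path`, and the optional header line are assumptions made for illustration. The sketch reads the files one after another and writes them to stdout, so it emulates the list format but not the parallel reading requested above.

```shell
# Hypothetical list-of-files CSV with the columns: format,graph,path
# (the graph column may be left empty, since it is not yet used).
# emit_listed_files writes the content of every listed file to stdout.
emit_listed_files() {
  while IFS=, read -r format graph path || [ -n "$format" ]; do
    # Skip blank lines and an optional header line.
    [ -z "$format" ] && continue
    [ "$format" = "format" ] && continue
    # Pick a decompressor based on the file extension.
    case "$path" in
      *.xz)  xzcat "$path" ;;
      *.bz2) bzcat "$path" ;;
      *)     cat "$path" ;;
    esac
  done < "$1"
}
```

The output could then be piped into the index builder, for example `emit_listed_files files.csv | IndexBuilderMain -F ttl -f - ...`, where `files.csv` is a placeholder name. Turtle files with their own prefix declarations would still need the prefix handling discussed further down in this thread.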

JervenBolleman avatar Apr 04 '22 13:04 JervenBolleman

This would be very helpful!

bilalshaikh42 avatar Apr 21 '22 18:04 bilalshaikh42

Yes, we understand and we will do that. In the meantime, let me clarify the following:

  1. One can simply use cat, xzcat, or bzcat to pipe several files into the index builder. For example: bzcat *.bz2 | IndexBuilderMain -F ttl -f - ... That's the beauty of RDF: you can merge different datasets simply by concatenating their lists of triples.

  2. For TTL, QLever currently assumes that all the prefix declarations come at the beginning. With several TTL files, each with its own prefix declarations at the beginning, one therefore needs a little more work: first emit the deduplicated prefix declarations of all files, then the contents of all files without their prefix lines. For example, assuming that the prefix definitions appear among the first 50 lines of each file: ( ls *.ttl.xz | while read TTL; do xzcat ${TTL} | head -50 | grep ^@prefix; done | sort -u && ls *.ttl.xz | while read TTL; do xzcat ${TTL} | grep -v ^@prefix; done ) | IndexBuilderMain -F ttl -f - ...

  3. Additional care is needed when the prefix declarations of the different input datasets are incompatible with each other. For the UniProt data, this happens for a few datasets with the empty prefix. This is easy to fix by preprocessing the respective datasets. I will post my full workflow for indexing UniProt on a page in the GitHub Wiki for QLever.
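Whether point 3 applies to a given set of files can be checked before building the index. The following is a rough sketch, not part of QLever: the function name `find_prefix_conflicts` is made up, and it relies on the same assumption as above, namely that each declaration sits on its own line starting with `@prefix` within the first 50 lines of a file. SPARQL-style `PREFIX` lines are not handled.

```shell
# Print every prefix that is bound to more than one IRI across the given files.
find_prefix_conflicts() {
  for f in "$@"; do
    case "$f" in
      *.xz)  xzcat "$f" ;;
      *.bz2) bzcat "$f" ;;
      *)     cat "$f" ;;
    esac | head -50 | grep '^@prefix' || true   # files without prefixes are fine
  done |
    awk '{ print $2, $3 }' |   # keep only "prefix: <iri>"
    sort -u |                  # one line per distinct (prefix, IRI) pair
    awk '{ count[$1]++ } END { for (p in count) if (count[p] > 1) print p }' |
    sort
}
```

Calling `find_prefix_conflicts *.ttl.xz` prints nothing when all declarations agree. Any prefix it does print (for instance `:` for the empty prefix) marks files that need preprocessing before they can share one set of declarations.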

hannahbast avatar Apr 21 '22 22:04 hannahbast

@JervenBolleman Here is my log of a recent build for the complete UniProt data: https://github.com/ad-freiburg/qlever/wiki/Using-QLever-for-UniProt

hannahbast avatar Apr 27 '22 20:04 hannahbast

@hannahbast minor comment regarding 1: this only works if no blank node label is reused across files. The scope of a blank node is by definition document-local, so simply concatenating N-Triples or Turtle files merges blank nodes that happen to share a label, which could indeed lead to unintended data being loaded into QLever.
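One possible workaround is to make the labels unique per file before concatenating. The sketch below is an illustration only: the function name `cat_with_unique_bnodes` and the `f<N>_` label prefix are arbitrary choices. It assumes uncompressed input in which every blank node label starts a line or follows whitespace, as in N-Triples. It would also rewrite a `_:` that follows whitespace inside a string literal, and it misses Turtle labels that directly follow `,`, `;`, or `(`.

```shell
# Concatenate files to stdout, prefixing every blank node label with a
# per-file tag (f1_, f2_, ...) so that labels from different files cannot clash.
cat_with_unique_bnodes() {
  i=0
  for f in "$@"; do
    i=$((i + 1))
    # Rewrite "_:label" to "_:f<i>_label" at line start or after whitespace.
    sed -E "s/(^|[[:space:]])_:/\1_:f${i}_/g" "$f"
  done
}
```

For example, `cat_with_unique_bnodes a.nt b.nt | IndexBuilderMain -F ttl -f - ...` keeps `_:b0` from `a.nt` and `_:b0` from `b.nt` apart as `_:f1_b0` and `_:f2_b0`.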

LorenzBuehmann avatar May 31 '22 05:05 LorenzBuehmann