Index format v1.0
I am opening this one to start a discussion about our index format.
Ideally we would like to have a property file, say .prop, where we store metadata about the index.
For now, I can think of storing two things in the property file: an index version (for compatibility purposes) and the number of documents in the collection.
Right now the latter is stored inside the .docs file, which is not very elegant.
Also, if we do that we might remove the following as it is a duplicate.
https://github.com/pisa-engine/pisa/blob/46917a1d93e340c8cdd9e6b2b60b552d99df1779/include/pisa/forward_index_builder.hpp#L184
Is there anything else we should store in there?
This seems reasonable.
Just as an example, Indri uses an XML format for storing metadata, and Galago uses JSON I think.
For example, here is an Indri manifest file for an old Gov2 Index:
<parameters>
  <code-build-date>Apr 21 2014</code-build-date>
  <corpus>
    <document-base>1</document-base>
    <frequent-terms>177031</frequent-terms>
    <maximum-document>25205180</maximum-document>
    <total-documents>25205179</total-documents>
    <total-terms>23451774775</total-terms>
    <unique-terms>39177923</unique-terms>
  </corpus>
  <fields></fields>
  <indri-distribution>Indri development release 5.6</indri-distribution>
  <type>DiskIndex</type>
</parameters>
I propose that we could store the following information (perhaps):
- Version data
- Number of documents
- Number of postings lists (terms)
- Compression codec (? - would allow us to reduce command line args)
Would it also be worth storing a similar file for the wand structures?
- Variable vs Static blocks
- Range-oriented blocks (or not)
- Perhaps the mean block size
- Block compression
What do you think? These are just my thoughts, but it would definitely be nice to have this information.
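To make the collection-level list above concrete, here is a minimal sketch of how those fields might be held in memory and written out as simple `key: value` lines (which also happen to be valid YAML). The struct name `IndexProperties`, the function `serialize`, and all key names are hypothetical; nothing here exists in PISA.

```cpp
#include <cstdint>
#include <optional>
#include <sstream>
#include <string>

// Hypothetical container for the proposed collection metadata.
struct IndexProperties {
    std::string version;               // index format version, e.g. "1.0.0"
    std::uint64_t documents = 0;       // number of documents in the collection
    std::uint64_t terms = 0;           // number of posting lists (unique terms)
    std::optional<std::string> codec;  // compression codec, if the index is compressed
};

// Serializes the properties as one "key: value" pair per line.
// The version is quoted so that it is always read back as a string.
inline std::string serialize(IndexProperties const& props) {
    std::ostringstream os;
    os << "version: \"" << props.version << "\"\n"
       << "documents: " << props.documents << '\n'
       << "terms: " << props.terms << '\n';
    if (props.codec.has_value()) {
        os << "codec: " << *props.codec << '\n';
    }
    return os.str();
}
```

The codec is optional in this sketch because an uncompressed collection would have nothing to put there.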
Yes, it is going to look like the indri manifest.
I would use either yaml or json.
indri-distribution looks like what is going to be our version, so I would just call it version. We need to decide whether our data version is going to correspond to the software version. I would say so; it will make everything easier. We can rely on the major version to decide whether an index is incompatible.
total-documents is what we need too (number of documents), but I don't see a need for document-base and maximum-document (at least for now).
unique-terms can be computed in O(1), but we can definitely add it for convenience. What do you think?
total-terms: we also need this one for language model scorers.
Now, this only concerns uncompressed collections, which are the {.docs, .freqs} files. What you are saying also makes sense: we can have a header in our compressed binary indexes and in the additional data files, in order to reduce the number of parameters passed and make the CLI safer. I would defer this change to a separate discussion, though.
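The major-version rule suggested above could look roughly like the following sketch. It assumes versions are written as `MAJOR.MINOR.PATCH` strings and that the index version tracks the software version; `major_version` and `is_compatible` are hypothetical helper names, not existing PISA functions. It needs C++17.

```cpp
#include <charconv>
#include <optional>
#include <string_view>
#include <system_error>

// Extracts the leading MAJOR component of a version string such as "1.4.2".
// Returns std::nullopt if that component is missing or not a plain number.
inline std::optional<unsigned> major_version(std::string_view version) {
    std::string_view head = version.substr(0, version.find('.'));
    unsigned major = 0;
    auto const* first = head.data();
    auto const* last = head.data() + head.size();
    auto [ptr, ec] = std::from_chars(first, last, major);
    if (ec != std::errc{} || ptr != last) {
        return std::nullopt;
    }
    return major;
}

// An index is treated as readable only when both versions parse
// and share the same major component.
inline bool is_compatible(std::string_view index_version,
                          std::string_view software_version) {
    auto index_major = major_version(index_version);
    auto software_major = major_version(software_version);
    return index_major.has_value() && software_major.has_value()
        && *index_major == *software_major;
}
```

With this rule, an index written by 1.0.0 would still load in 1.4.2 but would be rejected by 2.0.0.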
> I would use either yaml or json
Yes, definitely not XML. I think yaml is a good idea since it's very much readable, more so than JSON.
Should we go with https://github.com/jbeder/yaml-cpp ? Do you know of anything lighter or better?
This one seems reasonable.
https://arp242.net/weblog/json_as_configuration_files-_please_dont
An interesting point of view
Considering that we already have Boost and we want to use either INI (which I prefer) or yaml, I would say we start with Boost.PropertyTree. Here is an example: https://stackoverflow.com/a/15648662/5675412
Looks quite reasonable.