Index format v1.0

Open amallia opened this issue 6 years ago • 8 comments

I am opening this one to start a discussion about our index format.

Ideally we would like to have a property file, say .prop, where we store meta information about the index. For now, I can think of storing two things in it: an index version (for compatibility purposes) and the number of documents in the collection. Right now the latter is stored inside the .docs file, which is not very elegant. Also, if we do that, we could remove the following, since it is a duplicate: https://github.com/pisa-engine/pisa/blob/46917a1d93e340c8cdd9e6b2b60b552d99df1779/include/pisa/forward_index_builder.hpp#L184

Is there anything else we should store in there?

amallia avatar Apr 02 '19 20:04 amallia

This seems reasonable.

Just as an example, Indri uses an XML format for storing metadata, and Galago uses JSON I think.

For example, here is an Indri manifest file for an old Gov2 Index:

<parameters>
	<code-build-date>Apr 21 2014</code-build-date>
	<corpus>
		<document-base>1</document-base>
		<frequent-terms>177031</frequent-terms>
		<maximum-document>25205180</maximum-document>
		<total-documents>25205179</total-documents>
		<total-terms>23451774775</total-terms>
		<unique-terms>39177923</unique-terms>
	</corpus>
	<fields></fields>
	<indri-distribution>Indri development release 5.6</indri-distribution>
	<type>DiskIndex</type>
</parameters>

I propose that we could store the following information (perhaps):

  • Version data
  • Number of documents
  • Number of postings lists (terms)
  • Compression codec (? - would allow us to reduce command line args)
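
To make the list above concrete, here is a minimal sketch of how such a record could be held in memory and round-tripped through a text file. Everything in it is an assumption for illustration: the struct name `IndexMetadata`, the field and key names, and the plain `key=value` layout are hypothetical and are not PISA's actual format or API.

```cpp
#include <cstdint>
#include <istream>
#include <ostream>
#include <sstream>
#include <stdexcept>
#include <string>

// Hypothetical metadata record; field names are illustrative only.
struct IndexMetadata {
    std::string version;              // version data
    std::uint64_t num_documents = 0;  // number of documents
    std::uint64_t num_terms = 0;      // number of postings lists (terms)
    std::string codec;                // compression codec name
};

// Write one "key=value" pair per line (assumed layout, not a decided format).
void write_metadata(std::ostream& out, const IndexMetadata& meta) {
    out << "version=" << meta.version << '\n'
        << "num_documents=" << meta.num_documents << '\n'
        << "num_terms=" << meta.num_terms << '\n'
        << "codec=" << meta.codec << '\n';
}

// Read the pairs back; throws std::runtime_error on a line without '='.
IndexMetadata read_metadata(std::istream& in) {
    IndexMetadata meta;
    std::string line;
    while (std::getline(in, line)) {
        if (line.empty()) {
            continue;
        }
        const auto eq = line.find('=');
        if (eq == std::string::npos) {
            throw std::runtime_error("malformed metadata line: " + line);
        }
        const std::string key = line.substr(0, eq);
        const std::string value = line.substr(eq + 1);
        if (key == "version") {
            meta.version = value;
        } else if (key == "num_documents") {
            meta.num_documents = std::stoull(value);
        } else if (key == "num_terms") {
            meta.num_terms = std::stoull(value);
        } else if (key == "codec") {
            meta.codec = value;
        }
        // Unknown keys are skipped so that files with extra fields stay readable.
    }
    return meta;
}

// Convenience wrapper for parsing from an in-memory string.
IndexMetadata parse_metadata(const std::string& text) {
    std::istringstream in(text);
    return read_metadata(in);
}
```

With a record like this, a tool could read the codec from the file instead of taking it as a command line argument, which is the saving mentioned in the last bullet.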

Would it also be worth storing a similar file for the wand structures?

  • Variable vs Static blocks
  • Range-oriented blocks (or not)
  • Perhaps the mean block size
  • Block compression

What do you think? These are just my thoughts, but it would definitely be nice to have this information.

JMMackenzie avatar Apr 02 '19 22:04 JMMackenzie

Yes, it is going to look like the Indri manifest.
I would use either yaml or json.

indri-distribution looks like what is going to be our version, so I would just call it version. We need to decide whether our data version is going to correspond to the software version. I would say yes, since it will make everything easier: we can rely on the major version to decide whether an index is incompatible.
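
The major-version rule could be sketched roughly as follows. This assumes version strings of the form MAJOR.MINOR.PATCH, which the discussion has not fixed, and the function names `major_version` and `index_is_compatible` are hypothetical helpers, not existing PISA functions.

```cpp
#include <cstddef>
#include <optional>
#include <string>

// Hypothetical helper: extract the leading major number from a version
// string such as "1.4.2". Returns std::nullopt when the string does not
// start with digits, or when the digits are followed by anything but '.'.
std::optional<unsigned> major_version(const std::string& version) {
    std::size_t pos = 0;
    unsigned major = 0;
    while (pos < version.size() && version[pos] >= '0' && version[pos] <= '9') {
        major = major * 10 + static_cast<unsigned>(version[pos] - '0');
        ++pos;
    }
    if (pos == 0) {
        return std::nullopt;  // no leading digits
    }
    if (pos < version.size() && version[pos] != '.') {
        return std::nullopt;  // e.g. "1x"
    }
    return major;
}

// An index is treated as compatible only when both versions parse and
// their major components are equal (assumed policy, per the comment above).
bool index_is_compatible(const std::string& index_version,
                         const std::string& software_version) {
    const auto index_major = major_version(index_version);
    const auto software_major = major_version(software_version);
    return index_major && software_major && *index_major == *software_major;
}
```

Under this policy an index written by 1.0.0 would still load in 1.4.2 but be rejected by 2.0.0.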

total-documents is what we need (number of documents), but I don't see a need for document-base and maximum-document (at least for now).

unique-terms can be computed in O(1), but we can definitely add it for convenience. What do you think?

total-terms: we also need this one for language model scorers.

Now, this only concerns uncompressed collections, which are the {.docs, .freqs} files. What you are saying also makes sense: we can have a header in our compressed binary indexes and in the additional data files, in order to reduce the number of parameters passed and make the CLI safer. I would defer this change to a separate discussion, though.

amallia avatar Apr 02 '19 22:04 amallia

> I would use either yaml or json

Yes, definitely not XML. I think yaml is a good idea since it's very readable, more so than JSON.

elshize avatar Apr 08 '19 20:04 elshize

Should we go with https://github.com/jbeder/yaml-cpp ? Do you know of anything lighter or better?

amallia avatar Apr 08 '19 23:04 amallia

This one seems reasonable.

elshize avatar Apr 09 '19 12:04 elshize

https://arp242.net/weblog/json_as_configuration_files-_please_dont

An interesting point of view

amallia avatar Apr 13 '19 20:04 amallia

Considering that we already have Boost and we want to use either INI (which I prefer) or yaml, I would start with Boost.PropertyTree. Here is an example: https://stackoverflow.com/a/15648662/5675412

amallia avatar Apr 18 '19 09:04 amallia

Looks quite reasonable.

elshize avatar Apr 18 '19 09:04 elshize