Support direct indexing of json "vector" files
It is now very common to use external tools or libraries to produce pre-computed document representations (like those based on learned sparse retrieval).
In these cases, we might see a document collection as a JSONL file, with one document per line.
Anserini already supports this format; see, for example: https://github.com/castorini/anserini/blob/master/src/main/java/io/anserini/collection/JsonVectorCollection.java
One example from Anserini with SPLADE-doc is:

```json
{"id": 9, "contents": "", "vector": {"`": 0, "a": 183, "i": 25, "\u6e05": 46, "\uff5e": 30, "to": 29, "as": 16, "there": 34, "two": 19, "##i": 85, "most": 23, "##t": 34, "team": 70, "american": 43, "south": 46, "war": 32, "life": 100, "much": 60, "here": 8, "music": 3, "end": 33, "old": 25, "april": 62, "set": 39, "party": 68, "song": 6, "ve": 65, "population": 138, "top": 72, "book": 125, "door": 6, "st": 32, "received": 67, "##in": 90, "24": 55, "far": 44, "am": 4, "done": 12, "arms": 154, "summer": 153, "announced": 48, "records": 49, "design": 1, "considered": 10, "miles": 57, "points": 30, "person": 71, "china": 5, "official": 11, "wide": 26, "kept": 109, "##as": 106, "meet": 112, "goal": 20, "limited": 4, "sense": 2, "historic": 19, "lives": 7, "completely": 107, "annual": 0, "failed": 170, "##tion": 16, "expected": 25, "joe": 131, "##ba": 0, "daniel": 15, "mentioned": 71, "picked": 25, "settled": 186, "actress": 5, "reserve": 48, "jersey": 109, "remain": 36, "##go": 149, "##berg": 33, "sort": 0, "andrew": 28, "gets": 115, "sources": 138, "brand": 115, "documentary": 86, "lewis": 117, "##ding": 8, "promotion": 56, "soccer": 15, "5th": 81, "landing": 111, "journalist": 117, "familiar": 75, "productions": 60, "separated": 54, "##ker": 118, "amateur": 33, "li": 71, "membership": 55, "adapted": 89, "suggests": 10, "traveled": 12, "protest": 62, "baltimore": 72, "mitchell": 97, "beast": 12, "indicates": 199, "whisper": 0, "radar": 20, "isolated": 85, "slip": 7, "jefferson": 4, "grandson": 32, "reveals": 185, "##lon": 1, "ya": 4, "##bar": 60, "raced": 41, "halfway": 107, "manufacturers": 3, "dynamic": 11, "severely": 25, "cottage": 20, "ni": 8, "somerset": 40, "newport": 26, "forgot": 11, "chances": 43, "fees": 35, "saxon": 158, "kicking": 91, "testimony": 12, "genesis": 1, "charm": 7, "111": 34, "cart": 102, "mikhail": 4, "kirk": 40, "tanzania": 28, "##itt": 18, "russians": 59, "cnn": 59, "outlet": 202, "skeleton": 33, "##pling": 35, "##hol": 19, "##ipe": 27, "briggs": 50, "##right": 36, "workplace": 27, "alvarez": 8, "debbie": 118, "renee": 90, "reno": 67, "breuning": 33, "wiley": 38, "##fted": 2, "injected": 109, "##ego": 39, "maroon": 162, "kerman": 44, "minnie": 5, "merritt": 38, "goalscorer": 45}}
...
```
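For concreteness, reading this format boils down to parsing one JSON object per line. A minimal sketch (using nlohmann::json purely for illustration; not necessarily what we would use in PISA) could look like this:

```cpp
#include <cstdint>
#include <fstream>
#include <iostream>
#include <string>

#include <nlohmann/json.hpp>

int main(int, char** argv) {
    std::ifstream input(argv[1]);
    std::string line;
    while (std::getline(input, line)) {
        // One document per line: an object with "id", "contents", and "vector".
        auto record = nlohmann::json::parse(line);
        // "vector" maps each term to its pre-computed (quantized) impact.
        for (auto const& [term, impact] : record["vector"].items()) {
            std::cout << record["id"].dump() << '\t' << term << '\t'
                      << impact.get<std::uint32_t>() << '\n';
        }
    }
}
```

Note that the vectors can contain zero-valued impacts (e.g., `"annual": 0` above), so whatever reads this format would have to decide whether to keep or drop those postings.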
Ideally, we could build a tool that wraps both index compression and WAND data creation for these file types. Alternatively, we could incorporate a JSON reader into each tool.
@JMMackenzie I'm wondering about the `id` field here. Do you know whether this is in fact a number?
I suppose this may be a format that Anserini uses internally, and thus these are internal IDs. But if we were to feed some dataset to PISA, we would need to give documents some titles/labels (string IDs); not to mention that we do not, as is, support arbitrary numerical IDs: they need to be consecutive numbers, so we would need to assign them ourselves.
If necessary, we can parse it as "int or string", and represent the ints as strings. But I have to assume that when people index those collections, they have some kind of string labels.
Since we also support URLs, we can have an optional URL field as well.
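A rough sketch of that "int or string" handling (the `"url"` field name is hypothetical here; nothing in the format defines it yet):

```cpp
#include <optional>
#include <string>

#include <nlohmann/json.hpp>

struct DocLabel {
    std::string title;               // string label used as the document title
    std::optional<std::string> url;  // optional URL, if the record carries one
};

// "Int or string": whatever type `id` has, normalize it to a string label.
DocLabel read_label(nlohmann::json const& record) {
    auto const& id = record.at("id");
    DocLabel label;
    label.title = id.is_string() ? id.get<std::string>()
                                 : std::to_string(id.get<long long>());
    // Hypothetical optional "url" field; ignored when absent.
    if (auto it = record.find("url"); it != record.end() && it->is_string()) {
        label.url = it->get<std::string>();
    }
    return label;
}
```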
> It is now very common to use external tools or libraries to produce pre-computed document representations (like those based on learned sparse retrieval).
It would be helpful to have an example or two, or find people who have done it. Needs citation 😂
I can certainly start implementing this, and we can pin down the details as we go.
> Ideally, we could build a tool that wraps both index compression and WAND data creation for these file types. Alternatively, we could incorporate a JSON reader into each tool.
Well, if you think about it, only `invert` needs to read this format; `create_wand_data` takes the inverted index only.
What we'd need is to build the term/document lexicon, which is done when parsing rather than inverting.
The simplest idea would be to build a forward index from the input file, either using `parse_collection` or a separate tool. The immediate advantage would be that the rest of the workflow stays the same.
I think in the short term, this is the way to do it. As discussed elsewhere, we could simplify a lot by introducing a "controller" process that would streamline the entire indexing, including this part.
I suspect `id` should be treated as a string, and that we should map the IDs ourselves. I don't think we should assume that `id` comes as a sorted integer indexed from zero, in any case.
I agree with your take, though; that seems the easiest way to go.
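To make the plan concrete, here is a rough sketch of what that single forward-index pass could look like: consecutive internal doc IDs, string labels kept as the document lexicon, and a term lexicon built on the fly. Term IDs are assigned in first-seen order here, which is an assumption; the real tool would need whatever ordering our lexicon format expects.

```cpp
#include <cstdint>
#include <istream>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

#include <nlohmann/json.hpp>

struct ForwardIndex {
    std::vector<std::string> doc_labels;  // document lexicon: position = internal doc ID
    std::vector<std::string> terms;       // term lexicon: position = internal term ID
    // One entry per document: its (term ID, impact) pairs.
    std::vector<std::vector<std::pair<std::uint32_t, std::uint32_t>>> documents;
};

ForwardIndex build_forward_index(std::istream& input) {
    ForwardIndex fwd;
    std::unordered_map<std::string, std::uint32_t> term_ids;
    std::string line;
    while (std::getline(input, line)) {
        auto record = nlohmann::json::parse(line);
        auto const& id = record.at("id");
        // Internal doc IDs are consecutive; the original `id` becomes the label.
        fwd.doc_labels.push_back(id.is_string()
                                     ? id.get<std::string>()
                                     : std::to_string(id.get<long long>()));
        std::vector<std::pair<std::uint32_t, std::uint32_t>> postings;
        for (auto const& [term, impact] : record.at("vector").items()) {
            auto [it, inserted] = term_ids.try_emplace(
                term, static_cast<std::uint32_t>(fwd.terms.size()));
            if (inserted) {
                fwd.terms.push_back(term);
            }
            postings.emplace_back(it->second, impact.get<std::uint32_t>());
        }
        fwd.documents.push_back(std::move(postings));
    }
    return fwd;
}
```

From there, the rest of the workflow (`invert`, `create_wand_data`, compression) stays the same, as noted above.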
For an example, look here: https://github.com/naver/splade?tab=readme-ov-file#evaluating-a-pre-trained-model
You want to check https://github.com/naver/splade/blob/main/splade/create_anserini.py