ir_datasets
Direct access to all doc_ids
This is something I was expecting to be quite straightforward (or at least better documented in the API), but it doesn't seem to be. Say I want to gather all doc_ids from a given corpus (for instance, to use a random negative sampler at run time). Currently, this is what I do:
import ir_datasets

data = ir_datasets.load("msmarco-document/train")
all_doc_ids = list(data.docs._handler.docs_store().lookup.idx())
which is fine, but, from what I can tell, this triggers an iteration over all docs in the collection (and is also not very intuitive).
Is there a better way to achieve this?
The easiest way to load all doc_ids is:
data = ir_datasets.load("msmarco-document/train")
all_doc_ids = [d.doc_id for d in data.docs]
But, as you say, this iterates over all documents.
I think it would be straightforward enough to add a new API for iterating over just the document IDs, if you think it would be valuable. Maybe exposed as something like: data.docs.doc_ids.
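In the meantime, a user-side helper along those lines might look like the sketch below. iter_doc_ids is hypothetical (not part of the library), built only on the public docs_iter() API, and it still iterates the full corpus under the hood.

import ir_datasets

def iter_doc_ids(dataset):
    # Hypothetical helper: lazily yield doc_ids via the public docs_iter() API.
    for doc in dataset.docs_iter():
        yield doc.doc_id

data = ir_datasets.load("msmarco-document/train")
all_doc_ids = list(iter_doc_ids(data))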
But... for your particular use case, I think you may not actually need the doc_ids themselves. You can just sample by index instead of by doc_id, eliminating the need to load doc_ids at all. For instance, you could do:
import random

num_docs = len(data.docs)
idx = random.randrange(num_docs)
data.docs[idx]
Lookups by index are fast (especially on SSD) and do not load the corpus into memory once a docstore is built (which happens automatically, and is needed anyway for lookups by doc_id).
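Putting that together, a minimal sketch of a run-time random negative sampler built on index lookups could look like this (sample_negative and positive_doc_id are assumptions about the use case, not library API):

import random

import ir_datasets

data = ir_datasets.load("msmarco-document/train")
num_docs = len(data.docs)

def sample_negative(positive_doc_id):
    # Draw random indices until we hit a document other than the positive one.
    while True:
        doc = data.docs[random.randrange(num_docs)]
        if doc.doc_id != positive_doc_id:
            return doc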
Ok, this also works, iterating by the index of the document!
As for
all_doc_ids = [d.doc_id for d in data.docs]
versus
all_doc_ids = list(data.docs._handler.docs_store().lookup.idx())
The second one seems slightly faster (though that could be because, when I tried the first one, I had a tqdm for loop wrapped around it, which may have been adding some overhead).
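If it helps, a rough way to time the two approaches without tqdm in the loop (just a sanity check; timings will vary by machine and disk) is:

import time

import ir_datasets

data = ir_datasets.load("msmarco-document/train")

start = time.perf_counter()
ids_from_iter = [d.doc_id for d in data.docs]
print("iterating docs:", time.perf_counter() - start, "s")

start = time.perf_counter()
ids_from_store = list(data.docs._handler.docs_store().lookup.idx())
print("docstore idx():", time.perf_counter() - start, "s")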
As for adding the API, yes, that shouldn't be very hard. I can do it early next week, if that's ok.
Great, glad the lookups by index work for what you need!
I think the risk of doing:
all_doc_ids = list(data.docs._handler.docs_store().lookup.idx())
is that it's sorted lexically by doc_id (which is what enables the lookups), so the indices will not necessarily align with the corpus iteration order, which users may expect. Maybe this is alright, but we probably want to think a bit more about the design here before pushing this through.
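A quick way to check whether the two orderings actually diverge (just a sanity check over a prefix of the corpus, not a library feature) would be something like:

import itertools

import ir_datasets

data = ir_datasets.load("msmarco-document/train")
iter_order = [d.doc_id for d in itertools.islice(data.docs, 1000)]
store_order = list(itertools.islice(data.docs._handler.docs_store().lookup.idx(), 1000))
print("first 1000 ids align:", iter_order == store_order)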
Not all datasets use an lz4docstore (e.g., the ClueWebs), because we don't want to make a copy of huge corpora, so these cases need some consideration as well.
Hey @ArthurCamara -- quick update on this. Over the past few months I've been working on an alternative file format to facilitate doc_id->idx and idx->doc_id lookups, iteration over doc_ids, etc. It also aims to ditch the searchsorted approach for doc_id->idx lookups in favor of an on-disk hash table, since the former requires doc_ids to be padded to the same length (adding considerable size to some lookups) and has an unfavourable access pattern on disk, which makes it a bit slow until everything is loaded into the cache.
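To make the padding point concrete, here is a toy illustration of a searchsorted-style doc_id->idx lookup (not the actual on-disk format, just the general idea): ids are padded to a fixed width so they can sit in a sorted array and be binary-searched.

import numpy as np

# Toy example: pad ids to a fixed width, keep them in a sorted array,
# and binary-search for lookups. Returned indices follow the sorted order.
doc_ids = ["D10", "D2", "D30051", "D7"]
width = max(len(d) for d in doc_ids)
padded = np.array(sorted(d.rjust(width) for d in doc_ids))

def doc_id_to_idx(doc_id):
    i = int(np.searchsorted(padded, doc_id.rjust(width)))
    if i < len(padded) and padded[i] == doc_id.rjust(width):
        return i
    raise KeyError(doc_id)

print(doc_id_to_idx("D30051"))  # index within the sorted, padded array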
Not sure when it'll be ready for primetime, but just letting you know that a solution to this is in the works.
That sounds awesome, @seanmacavaney. Thanks for letting me know!