Sean MacAvaney

229 comments by Sean MacAvaney

The easiest way to load all doc_ids is:

```python
import ir_datasets

data = ir_datasets.load("msmarco-document/train")
all_doc_ids = [d.doc_id for d in data.docs]
```

But, as you say, this iterates over all documents. I...

Great, glad the lookups by index work for what you need! I think the risk of doing:

```python
all_doc_ids = list(data.docs._handler.docs_store().lookup.idx())
```

is that it's sorted (lexically) by docid, enabling...
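To illustrate what lexical ordering means for doc IDs, here is a minimal sketch. The `D…` IDs are made up for illustration and do not come from any dataset. Sorting strings compares them character by character, so the result can differ from numeric order:

```python
# Hypothetical doc IDs, for illustration only (not from any real dataset).
doc_ids = ["D2", "D10", "D1"]

# Plain string sort: compares character by character.
lexical = sorted(doc_ids)

# Sort by the numeric suffix instead (assumes IDs look like "D<number>").
numeric = sorted(doc_ids, key=lambda d: int(d[1:]))

print(lexical)
print(numeric)
```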

Hey @ArthurCamara -- quick update on this. Over the past few months I've been working on an alternative file format to facilitate `doc_id->idx` and `idx->doc_id` lookups, iteration over `doc_id`s, etc....
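For readers unfamiliar with the idea, a two-way `doc_id->idx` / `idx->doc_id` lookup can be sketched in memory as below. This only illustrates the concept and is not the file format mentioned above. The class name `DocIdIndex` and its methods are hypothetical, and the doc IDs are made up.

```python
class DocIdIndex:
    """Hypothetical in-memory two-way mapping between doc_ids and positions."""

    def __init__(self, doc_ids):
        # idx -> doc_id: a list preserves the given order.
        self._ids = list(doc_ids)
        # doc_id -> idx: a dict gives constant-time lookups on average.
        self._idx = {doc_id: i for i, doc_id in enumerate(self._ids)}
        if len(self._idx) != len(self._ids):
            raise ValueError("doc_ids must be unique")

    def idx_to_doc_id(self, idx):
        return self._ids[idx]

    def doc_id_to_idx(self, doc_id):
        return self._idx[doc_id]

    def __iter__(self):
        # Iterate over doc_ids without touching any document contents.
        return iter(self._ids)

    def __len__(self):
        return len(self._ids)


# Usage with made-up IDs:
index = DocIdIndex(["D2", "D10", "D1"])
print(index.doc_id_to_idx("D10"))
print(index.idx_to_doc_id(2))
```

Keeping both directions in memory doubles the bookkeeping, which is one reason an on-disk format may be preferable for large corpora.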

Hmmm, okay, I see. Even though the format is relatively simple, I'm not so keen on writing my own parser for the format. So the only other reasonable option for...

I have a copy of the corpus. I think there are interesting questions here about how to incorporate the fact that it's (essentially) a fixed-length passage retrieval task. I.e., should...

A downside of adding these is that they only consist of test components. If folks wanted to train on some of these (at least, the ones that have training data),...

I stand corrected on the topic of metadata & other fields. Some is provided for queries and docs under the `metadata` key.

@searchivarius reminds me that the BEIR doc objects ought to be better structured, especially RE the metadata. Most are either `(doc_id, text)` or `(doc_id, title, text)`. A few have (undocumented?)...

Awesome! Let me know if there are changes in ir_datasets that could help facilitate this. You can access the documentation for a given dataset via `dataset.documentation()`, which returns a dict....

Looks like it uses TREC CAR v2, so depends on #5