Sean MacAvaney comments

Results: 229 comments of Sean MacAvaney

Direct access to all doc_ids

The easiest way to load all doc_ids is:

```python
import ir_datasets

data = ir_datasets.load("msmarco-document/train")
all_doc_ids = [d.doc_id for d in data.docs]
```

But, as you say, this iterates over all documents. I...

Direct access to all doc_ids

Great, glad the lookups by index work for what you need! I think the risk of doing:

```python
all_doc_ids = list(data.docs._handler.docs_store().lookup.idx())
```

is that it's sorted (lexically) by docid, enabling...
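To illustrate what lexical ordering by docid means in practice, here is a small sketch. The doc_ids below are made up for the example and are not taken from any dataset; the point is only that string sorting and numeric sorting can disagree.

```python
# Made-up doc_ids, for illustration only.
doc_ids = ["D10", "D2", "D1"]

# Lexical (string) order compares character by character.
lexical = sorted(doc_ids)

# Numeric order, assuming every id is "D" followed by an integer.
numeric = sorted(doc_ids, key=lambda d: int(d[1:]))

print(lexical)
print(numeric)
```

So an index that is sorted lexically will not, in general, match the order in which documents appear in the original corpus files.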

Direct access to all doc_ids

Hey @ArthurCamara -- quick update on this. Over the past few months I've been working on an alternative file format to facilitate `doc_id->idx` and `idx->doc_id` lookups, iteration over `doc_id`s, etc....
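As a rough illustration of what `doc_id->idx` and `idx->doc_id` lookups involve, here is a minimal in-memory sketch. It is not the file format mentioned in the comment, whose details are not given here. The class name `DocIdLookup` and its methods are hypothetical, and it assumes all doc_ids fit in memory.

```python
from bisect import bisect_left
from typing import Iterable, Iterator, List


class DocIdLookup:
    """Hypothetical in-memory doc_id <-> idx mapping (illustration only)."""

    def __init__(self, doc_ids: Iterable[str]) -> None:
        # Sort lexically and drop duplicates so each doc_id has one stable index.
        self._ids: List[str] = sorted(set(doc_ids))

    def idx_to_doc_id(self, idx: int) -> str:
        # Position in the sorted list -> doc_id.
        return self._ids[idx]

    def doc_id_to_idx(self, doc_id: str) -> int:
        # Binary search over the sorted list; raises KeyError if absent.
        pos = bisect_left(self._ids, doc_id)
        if pos == len(self._ids) or self._ids[pos] != doc_id:
            raise KeyError(doc_id)
        return pos

    def __iter__(self) -> Iterator[str]:
        # Iterate over doc_ids without touching any document contents.
        return iter(self._ids)

    def __len__(self) -> int:
        return len(self._ids)


lookup = DocIdLookup(["D3", "D1", "D2"])
print(lookup.doc_id_to_idx("D2"), lookup.idx_to_doc_id(0), list(lookup))
```

A sorted list plus binary search keeps both directions cheap without a second dictionary, at the cost of indices following lexical rather than corpus order.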

handling .z files as gzip

Hmmm, okay, I see. Even though the format is relatively simple, I'm not so keen on writing my own parser for the format. So the only other reasonable option for...

TREC Podcasts

I have a copy of the corpus. I think there are interesting questions here about how to incorporate the fact that it's (essentially) a fixed-length passage retrieval task. I.e., should...

beir suite

A downside of adding these is that they only consist of test components. If folks wanted to train on some of these (at least, the ones that have training data),...

beir suite

I stand corrected on the topic of metadata & other fields. Some is provided for queries and docs under the `metadata` key.

beir suite

@searchivarius reminds me that the BEIR doc objects ought to be better structured, especially RE the metadata. Most are either `(doc_id, text)` or `(doc_id, title, text)`. A few have (undocumented?)...
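For readers unfamiliar with the two common shapes mentioned above, they can be pictured as named tuples. This is only a sketch: the type names `SketchDoc` and `SketchTitleDoc` and the helper `full_text` are hypothetical and are not the types ir_datasets defines.

```python
from typing import NamedTuple, Union


class SketchDoc(NamedTuple):
    # Hypothetical stand-in for the (doc_id, text) shape.
    doc_id: str
    text: str


class SketchTitleDoc(NamedTuple):
    # Hypothetical stand-in for the (doc_id, title, text) shape.
    doc_id: str
    title: str
    text: str


def full_text(doc: Union[SketchDoc, SketchTitleDoc]) -> str:
    # Prepend the title when the doc type has one; otherwise return the text.
    title = getattr(doc, "title", "")
    return f"{title} {doc.text}".strip()


print(full_text(SketchDoc("d1", "body only")))
print(full_text(SketchTitleDoc("d2", "A Title", "and a body")))
```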

Datamaestro integration

Awesome! Let me know if there are changes in ir_datasets that could help facilitate this. You can access the documentation for a given dataset via `dataset.documentation()`, which returns a dict....

TREC CAST

Looks like it uses TREC CAR v2, so depends on #5