Sean MacAvaney comments

Results: 228 comments by Sean MacAvaney

data

Hi, the format description of these files is given here: https://github.com/Georgetown-IR-Lab/cedr#getting-started In short, training pairs are sampled from lines like `[query-id] [doc-id]`, and run files are the standard TREC run...
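As a rough illustration of the two formats mentioned above, the sketch below parses a training-pairs file and a TREC run file. It assumes whitespace-separated fields and the conventional six-column TREC run layout (`qid Q0 docid rank score run-tag`). The function names and sample IDs are made up for illustration and are not part of CEDR.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def parse_train_pairs(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Group doc IDs by query ID from lines like `[query-id] [doc-id]`."""
    pairs: Dict[str, List[str]] = defaultdict(list)
    for line in lines:
        fields = line.split()
        if len(fields) != 2:  # skip blank or malformed lines
            continue
        qid, docid = fields
        pairs[qid].append(docid)
    return dict(pairs)


def parse_trec_run(lines: Iterable[str]) -> Dict[str, List[Tuple[str, float]]]:
    """Read TREC run lines of the form: qid Q0 docid rank score run-tag."""
    run: Dict[str, List[Tuple[str, float]]] = defaultdict(list)
    for line in lines:
        fields = line.split()
        if len(fields) != 6:  # skip blank or malformed lines
            continue
        qid, _q0, docid, _rank, score, _tag = fields
        run[qid].append((docid, float(score)))
    return dict(run)


# Made-up sample data, for illustration only.
sample_pairs = ["301 FBIS3-1", "301 FBIS3-2", "302 LA0101-3"]
sample_run = ["301 Q0 FBIS3-1 1 12.5 bm25", "301 Q0 FBIS3-2 2 11.0 bm25"]

train_pairs = parse_train_pairs(sample_pairs)
run = parse_trec_run(sample_run)
print(train_pairs)
print(run)
```

In practice the lines would come from open file handles rather than in-memory lists; both functions accept any iterable of strings.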

data

@wangxinzhe123 -- ultimately how you construct these files depends on your experimental setup. The main questions are: 1) What results do you want CEDR to re-rank? 2) What data do...

data

That again depends on what experiment you're running -- especially since you mention that you're running it with different datasets. Since you brought up Indri, here's documentation on it: https://sourceforge.net/p/lemur/wiki/IndriBuildIndex%20Parameters/...

File structure stated in msmarco_passage.py is not aligned with downloaded top1000.dev.tar.gz

Thanks for the report. I'm not able to reproduce it when following the instructions provided by the software. Specifically, when requesting scoreddocs of `msmarco-passage/dev/small`, I get the following message as...

UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 3417: character maps to <undefined> when trying to decode docs

Thanks! I suspect it's this issue: https://github.com/allenai/ir_datasets/issues/151 There's a branch that fixes it, but for some reason, it hasn't been merged into the main branch: https://github.com/allenai/ir_datasets/tree/encoding-fixes I'll look into merging...
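For context, a 'charmap' decode error of this kind typically appears when UTF-8 bytes are read with a single-byte platform default encoding. The sketch below reproduces it under the assumption that the default is cp1252 (common on Windows). It illustrates the general failure mode only and is not taken from the ir_datasets code or the linked branch.

```python
# U+201D (right double quotation mark) is encoded in UTF-8 as E2 80 9D.
raw = "\u201d".encode("utf-8")

try:
    # Byte 0x9d has no mapping in Python's cp1252 codec, so this raises.
    raw.decode("cp1252")
    decode_failed = False
except UnicodeDecodeError as err:
    decode_failed = True
    print(err)

# Decoding with the encoding the bytes were written in succeeds.
text = raw.decode("utf-8")
```

When reading files directly, passing `encoding="utf-8"` to `open()` avoids depending on the platform default.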

UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 3417: character maps to <undefined> when trying to decode docs

It also looks like the `FixEncoding` module was bypassed, which is why you're getting all the characters like `â€”`. (`FixEncoding` replaces them with their correct unicode versions.) As with #209,...
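To make the mojibake concrete: character runs like these arise when UTF-8 bytes are decoded as cp1252. Below is a minimal sketch of reversing that, assuming the text was mis-decoded exactly once as cp1252. The `repair_mojibake` helper is hypothetical and only illustrates the idea; it is not how `FixEncoding` is implemented.

```python
def repair_mojibake(garbled: str) -> str:
    """Undo one round of UTF-8-read-as-cp1252; return the input unchanged on failure."""
    try:
        return garbled.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return garbled


# U+2019 (right single quotation mark) is E2 80 99 in UTF-8.
# Reading those bytes as cp1252 yields three unrelated characters.
garbled = "\u2019".encode("utf-8").decode("cp1252")

fixed = repair_mojibake("it" + garbled + "s")
print(repr(garbled), "->", repr(fixed))
```

A round trip like this only works when every garbled character maps back to a cp1252 byte, which is why the helper falls back to returning the input.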

Permissions error on /tmp/ir_dataset directory due to multiple users on the same server

Hi @yuenherny -- it looks like this is a different issue. Do you have multiple processes open using ir_datasets (e.g., multiple notebook instances)? As files are downloading, only a single...

Permissions error on /tmp/ir_dataset directory due to multiple users on the same server

> and when one hits an error, the process isn't closed automatically

Gotcha -- thanks! This is a bug, as it should close the file in this case so others...
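The behaviour described here, releasing a held resource even when the holder fails, can be sketched with a context manager. This is a hypothetical illustration: the `exclusive_lock` helper and the lock-file approach are assumptions made for the example and do not reflect how ir_datasets implements its download handling.

```python
import contextlib
import os
import tempfile


@contextlib.contextmanager
def exclusive_lock(path):
    """Hypothetical helper: hold `path` as a lock file and always remove it on exit."""
    # O_EXCL makes creation fail with FileExistsError if the lock already exists.
    fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    try:
        yield
    finally:
        os.close(fd)
        os.remove(path)  # runs on errors too, so other processes are not blocked


lock_path = os.path.join(tempfile.mkdtemp(), "download.lock")

try:
    with exclusive_lock(lock_path):
        raise RuntimeError("simulated download failure")
except RuntimeError:
    pass

# The lock file is gone even though the body raised.
lock_released = not os.path.exists(lock_path)
print(lock_released)
```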

"every doc needs a text field/property" policy

Starting on this. Here's a list of all `NamedTuple`s for queries and docs:

```
[x] ir_datasets/datasets/aol_ia.py: AolIaDoc
[x] ir_datasets/datasets/beir.py: BeirDoc
[x] ir_datasets/datasets/beir.py: BeirTitleDoc
[x] ir_datasets/datasets/beir.py: BeirTitleUrlDoc
[ ] ir_datasets/datasets/beir.py: BeirSciDoc...
```

Fix and add html extractor

Thanks for bumping this PR @heinrichreimer, and thanks @grodino for the contribution!