Sean MacAvaney comments

Results: 224 comments of Sean MacAvaney

Py tokeniser - WIP

The EnglishTokenizer implementation was around 4x slower than the Java one, so I took a stab at improving it using regular expressions (https://github.com/terrier-org/pyterrier/pull/508/commits/aaf486e9751ea57731b2cd6713dab7993ad6c6a1). This brought throughput up to about the...
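To illustrate the general idea of swapping a per-character loop for a regular expression, here is a minimal sketch. It is not the code from the linked commit: the function names and the token rule (lowercased runs of ASCII letters and digits) are assumptions for illustration only, and Terrier's actual tokenizer applies further rules that are not reproduced here.

```python
import re

# Assumed token rule for this sketch: runs of ASCII letters and digits.
_TOKEN_RE = re.compile(r"[A-Za-z0-9]+")


def tokenize_loop(text):
    """Baseline: walk the string one character at a time in Python."""
    tokens, current = [], []
    for ch in text:
        if ch.isascii() and ch.isalnum():
            current.append(ch.lower())
        elif current:
            # A non-token character closes the token in progress.
            tokens.append("".join(current))
            current = []
    if current:
        tokens.append("".join(current))
    return tokens


def tokenize_regex(text):
    """Same rule, but the scanning happens inside the regex engine."""
    return [tok.lower() for tok in _TOKEN_RE.findall(text)]


sample = "Hello, World! 42nd street"
# Both variants should agree on the same input.
assert tokenize_loop(sample) == tokenize_regex(sample)
```

The regex variant tends to be faster in CPython because the scan runs in the `re` module's C implementation rather than in a Python-level loop, though the actual speedup depends on the input.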

Py tokeniser - WIP

Alright, both the English and UTF tokenizers are only around 20% slower than the Java versions and they match the tokenization on the first million documents from msmarco-passage. I'm not...

Py tokeniser - WIP

This seems like a reasonable place for them, given that they replicate Terrier's tokenizer behaviors, which is not necessarily what we'd suggest in general. (For example, the UTFTokeniser skips everything...

TREC 2024 Tip-of-the-Tongue

Fixed with #272, sorry for the delay!

MS MARCO v2.1 and v2.1 segmented for TREC 2024 RAG

Awesome, thanks! I'll take a look at it tomorrow and see if I can tick some of the other tasks :)

Add msmarco v2.1 trec rag

Hey, can you try out the revision? I was getting errors running the tests before, so I refactored a bit. Now it's using the classes from v2 where possible.

Add msmarco v2.1 trec rag

Digging into this...

TerrierIndex (artifact) API support loading indices into memory

A few options off the top of my head:
- Option A: Constructor `TerrierIndex(path, in_memory=True)`
  - Downside 1: Not great when doing stuff like from_hf()
  - Downside 2: It's not...

TerrierIndex (artifact) API support loading indices into memory

Right, since the retriever takes an index reference instead. Sounds reasonable. Option A could work with from_hf() if we let it take kwargs and pass them through. But that's a...

TerrierIndex (artifact) API support loading indices into memory

I'm open to it as long as we're careful not to swallow kwargs in the artifact's constructor, please!
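A rough sketch of what "pass kwargs through without swallowing them" could look like. Everything here is hypothetical: `SketchIndex` and `from_hub` are stand-ins invented for illustration, not PyTerrier's `TerrierIndex` or its real `from_hf()` signature. The point is only that the constructor declares its options explicitly instead of accepting `**kwargs`, so a misspelled option fails loudly.

```python
class SketchIndex:
    """Hypothetical stand-in for an index artifact (not the real class)."""

    def __init__(self, path, *, in_memory=False):
        # Options are explicit keyword-only parameters. There is no
        # **kwargs here, so unknown options raise TypeError rather
        # than being silently ignored.
        self.path = path
        self.in_memory = in_memory

    @classmethod
    def from_hub(cls, repo_id, **kwargs):
        # Placeholder for a download step; a real loader would fetch
        # the artifact and return its local path.
        path = f"/tmp/cache/{repo_id}"
        # Forward everything to the constructor, which validates it.
        return cls(path, **kwargs)


idx = SketchIndex.from_hub("org/some-index", in_memory=True)
assert idx.in_memory

try:
    SketchIndex.from_hub("org/some-index", in_memroy=True)  # typo on purpose
except TypeError as err:
    print("rejected:", err)
```

With this shape, the loader can stay generic while the constructor remains the single place that decides which options are valid.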