Sean MacAvaney

224 comments by Sean MacAvaney

The EnglishTokenizer implementation was around 4x slower than the Java one, so I took a stab at improving it using regular expressions (https://github.com/terrier-org/pyterrier/pull/508/commits/aaf486e9751ea57731b2cd6713dab7993ad6c6a1). This brought throughput up to about the...

Alright, both the English and UTF tokenizers are only around 20% slower than the Java versions and they match the tokenization on the first million documents from msmarco-passage. I'm not...
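
For readers unfamiliar with the approach, here is a minimal sketch of regex-based tokenization. It is not the code from the PR: the function name `tokenize_english` and the rule (lowercase, then keep maximal runs of ASCII letters and digits) are assumptions for illustration, and it does not try to replicate Terrier's exact behavior.

```python
import re

# Assumption: a token is a maximal run of ASCII letters or digits.
# Compiling the pattern once avoids repeated pattern lookups per call.
_TOKEN_RE = re.compile(r"[a-z0-9]+")


def tokenize_english(text):
    """Hypothetical helper: lowercase the text and return its alphanumeric runs."""
    return _TOKEN_RE.findall(text.lower())


print(tokenize_english("Hello, World! 42"))
```

A single `findall` call does the scanning inside the regex engine, which is typically where the speedup over a character-by-character Python loop comes from.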

This seems like a reasonable place for them, given that they replicate Terrier's tokenizer behaviors, which is not necessarily what we'd suggest in general. (For example, the UTFTokeniser skips everything...

Fixed with #272, sorry for the delay!

Awesome, thanks! I'll take a look at it tomorrow and see if I can tick some of the other tasks :)

Hey, can you try out the revision? I was getting errors running the tests before, so I refactored a bit. Now it's using the classes from v2 where possible.

Digging into this...

A few options off the top of my head:
- Option A: Constructor `TerrierIndex(path, in_memory=True)`
  - Downside 1: Not great when doing stuff like from_hf()
  - Downside 2: It's not...

Right, since the retriever takes an index reference instead. Sounds reasonable. Option A could work with from_hf() if we let it take kwargs and pass them through. But that's a...

I'm open to it as long as we're careful not to swallow kwargs in the artifact's constructor, please!
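
A minimal sketch of what passing kwargs through without swallowing them could look like. The class `SketchIndex` is hypothetical and is not the real `TerrierIndex` or artifact API; the `from_hf()` body here only builds a placeholder path instead of downloading anything.

```python
class SketchIndex:
    """Hypothetical stand-in for an index artifact (not the real TerrierIndex)."""

    def __init__(self, path, *, in_memory=False):
        # Explicit keyword-only parameters and no **kwargs here, so an
        # unknown or misspelled option raises TypeError instead of being
        # silently ignored.
        self.path = path
        self.in_memory = in_memory

    @classmethod
    def from_hf(cls, repo_id, **kwargs):
        # Assumption: a real implementation would download the artifact;
        # this sketch only derives a placeholder local path.
        path = f"hf-cache/{repo_id}"
        # Forward every extra keyword argument unchanged to the constructor.
        return cls(path, **kwargs)


index = SketchIndex.from_hf("user/repo", in_memory=True)
print(index.in_memory)
```

Because only the factory method accepts `**kwargs`, a typo such as `in_mem=True` fails loudly at construction time rather than being dropped.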