J Mackenzie
**Dataset Information:** A large (40M document) news corpus derived from CCNews, with associated query variations (UQVs). Currently, the corpus is best used for efficiency work, since there are no "official"...
Disk-streamed indexing can result in non-deterministic indexes if there is insufficient /tmp storage
**Describe the bug** As the title suggests, I had some issues with segfaults upon mmapping an index. It turned out that the index was being built (but incorrectly) because there...
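One way to guard against this class of failure is to check free scratch space before indexing starts. The sketch below is illustrative only and is not PISA code: the helper name `has_enough_scratch_space` and the `required_bytes` estimate are assumptions, and the scratch directory defaults to whatever the system temp directory is.

```python
import shutil
import tempfile


def has_enough_scratch_space(required_bytes, scratch_dir=None):
    """Return True if scratch_dir has at least required_bytes free.

    Hypothetical helper: required_bytes is an estimate supplied by the
    caller, not something derived from the indexer itself.
    """
    # Fall back to the system temp directory (often /tmp on Linux).
    scratch_dir = scratch_dir or tempfile.gettempdir()
    free = shutil.disk_usage(scratch_dir).free
    return free >= required_bytes


if __name__ == "__main__":
    # Example: refuse to start if less than 1 GiB is available.
    if not has_enough_scratch_space(1 << 30):
        raise SystemExit("insufficient scratch space; aborting before indexing")
```

Failing early like this would turn a silently corrupt index into an explicit error, though it cannot account for space consumed by other processes while indexing runs.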
Currently, term weighting is handled within the `Cursors` classes. In particular, the `ScoredCursor` class stores the query term weight (the weight assigned to a term at query time, usually set...
This was introduced via #387. The problem is that the documentation for TBB states that passing a parameter `n` to `max_allowed_parallelism` will result in `n-1` worker threads operating: https://software.intel.com/en-us/node/589744...
Given recent feedback from HN, we should look at improving how we explain PISA, and offer some benchmarks to common systems like Lucene and Tantivy (perhaps). We also should document...
@mpetri, @amallia, and I have come across a weird bug where an input JsonVectorCollection will have its weights broken by long terms, possibly impacting downstream ranking. The specific bug is...
Currently, Anserini is used to generate CIFF files with the [CIFF](https://github.com/osirrc/ciff) repo. A number of other systems like Terrier, PISA, JASSv2, OldDog can read/index CIFF files. However, Anserini doesn't currently...
**Describe the solution you'd like** There is a bunch of prior work on splitting postings lists; this allows the high impact list to be traversed (more likely) than the low...
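As a rough illustration of the general idea of splitting a postings list by impact, the sketch below partitions one list into a high-impact and a low-impact part. It is not taken from PISA or from any particular paper: the `(doc_id, impact)` tuple layout, the `split_postings` name, and the fixed `threshold` are all assumptions made for the example.

```python
from typing import List, Tuple

# Assumed layout: (doc_id, impact), with the list sorted by doc_id.
Posting = Tuple[int, int]


def split_postings(
    postings: List[Posting], threshold: int
) -> Tuple[List[Posting], List[Posting]]:
    """Partition postings into (high, low) lists by impact.

    Postings with impact >= threshold go to the high list, the rest to
    the low list. Both outputs keep the original doc_id order.
    """
    high = [(doc, impact) for doc, impact in postings if impact >= threshold]
    low = [(doc, impact) for doc, impact in postings if impact < threshold]
    return high, low


if __name__ == "__main__":
    example = [(1, 3), (4, 9), (7, 2), (9, 8)]
    high, low = split_postings(example, threshold=5)
    print("high:", high)
    print("low:", low)
```

How the threshold is chosen, and how the two lists are then traversed at query time, is left open here.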
Hey all, I'm looking at the Efficiency Study paper and I'd like to replicate the query encoding numbers - could you please provide a pipeline or any other pointers so...
The bug at hand involved a file which had two or more newlines before any other textual data. The reason for the bug was that the tokenizer was not...