J Mackenzie

Results: 12 issues by J Mackenzie

**Dataset Information:** A large (40M-document) news corpus derived from CCNews, with associated user query variations (UQVs). Currently, the corpus is best used for efficiency work, since there are no "official"...

add-dataset

**Describe the bug** As the title suggests: I had some issues with segfaults upon mmapping an index. It turned out that the index was being built, but incorrectly, because there...

bug
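
The crash mode described above is easy to reproduce whenever a mapped index file is shorter than the reader expects. As a hedged illustration (this is not PISA's actual loading code; `map_index` and `min_size` are hypothetical names), the sketch below maps a file read-only and checks its size before anything is dereferenced:

```cpp
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical helper: map an index file read-only and verify it is at
// least large enough to hold its fixed-size header. A truncated or
// incorrectly built file would otherwise segfault on first access past
// the end of the mapping.
const std::uint8_t* map_index(const std::string& path, std::size_t min_size) {
    int fd = ::open(path.c_str(), O_RDONLY);
    if (fd < 0) {
        throw std::runtime_error("cannot open " + path);
    }
    struct stat st {};
    if (::fstat(fd, &st) != 0 || static_cast<std::size_t>(st.st_size) < min_size) {
        ::close(fd);
        throw std::runtime_error(path + " is smaller than the expected header");
    }
    void* addr = ::mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    ::close(fd);  // the mapping remains valid after the descriptor is closed
    if (addr == MAP_FAILED) {
        throw std::runtime_error("mmap failed for " + path);
    }
    return static_cast<const std::uint8_t*>(addr);
}
```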

Currently, term weighting is handled within the `Cursors` classes. In particular, the `ScoredCursor` class stores the query term weight (the weight assigned to a term at query time, usually set...

question
refactoring
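
As a rough illustration of the design being discussed, the sketch below shows a cursor wrapper that stores a query-time term weight and folds it into each score. `WeightedScoredCursor` and its members are hypothetical names, not PISA's actual `ScoredCursor` interface:

```cpp
#include <cstdint>
#include <functional>
#include <utility>

// Minimal sketch: a cursor that owns its query term weight and applies it
// on top of the term scorer, so callers see weighted scores directly.
template <typename Cursor>
class WeightedScoredCursor {
  public:
    WeightedScoredCursor(Cursor cursor,
                         std::function<float(std::uint32_t /*docid*/,
                                             std::uint32_t /*freq*/)> scorer,
                         float query_weight)
        : m_cursor(std::move(cursor)),
          m_scorer(std::move(scorer)),
          m_query_weight(query_weight) {}

    // Weight assigned to this term at query time
    // (e.g. > 1 for a term repeated in the query).
    [[nodiscard]] float query_weight() const { return m_query_weight; }

    // Score of the current posting, already scaled by the query weight.
    [[nodiscard]] float score() const {
        return m_query_weight * m_scorer(m_cursor.docid(), m_cursor.freq());
    }

  private:
    Cursor m_cursor;
    std::function<float(std::uint32_t, std::uint32_t)> m_scorer;
    float m_query_weight;
};
```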

This was introduced via #387. The problem is that the documentation for TBB states that passing a parameter `n` to `max_allowed_parallelism` will result in `n-1` worker threads operating: https://software.intel.com/en-us/node/589744...

bug
invalid
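
Assuming the off-by-one semantics described in the issue, where the parallelism limit counts the calling thread, a caller that wants `n` dedicated workers would need to pass `n + 1`. A minimal sketch using `tbb::global_control`, assuming that counting behavior applies:

```cpp
#include <tbb/global_control.h>
#include <tbb/parallel_for.h>

#include <cstddef>

// Sketch of the semantics described in the issue: a limit of `threads`
// yields at most `threads - 1` additional workers, because the calling
// thread counts against the limit. To run `threads` workers alongside
// the main thread, the limit must be `threads + 1`.
void run_with_threads(std::size_t threads) {
    tbb::global_control limit(tbb::global_control::max_allowed_parallelism,
                              threads + 1);
    tbb::parallel_for(std::size_t(0), std::size_t(1000), [](std::size_t i) {
        (void)i;  // ... per-element work ...
    });
}
```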

Given recent feedback from HN, we should look at improving how we explain PISA, and offer some benchmarks against common systems like Lucene and (perhaps) Tantivy. We also should document...

help wanted
wip
documentation

@mpetri, @amallia, and I have come across a weird bug where an input JsonVectorCollection will have its weights broken by long terms, possibly impacting downstream ranking. The specific bug is...

Currently, Anserini is used to generate CIFF files with the [CIFF](https://github.com/osirrc/ciff) repo. A number of other systems, such as Terrier, PISA, JASSv2, and OldDog, can read and index CIFF files. However, Anserini doesn't currently...

**Describe the solution you'd like** There is a bunch of prior work on splitting postings lists; this makes the high-impact list more likely to be traversed than the low...

enhancement
performance
wip
effort:medium
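
The general idea from that line of work can be sketched as partitioning each term's postings into a high-impact segment and a low-impact remainder, keeping each segment in docid order so standard traversal still works. The `Posting` struct, the threshold choice, and the `split_by_impact` helper below are hypothetical illustrations, not a specific published method:

```cpp
#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

struct Posting {
    std::uint32_t docid;
    float impact;  // precomputed term impact for this document
};

// Hypothetical illustration of list splitting: partition a term's postings
// into a high-impact segment and a low-impact remainder around a threshold.
// Query processing can then traverse the (short) high-impact segment in the
// common case and fall back to the tail only when needed. stable_partition
// preserves docid order within each segment, so intersection still works.
std::pair<std::vector<Posting>, std::vector<Posting>>
split_by_impact(std::vector<Posting> postings, float threshold) {
    auto mid = std::stable_partition(
        postings.begin(), postings.end(),
        [threshold](const Posting& p) { return p.impact >= threshold; });
    std::vector<Posting> high(postings.begin(), mid);
    std::vector<Posting> low(mid, postings.end());
    return {std::move(high), std::move(low)};
}
```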

Hey all, I'm looking at the Efficiency Study paper and I'd like to replicate the query encoding numbers. Could you please provide a pipeline or any other pointers so...

The bug at hand involved a file that had two or more newlines before any other textual data. The reason for the bug was that the tokenizer was not...
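
The excerpt is truncated, so the actual fix isn't shown; a plausible minimal sketch of the general remedy is to skip leading whitespace (including newlines) before handing input to the tokenizer. The helper below is hypothetical:

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical illustration of the failure mode: a tokenizer that treats
// the first character as the start of a token breaks when the input opens
// with blank lines. Skipping leading whitespace (including newlines)
// before tokenizing avoids producing an empty or malformed first token.
std::string_view skip_leading_blank_lines(std::string_view input) {
    std::size_t pos = input.find_first_not_of(" \t\r\n");
    if (pos == std::string_view::npos) {
        return {};  // the input was only whitespace
    }
    return input.substr(pos);
}
```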