J Mackenzie issues

Results: 12 issues of J Mackenzie

CC-News-En Support

**Dataset Information:** A large (40M document) news corpus derived from CCNews, with associated query variations (UQVs). Currently, the corpus is best used for efficiency work, since there are no "official"...

add-dataset

Disk-streamed indexing can result in non-deterministic indexes if there is insufficient /tmp storage

**Describe the bug** As the title suggests; I had some issues with segfaults upon mmapping an index. It turned out that the index was being built (but incorrectly) because there...

bug

Refactor query term weights

Currently, term weighting is handled within the `Cursors` classes. In particular, the `ScoredCursor` class stores the query term weight (the weight assigned to a term at query time, usually set...

question

refactoring

Single threaded execution requires "0" threads

This was introduced via #387 - The problem is that the documentation for TBB states that passing a parameter `n` to `max_allowed_parallelism` will result in `n-1` worker threads operating: https://software.intel.com/en-us/node/589744...

bug

invalid

Improving the understanding of what PISA is

Given recent feedback from HN, we should look at improving how we explain PISA, and offer some benchmarks to common systems like Lucene and Tantivy (perhaps). We also should document...

help wanted

wip

documentation

JsonVectorCollection weights are not obeyed for long terms

@mpetri, @amallia, and I have come across a weird bug where an input JsonVectorCollection will have its weights broken by long terms, possibly impacting downstream ranking. The specific bug is...

Support CIFF readbacks

Currently, Anserini is used to generate CIFF files with the [CIFF](https://github.com/osirrc/ciff) repo. A number of other systems like Terrier, PISA, JASSv2, OldDog can read/index CIFF files. However, Anserini doesn't currently...

List Splitting/Clipping

**Describe the solution you'd like** There is a bunch of prior work on splitting postings lists; this allows the high impact list to be traversed (more likely) than the low...

enhancement

performance

wip

effort:medium

Inference Experiments

Hey all, I'm looking at the Efficiency Study paper and I'd like to replicate the query encoding numbers - could you please provide a pipeline or any other pointers so...

Fixed bug regarding New lines in input files

The bug at hand involved a file which had 2 or more new lines before any other textual data. The reason for the bug was that the tokenizer was not...