anserini
JsonVectorCollection weights are not obeyed for long terms
@mpetri, @amallia, and I have come across a weird bug where an input JsonVectorCollection will have its weights broken by long terms, possibly impacting downstream ranking.
The specific bug arises from a combination of design choices:
- Anserini "clones" a term with a given weight `w` by repeating it `w` times in a pseudo document (pseudo-document generation), offloading the actual indexing to Lucene without tinkering with its internals.
- Inside Lucene, the default maximum term length is 255 chars (see https://lucene.apache.org/core/8_0_0/core/constant-values.html#org.apache.lucene.analysis.standard.StandardAnalyzer.DEFAULT_MAX_TOKEN_LENGTH).
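The pseudo-document trick above can be sketched roughly as follows. This is a simplification for illustration, not Anserini's actual code; the class and method names are made up:

```java
import java.util.Map;

public class PseudoDoc {
    // Expand a {term -> weight} vector into a whitespace-separated pseudo
    // document, so a plain Lucene analyzer ends up indexing each term
    // `weight` times (illustrative sketch, not Anserini's real implementation).
    static String generate(Map<String, Integer> vector) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, Integer> e : vector.entrySet()) {
            for (int i = 0; i < e.getValue(); i++) {
                sb.append(e.getKey()).append(' ');
            }
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // A weight of 3 becomes three copies of the term in the pseudo doc.
        System.out.println(generate(Map.of("hello", 3))); // hello hello hello
    }
}
```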
So, getting down to the messy bits.
Assume a term of 256 characters with a weight of 200 comes into your vector.
That term is split after its 255th character, leaving the final character dangling as a separate term, and that extra term can then corrupt the underlying impacts.
A toy example:
```json
{"id": "problem", "contents": "", "vector": {"aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaX" : 200, "X" : 200}}
```
This will result in an index with "X" having an impact of 400 (!!!!!) instead of 200.
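The collision can be reproduced in plain Java by mimicking the tokenizer's splitting behavior. This is a stdlib-only simulation of the effect (not Lucene's actual tokenizer code): the 256-char term splits into a 255-char prefix plus a dangling `"X"`, which collides with the genuine `"X"` term:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ImpactDemo {
    static final int MAX_TOKEN_LENGTH = 255; // Lucene's default

    // Mimic a tokenizer that splits any whitespace-delimited token longer
    // than the limit, emitting the overflow as additional tokens.
    static List<String> tokenize(String text) {
        List<String> out = new ArrayList<>();
        for (String tok : text.split("\\s+")) {
            for (int i = 0; i < tok.length(); i += MAX_TOKEN_LENGTH) {
                out.add(tok.substring(i, Math.min(tok.length(), i + MAX_TOKEN_LENGTH)));
            }
        }
        return out;
    }

    // Build the pseudo document for the toy example and count token impacts.
    static Map<String, Integer> toyImpacts() {
        String longTerm = "a".repeat(255) + "X"; // 256 chars: one over the limit
        StringBuilder doc = new StringBuilder();
        for (int i = 0; i < 200; i++) doc.append(longTerm).append(' '); // weight 200
        for (int i = 0; i < 200; i++) doc.append("X").append(' ');      // weight 200
        Map<String, Integer> impacts = new HashMap<>();
        for (String tok : tokenize(doc.toString())) impacts.merge(tok, 1, Integer::sum);
        return impacts;
    }

    public static void main(String[] args) {
        // The dangling "X" from each truncated copy collides with the real "X".
        System.out.println(toyImpacts().get("X")); // prints 400, not 200
    }
}
```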
Clearly this then flows on to downstream indexing/querying tasks.
One solution we found was overriding the default value of 255 in the constructor for the WhitespaceAnalyzer
(see https://github.com/castorini/anserini/blob/master/src/main/java/io/anserini/index/IndexCollection.java#L768). We set it to the maximum permissible value of 1048576,
which solves the problem.
wow, what an obscure bug!
How about we just drop all terms longer than 255 chars? They are unlikely to be meaningful anyway?
Could we log a warning when that happens? Either way, it would at least be better than silently mutating those terms, I think...
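In Lucene this could presumably be done with a length filter in the analysis chain; as a stdlib-only sketch of the suggested behavior (drop over-length terms and log each one, rather than splitting silently; class and method names are hypothetical):

```java
import java.util.Map;
import java.util.stream.Collectors;

public class DropLongTerms {
    static final int MAX_TOKEN_LENGTH = 255; // Lucene's default limit

    // Drop over-length terms instead of silently splitting them,
    // logging each dropped term so the data problem stays visible.
    static Map<String, Integer> filter(Map<String, Integer> vector) {
        return vector.entrySet().stream()
            .peek(e -> {
                if (e.getKey().length() > MAX_TOKEN_LENGTH) {
                    System.err.println("WARN: dropping term of length " + e.getKey().length());
                }
            })
            .filter(e -> e.getKey().length() <= MAX_TOKEN_LENGTH)
            .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue));
    }

    public static void main(String[] args) {
        // The 256-char term is dropped (with a warning); "X" keeps its true weight.
        Map<String, Integer> v = Map.of("a".repeat(256), 200, "X", 200);
        System.out.println(filter(v)); // {X=200}
    }
}
```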
sure!
Issue noted and PR welcome - but this is lowish on our priority list to fix...