anserini
JsonVectorCollection weights are not obeyed for long terms
@mpetri, @amallia, and I have come across a weird bug where an input JsonVectorCollection will have its weights broken by long terms, possibly impacting downstream ranking.
The specific bug arises from a combination of design choices:
- Anserini "clones" a term with a given weight `w` by repeating it `w` times in a pseudo document (pseudo-document generation), offloading the actual indexing to Lucene without tinkering with its internals.
- Inside Lucene, the default maximum term length is 255 chars (see https://lucene.apache.org/core/8_0_0/core/constant-values.html#org.apache.lucene.analysis.standard.StandardAnalyzer.DEFAULT_MAX_TOKEN_LENGTH).
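The pseudo-document trick above can be sketched roughly as follows. This is a simplification for illustration, not Anserini's actual code; the class and method names are made up:

```java
import java.util.Map;

public class PseudoDoc {
    // Expand a {term -> weight} vector into a whitespace-separated pseudo
    // document, so a plain Lucene analyzer ends up indexing each term
    // `weight` times (illustrative sketch, not Anserini's real implementation).
    static String generate(Map<String, Integer> vector) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, Integer> e : vector.entrySet()) {
            for (int i = 0; i < e.getValue(); i++) {
                sb.append(e.getKey()).append(' ');
            }
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // A weight of 3 becomes three copies of the term in the pseudo doc.
        System.out.println(generate(Map.of("hello", 3))); // hello hello hello
    }
}
```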
So, getting down to the messy bits.
Assume a term of 256 characters with a weight of 200 comes into your vector.
That term is split after its 255th character, leaving the final character dangling as a separate term, and that extra term can then corrupt the underlying impacts.
A toy example:
```json
{"id": "problem", "contents": "", "vector": {"aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaX" : 200, "X" : 200}}
```
This will result in an index with "X" having an impact of 400 (!!!!!) instead of 200.
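The collision can be reproduced in plain Java by mimicking the tokenizer's splitting behavior. This is a stdlib-only simulation of the effect (not Lucene's actual tokenizer code): the 256-char term splits into a 255-char prefix plus a dangling `"X"`, which collides with the genuine `"X"` term:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ImpactDemo {
    static final int MAX_TOKEN_LENGTH = 255; // Lucene's default

    // Mimic a tokenizer that splits any whitespace-delimited token longer
    // than the limit, emitting the overflow as additional tokens.
    static List<String> tokenize(String text) {
        List<String> out = new ArrayList<>();
        for (String tok : text.split("\\s+")) {
            for (int i = 0; i < tok.length(); i += MAX_TOKEN_LENGTH) {
                out.add(tok.substring(i, Math.min(tok.length(), i + MAX_TOKEN_LENGTH)));
            }
        }
        return out;
    }

    // Build the pseudo document for the toy example and count token impacts.
    static Map<String, Integer> toyImpacts() {
        String longTerm = "a".repeat(255) + "X"; // 256 chars: one over the limit
        StringBuilder doc = new StringBuilder();
        for (int i = 0; i < 200; i++) doc.append(longTerm).append(' '); // weight 200
        for (int i = 0; i < 200; i++) doc.append("X").append(' ');      // weight 200
        Map<String, Integer> impacts = new HashMap<>();
        for (String tok : tokenize(doc.toString())) impacts.merge(tok, 1, Integer::sum);
        return impacts;
    }

    public static void main(String[] args) {
        // The dangling "X" from each truncated copy collides with the real "X".
        System.out.println(toyImpacts().get("X")); // prints 400, not 200
    }
}
```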
Clearly this then flows on to downstream indexing/querying tasks.
One solution we found was overriding the default value of 255 in the constructor for the WhitespaceAnalyzer
(see https://github.com/castorini/anserini/blob/master/src/main/java/io/anserini/index/IndexCollection.java#L768). We set it to the maximum permissible value of 1048576,
which solves the problem.
wow, what an obscure bug!
How about we just drop all terms longer than 255 chars? They are unlikely to be meaningful anyway?
Could we log a warning when that happens? Either way, it would at least be better than silently mutating those terms, I think...
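In Lucene this could presumably be done with a length filter in the analysis chain; as a stdlib-only sketch of the suggested behavior (drop over-length terms and log each one, rather than splitting silently; class and method names are hypothetical):

```java
import java.util.Map;
import java.util.stream.Collectors;

public class DropLongTerms {
    static final int MAX_TOKEN_LENGTH = 255; // Lucene's default limit

    // Drop over-length terms instead of silently splitting them,
    // logging each dropped term so the data problem stays visible.
    static Map<String, Integer> filter(Map<String, Integer> vector) {
        return vector.entrySet().stream()
            .peek(e -> {
                if (e.getKey().length() > MAX_TOKEN_LENGTH) {
                    System.err.println("WARN: dropping term of length " + e.getKey().length());
                }
            })
            .filter(e -> e.getKey().length() <= MAX_TOKEN_LENGTH)
            .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue));
    }

    public static void main(String[] args) {
        // The 256-char term is dropped (with a warning); "X" keeps its true weight.
        Map<String, Integer> v = Map.of("a".repeat(256), 200, "X", 200);
        System.out.println(filter(v)); // {X=200}
    }
}
```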
sure!
Issue noted and PR welcome - but this is lowish on our priority list to fix...