fastembed
fastembed copied to clipboard
Token ID computation in BM25/BM42: Possible id overlap due to absolute value casting of hash
Found thus only cause I was curious how fastembed calculates token ids for bm25/bm42, so I took a deep dive:
Problem
The function compute_token_id computes the absolute value of a signed 32 bit integer returned by the mmh3 hash lib.
@classmethod
def compute_token_id(cls, token: str) -> int:
return abs(mmh3.hash(token))
This could lead to token id overlap, e.g. if hash of "black" is -42 and "white" is 42 (this is exaggerated of course), the token id would be 42 for both of them.
Is this expexted behaviour?
Proposed Solution
The hash call could also provide an unsigned 32 bit int (casted to a positive 64bit int, which would be no problem for qdrant since its 64 bit native). This could be achieved by specifying the following extra arguments:
@classmethod
def compute_token_id(cls, token: str) -> int:
return mmh3.hash(token, seed=0, signed=False)
I'm happy to provide a PR if you would like this behaviour to be changed.
Hi @freinold ,
Thank you for highlighting this issue! We've decided to take a bit more sophisticated approach than the suggested, and it requires changes in the core. I'll post here a link to the issue/pr in the core once it is available