fastembed Token ID computation in BM25/BM42: Possible id overlap due to absolute value casting of hash

Token ID computation in BM25/BM42: Possible id overlap due to absolute value casting of hash

Open freinold opened this issue 1 year ago • 1 comments

Found thus only cause I was curious how fastembed calculates token ids for bm25/bm42, so I took a deep dive:

Problem

The function compute_token_id computes the absolute value of a signed 32 bit integer returned by the mmh3 hash lib.

    @classmethod
    def compute_token_id(cls, token: str) -> int:
        return abs(mmh3.hash(token))

This could lead to token id overlap, e.g. if hash of "black" is -42 and "white" is 42 (this is exaggerated of course), the token id would be 42 for both of them.

Is this expexted behaviour?

Proposed Solution

The hash call could also provide an unsigned 32 bit int (casted to a positive 64bit int, which would be no problem for qdrant since its 64 bit native). This could be achieved by specifying the following extra arguments:

    @classmethod
    def compute_token_id(cls, token: str) -> int:
        return mmh3.hash(token, seed=0, signed=False)

I'm happy to provide a PR if you would like this behaviour to be changed.

Oct 17 '24 18:10 freinold

Hi @freinold ,

Thank you for highlighting this issue! We've decided to take a bit more sophisticated approach than the suggested, and it requires changes in the core. I'll post here a link to the issue/pr in the core once it is available

Oct 18 '24 14:10 joein

fastembed fastembed copied to clipboard

Token ID computation in BM25/BM42: Possible id overlap due to absolute value casting of hash

Problem

Proposed Solution

fastembed
fastembed copied to clipboard