Byte-fallback tokens are not detokenized properly
Hello,
I produced a French-Chinese model a few months ago and noticed that byte fallback was yielding strange triplets of byte-fallback tokens. Later, while debugging both Chinese and Japanese, I looked into it and realized these triplets are simply the usual UTF-8 byte encoding of CJK characters.
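To make the triplets concrete, here is a small sketch (not argos-translate code) showing why one CJK character becomes three byte-fallback tokens: CJK characters occupy three bytes in UTF-8, and SentencePiece's byte fallback emits one `<0xNN>` token per byte.

```python
# One CJK character -> three UTF-8 bytes -> three byte-fallback tokens.
text = "中"
byte_tokens = [f"<0x{b:02X}>" for b in text.encode("utf-8")]
print(byte_tokens)  # ['<0xE4>', '<0xB8>', '<0xAD>']
```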
Except that escaped UTF-8 bytes are usually written with \u or \x prefixes, not with <0x as the SentencePiece byte-fallback tokens are. So a) I tried tweaking the shared_vocabulary within the affected packages, which did no good at encoding time... and b) I remembered a former commit around the decoder: https://github.com/argosopentech/argos-translate/commit/543c50e7e467da990a27970f4df64ca52720bffe
It turns out that if I revert to the former version, out-of-vocabulary characters are decoded correctly.
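For reference, the folding that the SentencePiece decoder performs can be sketched like this. This is a hypothetical helper for illustration only, not the argos-translate or sentencepiece API: it collapses consecutive `<0xNN>` tokens back into UTF-8 characters and passes every other piece through unchanged.

```python
import re

def fold_byte_fallback(tokens):
    """Collapse SentencePiece byte-fallback tokens (<0xNN>) back into
    UTF-8 characters; other pieces pass through unchanged.
    Hypothetical helper, for illustration only."""
    out, pending = [], bytearray()
    for tok in tokens:
        m = re.fullmatch(r"<0x([0-9A-Fa-f]{2})>", tok)
        if m:
            # Accumulate raw bytes until the multi-byte sequence is complete.
            pending.append(int(m.group(1), 16))
        else:
            if pending:
                out.append(pending.decode("utf-8", errors="replace"))
                pending.clear()
            out.append(tok)
    if pending:
        out.append(pending.decode("utf-8", errors="replace"))
    return "".join(out)

print(fold_byte_fallback(["▁Hello", "<0xE4>", "<0xB8>", "<0xAD>"]))
# The three byte tokens fold back into a single CJK character.
```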
As for the remaining underscores, I have tried
```python
def decode(self, tokens: List[str]) -> str:
    # detokenized = "".join(tokens)
    # return detokenized.replace("▁", " ")
    return self.lazy_processor().decode_pieces(tokens).replace("_", " ")
```
and it has not misbehaved so far.