Byte-fallback tokens are not detokenized properly
Hello,
I produced a French-Chinese model a few months ago and noticed that byte fallback was yielding strange triplets of byte-fallback tokens. Later, while debugging both Chinese and Japanese, I looked into it and realized these triplets are simply the usual UTF-8 byte encoding of CJK characters.
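To make the triplets concrete, here is a small sketch (not argos-translate code) showing why one CJK character becomes three byte-fallback tokens: CJK characters occupy three bytes in UTF-8, and SentencePiece's byte fallback emits one `<0xNN>` token per byte.

```python
# One CJK character -> three UTF-8 bytes -> three byte-fallback tokens.
text = "中"
byte_tokens = [f"<0x{b:02X}>" for b in text.encode("utf-8")]
print(byte_tokens)  # ['<0xE4>', '<0xB8>', '<0xAD>']
```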
Except that escaped UTF-8 bytes are usually written with \u or \x prefixes, not with <0x as the SentencePiece byte-fallback tokens are. So a) I tried tweaking the shared_vocabulary within the affected packages, which did no good at encoding time... and b) I remembered a former commit around the decoder: https://github.com/argosopentech/argos-translate/commit/543c50e7e467da990a27970f4df64ca52720bffe
It turns out that if I revert to the former version, out-of-vocabulary characters are decoded correctly.
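For reference, the folding that the SentencePiece decoder performs can be sketched like this. This is a hypothetical helper for illustration only, not the argos-translate or sentencepiece API: it collapses consecutive `<0xNN>` tokens back into UTF-8 characters and passes every other piece through unchanged.

```python
import re

def fold_byte_fallback(tokens):
    """Collapse SentencePiece byte-fallback tokens (<0xNN>) back into
    UTF-8 characters; other pieces pass through unchanged.
    Hypothetical helper, for illustration only."""
    out, pending = [], bytearray()
    for tok in tokens:
        m = re.fullmatch(r"<0x([0-9A-Fa-f]{2})>", tok)
        if m:
            # Accumulate raw bytes until the multi-byte sequence is complete.
            pending.append(int(m.group(1), 16))
        else:
            if pending:
                out.append(pending.decode("utf-8", errors="replace"))
                pending.clear()
            out.append(tok)
    if pending:
        out.append(pending.decode("utf-8", errors="replace"))
    return "".join(out)

print(fold_byte_fallback(["▁Hello", "<0xE4>", "<0xB8>", "<0xAD>"]))
# The three byte tokens fold back into a single CJK character.
```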
As for the remaining underscores, I have tried
```python
def decode(self, tokens: List[str]) -> str:
    # detokenized = "".join(tokens)
    # return detokenized.replace("▁", " ")
    return self.lazy_processor().decode_pieces(tokens).replace("_", " ")
```
and it has not misbehaved so far.