huggingface-tokenizer-in-cxx

RE2 does not support look-ahead

Open wangkuiyi opened this issue 2 years ago • 0 comments

So, the C++ tokenizer produces slightly different output from the HuggingFace tokenizer when the input text contains two or more consecutive whitespace characters.

cmake --build /tmp/b
/tmp/b/bin/bpe_test > /tmp/c
python tool/t.py > /tmp/t
python tool/cmp.py /tmp/c /tmp/t /tmp/sample.txt
[Screenshot 2023-02-10 at 12:18:02 PM]
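
For context, the GPT-2 pre-tokenization pattern used on the HuggingFace side contains the alternative `\s+(?!\S)`, which leaves the last whitespace character of a run to be attached to the following word. RE2 rejects `(?!...)`, so one possible workaround is to emulate that single look-ahead by hand. The sketch below is only an illustration under stated assumptions: it works on a simplified pattern, ` ?\S+|\s+(?!\S)|\s+`, handles ASCII whitespace only, and the names `IsSpace` and `SplitLikeLookahead` are hypothetical and not taken from this repository.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: true if c is an ASCII whitespace character.
static bool IsSpace(char c) {
  return std::isspace(static_cast<unsigned char>(c)) != 0;
}

// Hypothetical sketch: splits text the way the simplified pattern
//   " ?\S+|\s+(?!\S)|\s+"
// would, without relying on look-ahead support in the regex engine.
std::vector<std::string> SplitLikeLookahead(const std::string& text) {
  std::vector<std::string> out;
  const std::size_t n = text.size();
  std::size_t i = 0;
  while (i < n) {
    // Alternative 1: an optional single space followed by non-whitespace.
    std::size_t j = i;
    if (text[j] == ' ') ++j;
    if (j < n && !IsSpace(text[j])) {
      while (j < n && !IsSpace(text[j])) ++j;
      out.push_back(text.substr(i, j - i));
      i = j;
      continue;
    }
    // Alternatives 2 and 3: text[i] is whitespace here. Take the whole run.
    j = i;
    while (j < n && IsSpace(text[j])) ++j;
    // Emulate (?!\S): if a non-whitespace character follows and the run has
    // more than one character, give the last one back so that it can attach
    // to the next word. A run of length one falls through to plain \s+.
    if (j < n && j - i > 1) --j;
    out.push_back(text.substr(i, j - i));
    i = j;
  }
  return out;
}
```

With this sketch, the input `"a  b"` (two spaces) is split into `"a"`, `" "`, and `" b"`, whereas a plain `\s+` without the look-ahead would keep both spaces together in one piece.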

wangkuiyi, Feb 10 '23 20:02