huggingface-tokenizer-in-cxx

Results: 9 huggingface-tokenizer-in-cxx issues

Hello @wangkuiyi , it seems this tokenizer only supports one special token "". Does it support other additional special tokens? For instance, the ones we added in special_tokens_map.json, like `"",...
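
For reference, HuggingFace's special_tokens_map.json lists extra tokens under the `additional_special_tokens` key; the token strings below are made-up placeholders for illustration, not values taken from this issue:

```json
{
  "eos_token": "<|endoftext|>",
  "additional_special_tokens": ["<|user|>", "<|assistant|>"]
}
```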

Hello, it seems this project does not support Chinese word segmentation? What is the relationship between merges.txt and vocab.txt? I have a vocab.txt file that contains Chinese, but no corresponding merges.txt.
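
For context, a minimal sketch of why merges.txt is needed in addition to vocab.txt: vocab.txt only maps finished tokens to ids, while merges.txt holds the ranked merge rules that BPE applies to split a word into those tokens in the first place, so a vocab file alone is not enough to tokenize. The code below is a stand-alone illustration, not this repository's API; the `bpe` helper and the toy rules are assumptions for demonstration.

```cpp
#include <climits>
#include <iostream>
#include <map>
#include <string>
#include <utility>
#include <vector>

// rank[{left, right}] is the priority of a merge rule from merges.txt;
// a lower rank means the rule appears earlier in the file and is applied first.
using Ranks = std::map<std::pair<std::string, std::string>, int>;

std::vector<std::string> bpe(const std::string& word, const Ranks& rank) {
  // Start from single characters (the real tokenizer starts from bytes).
  std::vector<std::string> pieces;
  for (char c : word) pieces.emplace_back(1, c);

  while (pieces.size() > 1) {
    // Find the adjacent pair with the lowest (best) merge rank.
    int best = -1, best_rank = INT_MAX;
    for (int i = 0; i + 1 < static_cast<int>(pieces.size()); ++i) {
      auto it = rank.find({pieces[i], pieces[i + 1]});
      if (it != rank.end() && it->second < best_rank) {
        best = i;
        best_rank = it->second;
      }
    }
    if (best < 0) break;  // no rule from merges.txt applies anymore
    pieces[best] += pieces[best + 1];
    pieces.erase(pieces.begin() + best + 1);
  }
  return pieces;  // these pieces are then looked up in vocab.txt to get ids
}

int main() {
  // Two toy rules as they would appear in merges.txt: "h e" first, then "he y".
  Ranks rank{{{"h", "e"}, 0}, {{"he", "y"}, 1}};
  for (const std::string& p : bpe("hey", rank)) std::cout << p << " ";
  std::cout << "\n";  // prints: hey
}
```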

It seems it cannot support appending characters?

Hello, does your project support a decoder for byte-level BPE?
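
A byte-level BPE decoder mostly amounts to inverting GPT-2's byte-to-unicode table and concatenating the recovered bytes. The sketch below is not based on this repository's code; it is a stand-alone illustration of that idea.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <iostream>
#include <map>
#include <string>
#include <vector>

// Build the inverse of GPT-2's bytes_to_unicode table: code point -> byte.
std::map<uint32_t, uint8_t> unicode_to_bytes() {
  std::vector<int> bs;
  for (int b = '!'; b <= '~'; ++b) bs.push_back(b);
  for (int b = 0xA1; b <= 0xAC; ++b) bs.push_back(b);
  for (int b = 0xAE; b <= 0xFF; ++b) bs.push_back(b);
  std::vector<int> cs = bs;
  int n = 0;
  for (int b = 0; b < 256; ++b) {
    if (std::find(bs.begin(), bs.end(), b) == bs.end()) {
      bs.push_back(b);
      cs.push_back(256 + n++);  // unprintable bytes get shifted code points
    }
  }
  std::map<uint32_t, uint8_t> table;
  for (std::size_t i = 0; i < bs.size(); ++i)
    table[cs[i]] = static_cast<uint8_t>(bs[i]);
  return table;
}

// Decode a concatenation of token strings back to raw UTF-8 bytes.
std::string decode(const std::string& joined_tokens) {
  static const auto table = unicode_to_bytes();
  std::string out;
  std::size_t i = 0;
  while (i < joined_tokens.size()) {
    // Decode one UTF-8 code point; the byte-level alphabet only needs
    // 1- or 2-byte sequences (all code points are below U+0800).
    uint32_t cp;
    unsigned char c = joined_tokens[i];
    if (c < 0x80) { cp = c; i += 1; }
    else { cp = ((c & 0x1F) << 6) | (joined_tokens[i + 1] & 0x3F); i += 2; }
    out.push_back(static_cast<char>(table.at(cp)));
  }
  return out;
}

int main() {
  // "Ġworld" is how GPT-2's byte-level BPE spells " world".
  std::cout << decode("HelloĠworld") << "\n";  // prints: Hello world
}
```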

Good work~ But I ran some tests and found this C++ implementation seems to be slow: fewer than 10 tokens per millisecond. Any more tests or findings?
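
For anyone reproducing the measurement, a minimal std::chrono timing harness looks like the sketch below. The `encode` function here is only a whitespace-splitting stand-in, since the repository's real entry point is not quoted in this issue; swap in the actual BPE call when benchmarking.

```cpp
#include <chrono>
#include <cstddef>
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

// Placeholder tokenizer: splits on whitespace. Replace with the real BPE call.
std::vector<std::string> encode(const std::string& text) {
  std::istringstream in(text);
  std::vector<std::string> tokens;
  std::string tok;
  while (in >> tok) tokens.push_back(tok);
  return tokens;
}

int main() {
  std::string text;
  for (int i = 0; i < 1000; ++i) text += "a short synthetic benchmark sentence ";

  const int repeats = 100;
  std::size_t total_tokens = 0;
  auto start = std::chrono::steady_clock::now();
  for (int i = 0; i < repeats; ++i) total_tokens += encode(text).size();
  double ms = std::chrono::duration<double, std::milli>(
                  std::chrono::steady_clock::now() - start)
                  .count();

  std::cout << total_tokens / ms << " tokens per millisecond\n";
}
```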

So, the C++ tokenizer generates slightly different output from the HuggingFace tokenizer when the input text contains more than one consecutive whitespace character.

```bash
cmake --build /tmp/b
/tmp/b/bin/bpe_test...
```
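
One plausible explanation (an assumption on my part, not confirmed in the issue): HuggingFace's GPT-2-style pre-tokenizer splits text with the regex below, whose `\s+(?!\S)` branch keeps the last space of a whitespace run attached to the following word, so a C++ port that splits runs of spaces more naively will diverge exactly on inputs with consecutive whitespace.

```cpp
// GPT-2 / byte-level BPE pre-tokenization pattern as published by OpenAI and
// used by HuggingFace. Shown for reference only; std::regex cannot execute it
// as-is because it lacks \p{L} / \p{N} support.
const char* kGpt2SplitPattern =
    R"('s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+)";
```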

Hello, this is an excellent implementation of BPE in C++ - would you be so kind as to add a LICENSE file? Love your work! Thank you, Brian

```
huggingface-tokenizer-in-cxx/tokenizer/bpe_test.cc:8:3: error: use of undeclared identifier 'assert'
```

Adding:

```cpp
#include <cassert>
```

in bpe.h fixes it. Not sure if that's the right fix or not. FYI.