huggingface-tokenizer-in-cxx

Results: 9 huggingface-tokenizer-in-cxx issues

Hello @wangkuiyi , it seems this tokenizer only supports one special token "". Does it support other additional special tokens? For instance, the ones we added in special_tokens_map.json, like `"",...
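
For reference, HuggingFace's special_tokens_map.json lists extra tokens under the `additional_special_tokens` key; the token strings below are made-up placeholders for illustration, not values taken from this issue:

```json
{
  "eos_token": "<|endoftext|>",
  "additional_special_tokens": ["<|user|>", "<|assistant|>"]
}
```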

Hello, it seems this project does not support Chinese word segmentation? What is the relationship between merges.txt and vocab.txt? I have a vocab.txt file that contains Chinese, but no corresponding merges.txt.
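
For context, a minimal sketch of why merges.txt is needed in addition to vocab.txt: vocab.txt only maps finished tokens to ids, while merges.txt holds the ranked merge rules that BPE applies to split a word into those tokens in the first place, so a vocab file alone is not enough to tokenize. The code below is a stand-alone illustration, not this repository's API; the `bpe` helper and the toy rules are assumptions for demonstration.

```cpp
#include <climits>
#include <iostream>
#include <map>
#include <string>
#include <utility>
#include <vector>

// rank[{left, right}] is the priority of a merge rule from merges.txt;
// a lower rank means the rule appears earlier in the file and is applied first.
using Ranks = std::map<std::pair<std::string, std::string>, int>;

std::vector<std::string> bpe(const std::string& word, const Ranks& rank) {
  // Start from single characters (the real tokenizer starts from bytes).
  std::vector<std::string> pieces;
  for (char c : word) pieces.emplace_back(1, c);

  while (pieces.size() > 1) {
    // Find the adjacent pair with the lowest (best) merge rank.
    int best = -1, best_rank = INT_MAX;
    for (int i = 0; i + 1 < static_cast<int>(pieces.size()); ++i) {
      auto it = rank.find({pieces[i], pieces[i + 1]});
      if (it != rank.end() && it->second < best_rank) {
        best = i;
        best_rank = it->second;
      }
    }
    if (best < 0) break;  // no rule from merges.txt applies anymore
    pieces[best] += pieces[best + 1];
    pieces.erase(pieces.begin() + best + 1);
  }
  return pieces;  // these pieces are then looked up in vocab.txt to get ids
}

int main() {
  // Two toy rules as they would appear in merges.txt: "h e" first, then "he y".
  Ranks rank{{{"h", "e"}, 0}, {{"he", "y"}, 1}};
  for (const std::string& p : bpe("hey", rank)) std::cout << p << " ";
  std::cout << "\n";  // prints: hey
}
```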

It seems it cannot support appending characters?

Hello, does your project support a decoder for byte-level BPE?
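
A byte-level BPE decoder mostly amounts to inverting GPT-2's byte-to-unicode table and concatenating the recovered bytes. The sketch below is not based on this repository's code; it is a stand-alone illustration of that idea.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <iostream>
#include <map>
#include <string>
#include <vector>

// Build the inverse of GPT-2's bytes_to_unicode table: code point -> byte.
std::map<uint32_t, uint8_t> unicode_to_bytes() {
  std::vector<int> bs;
  for (int b = '!'; b <= '~'; ++b) bs.push_back(b);
  for (int b = 0xA1; b <= 0xAC; ++b) bs.push_back(b);
  for (int b = 0xAE; b <= 0xFF; ++b) bs.push_back(b);
  std::vector<int> cs = bs;
  int n = 0;
  for (int b = 0; b < 256; ++b) {
    if (std::find(bs.begin(), bs.end(), b) == bs.end()) {
      bs.push_back(b);
      cs.push_back(256 + n++);  // unprintable bytes get shifted code points
    }
  }
  std::map<uint32_t, uint8_t> table;
  for (std::size_t i = 0; i < bs.size(); ++i)
    table[cs[i]] = static_cast<uint8_t>(bs[i]);
  return table;
}

// Decode a concatenation of token strings back to raw UTF-8 bytes.
std::string decode(const std::string& joined_tokens) {
  static const auto table = unicode_to_bytes();
  std::string out;
  std::size_t i = 0;
  while (i < joined_tokens.size()) {
    // Decode one UTF-8 code point; the byte-level alphabet only needs
    // 1- or 2-byte sequences (all code points are below U+0800).
    uint32_t cp;
    unsigned char c = joined_tokens[i];
    if (c < 0x80) { cp = c; i += 1; }
    else { cp = ((c & 0x1F) << 6) | (joined_tokens[i + 1] & 0x3F); i += 2; }
    out.push_back(static_cast<char>(table.at(cp)));
  }
  return out;
}

int main() {
  // "Ġworld" is how GPT-2's byte-level BPE spells " world".
  std::cout << decode("HelloĠworld") << "\n";  // prints: Hello world
}
```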

Good work~ But I ran some tests and found this C++ implementation seems to be slow: fewer than 10 tokens per millisecond. Any more tests or findings?
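
For anyone reproducing the measurement, a minimal std::chrono timing harness looks like the sketch below. The `encode` function here is only a whitespace-splitting stand-in, since the repository's real entry point is not quoted in this issue; swap in the actual BPE call when benchmarking.

```cpp
#include <chrono>
#include <cstddef>
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

// Placeholder tokenizer: splits on whitespace. Replace with the real BPE call.
std::vector<std::string> encode(const std::string& text) {
  std::istringstream in(text);
  std::vector<std::string> tokens;
  std::string tok;
  while (in >> tok) tokens.push_back(tok);
  return tokens;
}

int main() {
  std::string text;
  for (int i = 0; i < 1000; ++i) text += "a short synthetic benchmark sentence ";

  const int repeats = 100;
  std::size_t total_tokens = 0;
  auto start = std::chrono::steady_clock::now();
  for (int i = 0; i < repeats; ++i) total_tokens += encode(text).size();
  double ms = std::chrono::duration<double, std::milli>(
                  std::chrono::steady_clock::now() - start)
                  .count();

  std::cout << total_tokens / ms << " tokens per millisecond\n";
}
```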

So, the C++ tokenizer generates slightly different output from the HuggingFace tokenizer when the input text contains more than one consecutive whitespace character.

```bash
cmake --build /tmp/b
/tmp/b/bin/bpe_test...
```
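
One plausible explanation (an assumption on my part, not confirmed in the issue): HuggingFace's GPT-2-style pre-tokenizer splits text with the regex below, whose `\s+(?!\S)` branch keeps the last space of a whitespace run attached to the following word, so a C++ port that splits runs of spaces more naively will diverge exactly on inputs with consecutive whitespace.

```cpp
// GPT-2 / byte-level BPE pre-tokenization pattern as published by OpenAI and
// used by HuggingFace. Shown for reference only; std::regex cannot execute it
// as-is because it lacks \p{L} / \p{N} support.
const char* kGpt2SplitPattern =
    R"('s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+)";
```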

Hello, this is an excellent implementation of BPE in C++ - would you be so kind as to add a LICENSE file? Love your work! Thank you, Brian

```
huggingface-tokenizer-in-cxx/tokenizer/bpe_test.cc:8:3: error: use of undeclared identifier 'assert'
```

Adding:

```cpp
#include <cassert>
```

in bpe.h fixes it. Not sure if that's the right fix or not. FYI.