icetk
A unified tokenization tool for Images, Chinese and English.
Currently, `sentencepiece_model_pb2.py` is generated with an old version of protobuf, which is incompatible with many libraries that require a newer protobuf. We can re-generate the `sentencepiece_model_pb2.py` with the high version of...
The pinned protobuf version seems too low and conflicts with other Python frameworks such as mlflow. Could you upgrade the protobuf dependency?
When I installed icetk with `pip install icetk`, the installed version was 0.0.5. But when I go back to this code repo, I cannot find...
The tokenizer can't be hashed when using the `datasets.map` function with `num_proc > 1`. https://github.com/THUDM/ChatGLM-6B/issues/286
Does icetk have a C++ implementation?
```py
tokens = icetk.encode('你好世界!这里是 icetk。')
for token in tokens:
    print(token, icetk.text_tokenizer.proto.pieces[token - 20000].piece)
```
```
20005 ▁
94874 你好
84097 世界
20035 !
94947 这里是
22881 ▁ice
35955 tk
83823...
```
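The `token - 20000` lookup in the snippet above suggests that text-token IDs are offset by 20000 from their index in the text vocabulary. The following is a minimal, self-contained sketch of that mapping, not icetk's actual implementation: `TOY_PIECES`, `piece_for`, and `join_pieces` are hypothetical names, the table contains only the ID/piece pairs printed above, and the handling of `▁` as a space marker follows the usual SentencePiece convention, which is assumed here rather than confirmed for icetk.

```python
# Offset taken from the snippet above: piece index = token ID - 20000.
TEXT_ID_OFFSET = 20000

# Hypothetical toy table holding only the pairs printed in the output above,
# keyed by piece index (token ID minus the offset). Not the real vocabulary.
TOY_PIECES = {
    5: "▁",
    74874: "你好",
    64097: "世界",
    35: "!",
    74947: "这里是",
    2881: "▁ice",
    15955: "tk",
}


def piece_for(token_id: int) -> str:
    """Return the text piece for a token ID, using the toy table."""
    if token_id < TEXT_ID_OFFSET:
        # Assumption of this sketch: IDs below the offset are not text tokens.
        raise ValueError(f"{token_id} is below the text-token offset")
    return TOY_PIECES[token_id - TEXT_ID_OFFSET]


def join_pieces(token_ids):
    """Concatenate pieces and turn the '▁' marker back into spaces.

    Assumes the SentencePiece convention that '▁' marks a space and that
    the leading space is dropped on decoding.
    """
    text = "".join(piece_for(t) for t in token_ids)
    return text.replace("▁", " ").lstrip(" ")


# The first seven IDs from the output above (the rest is truncated there).
ids = [20005, 94874, 84097, 20035, 94947, 22881, 35955]
print(join_pieces(ids))
```

With the real library, the lookup would go through `icetk.text_tokenizer.proto.pieces` as in the snippet above; the toy table only stands in for it so the sketch runs without icetk installed.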
1. Fix typos
2. Remove redundant whitespace