icetk

A unified tokenization tool for Images, Chinese and English.

Results 7 icetk issues

Currently, `sentencepiece_model_pb2.py` is generated with an old version of protobuf, which makes it incompatible with many libraries that require a newer protobuf. We can regenerate `sentencepiece_model_pb2.py` with a newer version of...

The required protobuf version seems too low and conflicts with other Python frameworks such as mlflow. Could you upgrade the protobuf dependency?

When I install icetk with `pip install icetk`, the installed version is 0.0.5. But when I go back to this code repo, I cannot find...

The tokenizer can't be hashed when using the datasets.map function with num_proc > 1. https://github.com/THUDM/ChatGLM-6B/issues/286

Does icetk have a C++ implementation?

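```py
from icetk import icetk

tokens = icetk.encode('你好世界!这里是 icetk。')
for token in tokens:
    # Text token ids are offset by 20000 relative to the text tokenizer's piece table.
    print(token, icetk.text_tokenizer.proto.pieces[token - 20000].piece)
```
```
20005 ▁
94874 你好
84097 世界
20035 !
94947 这里是
22881 ▁ice
35955 tk
83823...
```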
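The `token - 20000` offset used in the snippet above can be illustrated without icetk installed. This is a minimal sketch that uses only the id/piece pairs printed above. The names `TEXT_ID_OFFSET`, `PIECES`, `ids_to_pieces`, and `pieces_to_text` are hypothetical and not part of icetk, and mapping `▁` back to a space is assumed from the usual SentencePiece convention.

```python
# Hypothetical illustration; none of these names are part of icetk.
TEXT_ID_OFFSET = 20000  # the offset subtracted in the snippet above

# Piece index -> piece, copied from the printed output (token id minus 20000).
PIECES = {
    5: "▁",
    74874: "你好",
    64097: "世界",
    35: "!",
    74947: "这里是",
    2881: "▁ice",
    15955: "tk",
}


def ids_to_pieces(token_ids):
    """Look up each token id in the piece table after removing the offset."""
    return [PIECES[token_id - TEXT_ID_OFFSET] for token_id in token_ids]


def pieces_to_text(pieces):
    """Join pieces and turn the '▁' marker back into a space (assumed convention)."""
    return "".join(pieces).replace("▁", " ")


ids = [20005, 94874, 84097, 20035, 94947, 22881, 35955]
pieces = ids_to_pieces(ids)
print(pieces)
print(repr(pieces_to_text(pieces)))
```

The lookup table here is deliberately partial: it covers only the ids visible in the truncated output, so any other id raises a `KeyError` rather than returning a guessed piece.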

1. Fix typos
2. Remove redundant whitespace