icetk

Published 20 hours ago


Metadata

A unified tokenization tool for Images, Chinese and English.

Issues

Results 7 icetk issues


Can you offer the `sentencepiece_model.proto`?

Currently, `sentencepiece_model_pb2.py` is generated with an old version of protobuf, which makes it incompatible with many libraries that require a newer protobuf. We can regenerate `sentencepiece_model_pb2.py` with a newer version of...
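Regenerating a `_pb2.py` module from a `.proto` file is normally a single `protoc` call. The sketch below assumes the `protoc` binary is on `PATH` and that a `sentencepiece_model.proto` file is available locally, which is exactly what the issue asks for. The helper names `build_protoc_command` and `regenerate_pb2` are hypothetical and not part of icetk.

```python
import subprocess
from pathlib import Path


def build_protoc_command(proto_file: str, out_dir: str = ".") -> list[str]:
    """Build the protoc invocation that emits a *_pb2.py module for proto_file."""
    proto = Path(proto_file)
    return [
        "protoc",
        f"--proto_path={proto.parent}",  # directory used to resolve the .proto file
        f"--python_out={out_dir}",       # where the generated *_pb2.py is written
        str(proto),
    ]


def regenerate_pb2(proto_file: str, out_dir: str = ".") -> None:
    """Run protoc; raises CalledProcessError if the compiler reports a failure."""
    subprocess.run(build_protoc_command(proto_file, out_dir), check=True)
```

The generated module generally needs to be loaded with a protobuf runtime at least as new as the `protoc` that produced it, so the compiler version is worth pinning alongside the dependency.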

Version of protobuf is too low and conflicts with other Python frameworks

2 comments

The protobuf version seems too low and conflicts with other Python frameworks such as mlflow. Could you upgrade the protobuf dependency?
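To see whether an environment is affected, it can help to check which protobuf version is actually installed. This is a minimal sketch using only the standard library. The issue does not say which version each framework requires, so any threshold compared against is an assumption, and `parse_version` is a rough parser written for this example rather than a full version-spec implementation.

```python
import re
from importlib import metadata


def installed_version(package: str):
    """Return the installed version string of a package, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def parse_version(version: str) -> tuple:
    """Rough parse: keep the first three numeric fields, e.g. '3.18.3' -> (3, 18, 3)."""
    return tuple(int(field) for field in re.findall(r"\d+", version)[:3])


if __name__ == "__main__":
    found = installed_version("protobuf")
    # (3, 20, 0) is a placeholder threshold, not a requirement stated in the issue.
    if found is not None and parse_version(found) < (3, 20, 0):
        print(f"protobuf {found} may be too old for other installed packages")
```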

Please add tags to this repo corresponding to versions you had published to official pip website

When I installed icetk with `pip install icetk`, the installed version was 0.0.5. But when I went back to this code repo, I could not find...

Tokenizer can't be hashed when using datasets.map function

2 comments

The tokenizer can't be hashed when using the `datasets.map` function with `num_proc` > 1. See https://github.com/THUDM/ChatGLM-6B/issues/286
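One common workaround for objects that cannot be serialized is to keep the heavy backend out of the pickled state and reload it lazily in each worker. The sketch below assumes the failure comes from serializing the tokenizer's internal state, which the issue summary does not confirm. `LazyTokenizer` is a hypothetical wrapper, not an icetk class, and a `threading.Lock` stands in for the unpicklable backend.

```python
import pickle
import threading


class LazyTokenizer:
    """Hypothetical wrapper that excludes an unpicklable backend from its pickled state."""

    def __init__(self, model_path: str):
        self.model_path = model_path
        self._backend = None  # loaded on first use

    def _load(self):
        # Stand-in for loading a real tokenizer; a lock cannot be pickled by `pickle`.
        return threading.Lock()

    @property
    def backend(self):
        if self._backend is None:
            self._backend = self._load()
        return self._backend

    def __getstate__(self):
        # Copy the instance dict and drop the backend so only plain data is serialized.
        state = self.__dict__.copy()
        state["_backend"] = None
        return state


tok = LazyTokenizer("model.bin")  # placeholder path
_ = tok.backend                    # force the backend to load
clone = pickle.loads(pickle.dumps(tok))  # succeeds because the backend is excluded
```

Each worker process that receives the clone rebuilds the backend on first access, at the cost of one extra load per process.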

Does icetk have a C++ implementation?

Does icetk have a C++ implementation?

What's the meaning of token 20005?

```py
tokens = icetk.encode('你好世界！这里是 icetk。')
for token in tokens:
    print(token, icetk.text_tokenizer.proto.pieces[token - 20000].piece)
```

```
20005 ▁
94874 你好
84097 世界
20035 !
94947 这里是
22881 ▁ice
35955 tk
83823...
```
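The snippet above subtracts 20000 from each token ID before indexing the text vocabulary, and the output shows that ID 20005 maps to the piece `▁`. In SentencePiece vocabularies, `▁` (U+2581) is the marker for whitespace. A minimal sketch of both steps follows, assuming the 20000 offset applies to every text token as the snippet implies. `piece_index` and `join_pieces` are hypothetical helpers, not icetk functions.

```python
TEXT_TOKEN_OFFSET = 20000  # offset used in the snippet above; assumed to hold for all text tokens
SPACE_MARKER = "\u2581"    # SentencePiece whitespace marker, rendered as "▁"


def piece_index(token_id: int, offset: int = TEXT_TOKEN_OFFSET) -> int:
    """Map a token ID to its index in the text vocabulary's piece list."""
    if token_id < offset:
        raise ValueError(f"token {token_id} is below the text offset {offset}")
    return token_id - offset


def join_pieces(pieces: list) -> str:
    """Concatenate pieces and turn each whitespace marker back into a space."""
    return "".join(pieces).replace(SPACE_MARKER, " ")


# Pieces copied from the output above (the list there is truncated).
pieces = [SPACE_MARKER, "你好", "世界", "!", "这里是", SPACE_MARKER + "ice", "tk"]
text = join_pieces(pieces)
```

Under this reading, token 20005 is index 5 in the text vocabulary and stands for a space, here the one at the start of the encoded string.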

Fix format

1. Fix typos
2. Remove redundant whitespace

About

A unified tokenization tool for Images, Chinese and English.

Topics: transformer, tokenization

Stars: 145

Forks: 16
