sentencepiece
Unsupervised text tokenizer for Neural Network-based text generation.
I need to use sentencepiece for tokenization, and I also need OpenVINO for NLP task inference. I am using vcpkg to manage both sentencepiece and OpenVINO. The protobuf for OpenVINO...
When building with C++20, an error occurs due to the default of "-1" on L221-222: ``` constexpr unicode_script::ScriptType kAnyType = static_cast<unicode_script::ScriptType>(-1); ```
I have some id values and I want to train them with BPE. The following is an example of the id values. ``` 26865, 5412, 26865, 26865, 26865, 26865, 5412, 5412,...
I think there is a bug in calculation of max_score in unigram_model.cc: https://github.com/google/sentencepiece/blob/6225e08edb2577757163b3f5dbba4c0b670ef445/src/unigram_model.cc#L657-L664 As FLT_MIN is a very small positive number (on my system it's 1.17549435e-38) and token scores are...
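The concern raised in the report above can be illustrated with a minimal sketch. It assumes token scores are negative log probabilities (the excerpt is cut off before saying so), the score list is hypothetical, and the constant is the FLT_MIN value quoted in the report. This is not the code from unigram_model.cc, only a demonstration of why a small positive seed can mask the true maximum.

```python
# FLT_MIN as quoted in the report: the smallest positive normal float,
# not the most negative representable value.
FLT_MIN = 1.17549435e-38


def max_score(scores, initial):
    """Return the running maximum of scores, starting from initial."""
    best = initial
    for s in scores:
        best = max(best, s)
    return best


# Hypothetical token scores, assumed to be negative log probabilities.
scores = [-3.2, -7.5, -1.1]

# Seeding with FLT_MIN: no negative score can exceed it, so it is returned unchanged.
seeded_with_flt_min = max_score(scores, FLT_MIN)

# Seeding with negative infinity: the true maximum of the scores is found.
seeded_with_neg_inf = max_score(scores, float("-inf"))
```

If all scores are negative, the first result is the seed itself rather than any real score, which is the behavior the report appears to describe.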
Hi, thanks for your great work on this. I noticed a subtle issue when playing with synthetic examples. The BPE algorithm works as expected, but the unigram algorithm does not...
This line raises an error: `spm.SentencePieceTrainer.train('--input=dict.ja.txt --model_prefix=m --vocab_size=27034')`
I recently encountered a compatibility issue when using `sentencepiece v0.2.0` together with the latest `transformers` and `tensorflow` packages. When I ran a Python script that imports the `AutoProcessor` class from `transformers`, the...
Bumps the github-actions group with 3 updates in the / directory: [actions/upload-artifact](https://github.com/actions/upload-artifact), [actions/checkout](https://github.com/actions/checkout) and [actions/setup-python](https://github.com/actions/setup-python). Updates `actions/upload-artifact` from 3 to 4 Release notes Sourced from actions/upload-artifact's releases. v4.0.0 What's Changed...
Bumps the build-time-deps group with 3 updates in the /.github/workflows/requirements directory: [cibuildwheel](https://github.com/pypa/cibuildwheel), [pytest](https://github.com/pytest-dev/pytest) and [setuptools](https://github.com/pypa/setuptools). Updates `cibuildwheel` from 2.19.2 to 2.21.1 Release notes Sourced from cibuildwheel's releases. Version 2.21.1...
Using sentencepiece 0.1.99 in Python 3.11.10, an out-of-range input may cause a crash, depending on which other valid inputs are part of the batch: ``` >>> tkn.load(str(Path("gemma2-9b") / "tokenizer.model")) True...
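One defensive pattern for the situation described above is to check every id against the vocabulary size before handing a batch to the tokenizer. The sketch below rests on an assumption about the failure mode, since the excerpt is truncated: it treats the out-of-range input as a token id. `find_out_of_range_ids` is a hypothetical helper, not part of sentencepiece, and the batch and vocabulary size are made up for illustration. With a loaded model, the vocabulary size could be taken from `tkn.get_piece_size()`.

```python
def find_out_of_range_ids(batch, vocab_size):
    """Return (row, column, id) for every id outside [0, vocab_size)."""
    bad = []
    for row, ids in enumerate(batch):
        for col, token_id in enumerate(ids):
            if not 0 <= token_id < vocab_size:
                bad.append((row, col, token_id))
    return bad


# Hypothetical batch and vocabulary size, for illustration only.
batch = [[1, 5, 9], [2, 10, -1]]
problems = find_out_of_range_ids(batch, vocab_size=10)
```

Rejecting or filtering the batch when `problems` is non-empty keeps invalid ids away from the native library, whatever the root cause of the crash turns out to be.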