sentencepiece

Unsupervised text tokenizer for Neural Network-based text generation.

Results 102 sentencepiece issues

Hi, I'm reaching out on behalf of the [Open Source Security Foundation (openssf.org)](https://openssf.org/). We work on improving the security of critical open source projects like yours. Together with [GitHub](https://github.blog/2022-04-07-slsa-3-compliance-with-github-actions/), we...

This problem happens with version `0.1.96`; I recently upgraded from `0.1.91`, which was working fine. When making a basic test using Ubuntu 20.04 on GitHub, a segmentation fault...

I ran `pip install --no-cache-dir sentencepiece`, but when I try to import it in Python 3.9, it crashes with: ImportError: dlopen(/Users/olivier/miniforge3/lib/python3.9/site-packages/sentencepiece/_sentencepiece.cpython-39-darwin.so, 0x0002): symbol not found in flat namespace '__ZN13sentencepiece4util6StatusD1Ev'...

Similar to #474, I want to restrict my vocabulary, and then save a new model file that uses the restricted vocabulary. I tried to do this by saving a vocabulary,...

help wanted
feature request

Hi, thanks for this wonderful toolkit you have built! If my understanding is right, this toolkit takes all the letters and punctuation marks as the character set to merge, where each...

feature request

As per title: `CMakeLists.txt` has
```
set(prefix ${CMAKE_INSTALL_PREFIX})
set(exec_prefix "\${prefix}")
set(libdir "\${exec_prefix}/${CMAKE_INSTALL_LIBDIR}")
set(includedir "\${prefix}/${CMAKE_INSTALL_INCLUDEDIR}")
```
and so can’t handle absolute paths in `CMAKE_INSTALL_{INCLUDE,LIB}DIR`. This leads to broken .pc files on...

I'm trying to train a T5 model with the `transformers` library, which requires the `sentencepiece` library to tokenize sentences. But after installing it with `pip install sentencepiece`, I can't import...

When computing `logsum_alt`, the frequency of a removed piece is re-assigned to alternatives: https://github.com/google/sentencepiece/blob/ba7e11a17f606327d0652528d58d2dd8cd265c6f/src/unigram_model_trainer.cc#L389-L394 But the code uses `alternatives.size()` which, if I'm not mistaken, is always equal to `sentencepieces.size()`. Don't...

bug

I think the BPE algorithm is not working properly. This code snippet reproduces the bug. ``` import sentencepiece as spm vocab_size= 9 model_prefix = 'model' train_data_file = 'corpus.txt' text =...

bug

Hello all, thank you for developing the sentencepiece library! I am using Bazel and want to incorporate sentencepiece into my project and use the C++ API. I could not find any...

feature request