sentencepiece issues

Sign wheels in GitHub releases

Hi, I'm reaching out on behalf of the [Open Source Security Foundation (openssf.org)](https://openssf.org/). We work on improving the security of critical open source projects like yours. Together with [GitHub](https://github.blog/2022-04-07-slsa-3-compliance-with-github-actions/), we...

laurentsimon

Segmentation fault on Ubuntu with basic python test

4

This problem occurs with version `0.1.96`. I recently upgraded from `0.1.91`, which was working fine. When making a basic test using Ubuntu 20.04 on GitHub, a segmentation fault...

johntmyers

symbol not found in flat namespace '__ZN13sentencepiece4util6StatusD1Ev'

7

I ran `pip install --no-cache-dir sentencepiece`, but when I try to import it in Python 3.9, it crashes with: ImportError: dlopen(/Users/olivier/miniforge3/lib/python3.9/site-packages/sentencepiece/_sentencepiece.cpython-39-darwin.so, 0x0002): symbol not found in flat namespace '__ZN13sentencepiece4util6StatusD1Ev'...

emergix

How to create new model file with restricted vocabulary?

4

Similar to #474, I want to restrict my vocabulary, and then save a new model file that uses the restricted vocabulary. I tried to do this by saving a vocabulary,...

sshleifer

help wanted

feature request

user defined char set

4

Hi, thanks for this wonderful toolkit you have built! If my understanding is right, this toolkit takes all the letters and punctuation marks as the char set to merge, where each...

wenjie-p

feature request

pkg-config file is broken when CMAKE_INSTALL_{INCLUDE,LIB}DIR is absolute

As per title: `CMakeLists.txt` has

```
set(prefix ${CMAKE_INSTALL_PREFIX})
set(exec_prefix "\${prefix}")
set(libdir "\${exec_prefix}/${CMAKE_INSTALL_LIBDIR}")
set(includedir "\${prefix}/${CMAKE_INSTALL_INCLUDEDIR}")
```

and so can’t handle absolute paths in `CMAKE_INSTALL_{INCLUDE,LIB}DIR`. This leads to broken .pc files on...

alexshpilkin

Bug: can't co-exist with pytorch-lightning

5

I'm trying to train a T5 model with the `transformers` library, which requires the `sentencepiece` library to tokenize sentences. But when I installed it with `pip install sentencepiece`, I can't import...

jordane95

Is the loss computation in UnigramTrainer correct?

1

When computing `logsum_alt`, the frequency of a removed piece is re-assigned to alternatives: https://github.com/google/sentencepiece/blob/ba7e11a17f606327d0652528d58d2dd8cd265c6f/src/unigram_model_trainer.cc#L389-L394 But the code uses `alternatives.size()` which, if I'm not mistaken, is always equal to `sentencepieces.size()`. Don't...

mbollmann

bug

Bug in BPE algorithm

3

I think the BPE algorithm is not working properly. This code snippet reproduces the bug.

```
import sentencepiece as spm
vocab_size = 9
model_prefix = 'model'
train_data_file = 'corpus.txt'
text =...
```

xbelonogov

bug

bazel support for C++ API

1

Hello all, thank you for developing the sentencepiece library! I am using bazel and want to incorporate sentencepiece into my project and use the C++ API. I could not find any...

BBerabi

feature request
