sentencepiece
In the arguments of the SpeechBrain SentencePiece tokenizer, the type of `character_coverage` is documented as int. However, it actually accepts a float between 0.98 and 1.0.
```python
from speechbrain.tokenizers.SentencePiece import SentencePiece

save_folder = './'
output_neurons = 60
train_csv = './pcm_nsc-ud-train_v2.csv'
token_type = 'bpe'
character_coverage = 2

tokenizer = SentencePiece(
    model_dir=save_folder,
    model_type=token_type,
    vocab_size=output_neurons,
    annotation_train=train_csv,
    annotation_read="wrd",
    character_coverage=character_coverage,
    user_defined_symbols="0,1,2,3,4,5,6,7,8,9",
)
```
The docstring says:

```
character_coverage : int
    Amount of characters covered by the model, good defaults
    are: 0.9995 for languages with a rich character set like Japanese or
    Chinese and 1.0 for other languages with small character set.
    (default: 1.0)
```
Running the code above raises this error:

```
RuntimeError: Internal: /Users/runner/work/sentencepiece/sentencepiece/src/trainer_interface.cc(69) [trainer_spec.character_coverage() >= 0.98 && trainer_spec.character_coverage() <= 1.0]
```
Should the type of `character_coverage` in the docstring be corrected to float to match this behavior, or am I missing something here?
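For what it's worth, the bracketed condition in the error message shows the check sentencepiece applies internally. A minimal Python sketch of that constraint (the function name here is my own, for illustration; it is not part of either library):

```python
def check_character_coverage(coverage: float) -> float:
    """Mirror the range check from sentencepiece's trainer_interface.cc:
    the value must satisfy 0.98 <= character_coverage <= 1.0, which only
    makes sense for a float, despite the docstring saying int."""
    if not 0.98 <= coverage <= 1.0:
        raise ValueError(
            f"character_coverage must be between 0.98 and 1.0, got {coverage}"
        )
    return coverage

check_character_coverage(1.0)     # accepted: the documented default
check_character_coverage(0.9995)  # accepted: suggested for Japanese/Chinese
# check_character_coverage(2)     # rejected, like the RuntimeError above
```

Passing `character_coverage=2` (as in my repro) falls outside this range, which is why training aborts before it even starts.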