
Character coverage: its type is int, but it accepts a value between 0.98 and 1

Open minniekabra opened this issue 9 months ago • 0 comments

In the arguments of the SentencePiece tokenizer, the type of character_coverage is documented as int. However, it only accepts a value between 0.98 and 1.

from speechbrain.tokenizers.SentencePiece import SentencePiece


save_folder='./'
output_neurons=60
train_csv='./pcm_nsc-ud-train_v2.csv'
token_type='bpe'
character_coverage=2

tokenizer = SentencePiece(
    model_dir=save_folder,
    model_type=token_type,
    vocab_size=output_neurons,
    annotation_train=train_csv,
    annotation_read="wrd",
    character_coverage=character_coverage,
    user_defined_symbols="0,1,2,3,4,5,6,7,8,9",
)

The docstring describes the parameter as follows:

 character_coverage : int
 |      Amount of characters covered by the model, good defaults
 |      are: 0.9995 for languages with a rich character set like Japanese or
 |      Chinese and 1.0 for other languages with small character set.
 |      (default: 1.0)

I got this error:

RuntimeError: Internal: /Users/runner/work/sentencepiece/sentencepiece/src/trainer_interface.cc(69) [trainer_spec.character_coverage() >= 0.98 && trainer_spec.character_coverage() <= 1.0]

Should the documented type of character_coverage be corrected to float, in line with the range described above, or am I missing something?
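
For anyone hitting the same error, a possible workaround is to check the value before constructing the tokenizer. The sketch below is only an illustration: `validate_character_coverage` is a hypothetical helper, not part of SpeechBrain or sentencepiece, and the bounds 0.98 and 1.0 are taken from the assertion in the error message above.

```python
def validate_character_coverage(value, low=0.98, high=1.0):
    """Hypothetical helper: return value as a float if it lies in [low, high].

    The default bounds mirror the check reported in the RuntimeError above;
    they are an assumption about the installed sentencepiece build.
    """
    coverage = float(value)
    if not (low <= coverage <= high):
        raise ValueError(
            f"character_coverage must be between {low} and {high}, got {value!r}"
        )
    return coverage


# The integer 1 is accepted and converted; 2 is rejected before training starts.
print(validate_character_coverage(1))       # 1.0
print(validate_character_coverage(0.9995))  # 0.9995
try:
    validate_character_coverage(2)
except ValueError as err:
    print(err)  # character_coverage must be between 0.98 and 1.0, got 2
```

The returned float could then be passed as `character_coverage=...` in the call shown earlier, so an out-of-range value fails with a readable message instead of an internal trainer error.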

minniekabra, May 09 '25 14:05