rust-tokenizers
Reading SentencePieceVocab from text file
I've created a SentencePiece model using Python, which results in a `.model` and a `.vocab` file. It is not possible to create a `SentencePieceVocab` from the latter, since Python does not seem to use protobuf but rather a plain text file. Here's an excerpt of my file:
```
<unk>	0
<s>	0
</s>	0
▁	-2.29038
s	-3.10405
l	-3.41047
```
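For reference, each line of such a `.vocab` file is a token followed by its log-probability score. A minimal sketch of parsing one line into a `(token, score)` pair (the helper name is mine, not part of rust-tokenizers):

```rust
/// Parse one line of a SentencePiece `.vocab` file into (token, score).
/// Hypothetical helper for illustration, not part of rust-tokenizers.
fn parse_vocab_line(line: &str) -> Option<(String, f32)> {
    let mut fields = line.split_whitespace();
    let token = fields.next()?.to_owned();
    let score: f32 = fields.next()?.parse().ok()?;
    Some((token, score))
}

fn main() {
    let excerpt = "<unk>\t0\n▁\t-2.29038\ns\t-3.10405";
    for line in excerpt.lines() {
        if let Some((token, score)) = parse_vocab_line(line) {
            println!("{} -> {}", token, score);
        }
    }
}
```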
I didn't find an option in the Python code for creating a protobuf vocab file so I wrote a parser. Unless I'm mistaken and did something wrong, would you like that code as a PR? I.e. something like:
```rust
impl SentencePieceVocab {
    // ...
    pub fn from_vocab_txt_file(path: &str) -> Result<SentencePieceVocab, TokenizerError> {
        // ...
    }
}
```
in `rust-tokenizers/main/src/vocab/sentence_piece_vocab.rs`
Hello @MikaelCall ,
This is a good idea - it would be great if you could contribute your code back to the community! I have a few questions on my side regarding this implementation:
- The user typically wants to create a `Tokenizer` that internally creates a `Vocab`. I believe handling this different format would need to be adapted in the tokenizers as well. Usually the `Tokenizer` contains both a `SentencePieceModel` and a `Vocab`. For some tokenizers these are generated from the same file (for example `XLNetTokenizer`). Does this mean we need to add additional parsing capabilities for all of the `Vocab`s of SentencePiece-based tokenizers (i.e., updating the `::from_file` method of `XLNetVocab`)?
)? - The loading of Sentencepiece files (from proto or text file) could be shared in 2 traits:
-
FromProto
exposing methodsgenerate_vocab_from_proto
(returning a HashMap) andgenerate_vocab_with_scores_from_proto
(returning a Trie) -
FromText
exposing methodsgenerate_vocab_from_text
andgenerate_vocab_with_scores_fromt_text
-
  The `Vocab` structs (e.g. `XLNetVocab`, `XLMRobertaVocab`, ...) and `SentencePieceModel` would implement these traits to load the files and create the intermediate HashMap required for their internal storage. Alternatively, the traits could be arranged as `TrieFromFile` and `VocabFromFile`, re-arranging the above methods (`Vocab`s would implement `VocabFromFile` and `SentencePieceModel` would implement `TrieFromFile`).
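As a rough sketch of the trait split described above (the trait and method names come from the comment; the signatures, default body, and stand-in struct are my assumptions, not the actual rust-tokenizers API):

```rust
use std::collections::HashMap;

// Sketch only: a `FromText` trait with a shared default implementation.
// The scored variant returning a Trie is omitted here.
trait FromText {
    /// Build a token -> index map from the plain-text vocab contents,
    /// where each line is "token<tab>score" and the index is the line number.
    fn generate_vocab_from_text(content: &str) -> HashMap<String, i64> {
        content
            .lines()
            .enumerate()
            .filter_map(|(index, line)| {
                line.split_whitespace()
                    .next()
                    .map(|token| (token.to_owned(), index as i64))
            })
            .collect()
    }
}

// Stand-in for a concrete vocab struct such as XLNetVocab.
struct XLNetVocabSketch;
impl FromText for XLNetVocabSketch {}

fn main() {
    let vocab = XLNetVocabSketch::generate_vocab_from_text("<unk>\t0\n<s>\t0\n▁\t-2.29038");
    println!("{} tokens, <s> at index {:?}", vocab.len(), vocab.get("<s>"));
}
```

Each concrete `Vocab` would then get the shared parsing logic for free via the default method, overriding it only where a tokenizer needs special handling.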
- For the additional file support, there may be 2 ways to implement it:
  - A separate public method `from_vocab_txt_file`, as you suggested.
  - A unique public method `from_file` that tries loading the file as protobuf, and falls back to loading it as a text file if protobuf parsing fails. The unique entry point would probably call 2 specialized functions to try loading the file, but would allow keeping the API unchanged. The advantage is that the `rust-bert` pipelines loading tokenizer files would not need to know which format the vocab is stored in. We may still want to expose `from_vocab_txt_file` publicly, but supporting both formats in a single loading method offers a more consistent API. What do you think?
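The second option could be sketched roughly as follows; the types and parsing functions here are stand-ins for illustration (the real code would use `SentencePieceVocab` and `TokenizerError`, with actual protobuf decoding):

```rust
// Stand-in vocab type; the real code would use SentencePieceVocab.
#[derive(Debug, PartialEq)]
struct Vocab(Vec<String>);

fn from_proto_bytes(data: &[u8]) -> Result<Vocab, String> {
    // Stand-in "protobuf" check: real code would run protobuf decoding
    // and fail cleanly on non-protobuf input.
    if data.starts_with(b"\x0a") {
        Ok(Vocab(vec!["<proto>".to_owned()]))
    } else {
        Err("not a protobuf file".to_owned())
    }
}

fn from_text_bytes(data: &[u8]) -> Result<Vocab, String> {
    let text = String::from_utf8(data.to_vec()).map_err(|e| e.to_string())?;
    Ok(Vocab(
        text.lines()
            .filter_map(|l| l.split_whitespace().next().map(str::to_owned))
            .collect(),
    ))
}

/// Unified entry point: try protobuf first, fall back to plain text.
fn from_file_bytes(data: &[u8]) -> Result<Vocab, String> {
    from_proto_bytes(data).or_else(|_| from_text_bytes(data))
}

fn main() {
    let text_file = "<unk>\t0\n<s>\t0".as_bytes();
    let vocab = from_file_bytes(text_file).unwrap();
    println!("{:?}", vocab);
}
```

Callers such as the `rust-bert` pipelines would then only ever call the single entry point, regardless of which format the vocab file is stored in.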
Potential for a conversion utility
This is however possibly more complex than converting the file to a Protobuf. Maybe a Python-based library allowing conversion from a text file to Proto (and the other way around) would be generally valuable. I believe the community may be interested in such a tool, as it could be more broadly applicable (e.g. to Python users). I haven't tried it, but the following script gives an indication of what something along those lines could look like:
```python
import sentencepiece_model_pb2 as model

m = model.ModelProto()
# load_tokens_and_score_from_text is a placeholder for reading
# (token, score) pairs from the plain-text vocabulary file
tokens = load_tokens_and_score_from_text(filepath)
for token, score in tokens:
    new_token = model.ModelProto.SentencePiece()
    new_token.piece = token
    new_token.score = score
    m.pieces.append(new_token)
with open('new.model', 'wb') as f:
    f.write(m.SerializeToString())
```
Please let me know what you think
I'm not familiar with the design and use cases so I can unfortunately not give any useful input on how to update traits.
Concerning the additional file support and the 2 ways to implement it, I think that your second suggestion would be very easy to implement and I'd be willing to submit a PR for that as long as you think it is a clean solution that doesn't interfere with your design.
I also agree with you that a conversion tool could be useful, or at least being able to choose the output format via optional arguments in the current CLI tools.
How can one create a `spiece.model` using `vocab.txt` (example) now?
Hi @MikaelCall
Is there any chance you could share your parser code? You mentioned you had written it already in your first post. Thanks!
```rust
impl SentencePieceVocab {
    // ...
    pub fn from_vocab_txt_file(path: &str) -> Result<SentencePieceVocab, TokenizerError> {
        // ...
    }
}
```
It's quite some time since I looked at it. This is what I was able to dig up.
```rust
use std::collections::HashMap;
use std::fs::File;
use std::io::{BufRead, BufReader};

/// Read a plain-text vocabulary file for SentencePiece tokenization.
fn read_vocab_file(path: &str) -> Result<SentencePieceVocab, TokenizerError> {
    let f = File::open(path).map_err(|e| {
        TokenizerError::FileNotFound(format!("{} vocabulary file not found :{}", path, e))
    })?;
    let br = BufReader::new(f);
    let mut values = HashMap::new();
    for (index, line) in br.lines().enumerate() {
        let line = line.map_err(|e| TokenizerError::VocabularyParsingError(e.to_string()))?;
        // The token is the first whitespace-separated field; the score is ignored here.
        let token = line
            .split_whitespace()
            .next()
            .ok_or_else(|| TokenizerError::VocabularyParsingError(line.clone()))?
            .trim();
        if values.insert(token.to_owned(), index as i64).is_some() {
            return Err(TokenizerError::VocabularyParsingError(format!(
                "duplicate token {} in vocabulary file",
                token
            )));
        }
    }
    let mut special_values = HashMap::new();
    let unknown_value = SentencePieceVocab::unknown_value();
    SentencePieceVocab::_register_as_special_value(unknown_value, &values, &mut special_values)?;
    let indices = Self::swap_key_values(&values);
    let special_indices = Self::swap_key_values(&special_values);
    Ok(SentencePieceVocab {
        values,
        indices,
        unknown_value,
        special_values,
        special_indices,
    })
}
```