
Reading SentencePieceVocab from text file

Open MikaelCall opened this issue 3 years ago • 5 comments

I've created a SentencePiece model using Python, which produces a .model and a .vocab file. It is not possible to create a SentencePieceVocab from the latter, since the Python library writes the vocabulary as a plain text file rather than protobuf. Here's an excerpt of my file:

<unk>	0
<s>	0
</s>	0
▁	-2.29038
s	-3.10405
l	-3.41047

I didn't find an option in the Python code for creating a protobuf vocab file, so I wrote a parser. Unless I'm mistaken and missed something, would you like that code as a PR? I.e. something like:

impl SentencePieceVocab {
    ...
    
    pub fn from_vocab_txt_file(path: &str) -> Result<SentencePieceVocab, TokenizerError> { 
        ... 
    }
}

in rust-tokenizers/main/src/vocab/sentence_piece_vocab.rs

MikaelCall, Apr 19 '21

Hello @MikaelCall ,

This is a good idea - it would be great if you could contribute your code back to the community! I have a few questions on my side regarding this implementation:

  • The user typically wants to create a Tokenizer, which internally creates a Vocab. I believe the tokenizers would need to be adapted to handle this different format as well. Usually the Tokenizer contains both a SentencePieceModel and a Vocab, and for some tokenizers these are generated from the same file (for example XLNetTokenizer). Does this mean we need to add additional parsing capabilities to all of the Vocabs of the Sentencepiece-based tokenizers (i.e., updating the ::from_file method of XLNetVocab and the like)?
  • The loading of Sentencepiece files (from proto or text file) could be shared in 2 traits:
    • FromProto exposing methods generate_vocab_from_proto (returning a HashMap) and generate_vocab_with_scores_from_proto (returning a Trie)
    • FromText exposing methods generate_vocab_from_text and generate_vocab_with_scores_from_text

The Vocab structs (e.g. XLNetVocab, XLMRobertaVocab, ...) and SentencePieceModel would implement these traits to load the files and create the intermediate HashMap required for their internal storage. Alternatively, the traits could be arranged as TrieFromFile and VocabFromFile, re-arranging the above methods (the Vocabs would implement VocabFromFile and SentencePieceModel would implement TrieFromFile).
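To make the idea concrete, here is a rough, self-contained sketch of what a FromText trait could look like. The names and signatures are assumptions based on the discussion above: the real crate would return its actual Trie type for the scored variant, and a FromProto counterpart would mirror these methods using protobuf decoding (omitted here since it needs a protobuf dependency).

```rust
use std::collections::HashMap;

// Sketch only: shared parsing of the plain-text .vocab format
// (token and score separated by whitespace, one entry per line,
// line number = token index).
trait FromText {
    /// Token -> index, as needed by the Vocab structs.
    fn generate_vocab_from_text(data: &str) -> HashMap<String, i64> {
        data.lines()
            .enumerate()
            .filter_map(|(i, line)| {
                line.split_whitespace()
                    .next()
                    .map(|tok| (tok.to_owned(), i as i64))
            })
            .collect()
    }

    /// Token -> (index, score); a real implementation would build a Trie
    /// for SentencePieceModel instead of a HashMap.
    fn generate_vocab_with_scores_from_text(data: &str) -> HashMap<String, (i64, f32)> {
        data.lines()
            .enumerate()
            .filter_map(|(i, line)| {
                let mut parts = line.split_whitespace();
                let tok = parts.next()?;
                let score: f32 = parts.next()?.parse().ok()?;
                Some((tok.to_owned(), (i as i64, score)))
            })
            .collect()
    }
}

// A vocab struct would opt in with an empty impl block
// (XLNetVocabSketch is a stand-in name, not the crate's type).
struct XLNetVocabSketch;
impl FromText for XLNetVocabSketch {}

fn main() {
    let data = "<unk>\t0\n\u{2581}\t-2.29038\ns\t-3.10405\n";
    let vocab = XLNetVocabSketch::generate_vocab_from_text(data);
    assert_eq!(vocab["s"], 2);
    let scored = XLNetVocabSketch::generate_vocab_with_scores_from_text(data);
    assert_eq!(scored["s"].0, 2);
    assert!((scored["s"].1 + 3.10405).abs() < 1e-5);
    println!("parsed {} tokens", vocab.len());
}
```

With default method bodies in the trait, each Vocab gets the shared parsing for free and only overrides what it needs.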

  • For the additional file support, there may be 2 ways to implement it:
    • Separate public method from_vocab_txt_file as you suggested
    • Unique public method from_file that first tries loading the file as protobuf, and falls back to text parsing if the protobuf loading fails. The unique entry point would probably call 2 specialized functions to try loading the file, but would allow keeping the API unchanged. The advantage is that the rust-bert pipelines loading tokenizer files would not need to know which format the vocab is stored in. We may still want to expose from_vocab_txt_file publicly, but supporting both formats in a single loading method offers a more consistent API. What do you think?
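As an illustration of the second option, here is a minimal, self-contained sketch of the try-protobuf-then-fall-back-to-text flow. The error type, function names, and the stubbed protobuf parser are all assumptions for illustration; a real implementation would decode the ModelProto via a protobuf crate and use the crate's TokenizerError.

```rust
use std::collections::HashMap;

// Stand-in error type for illustration (the crate would use TokenizerError).
#[derive(Debug)]
struct VocabError(String);

// Stub for the protobuf path: a real implementation would decode a
// ModelProto here. For this sketch it always reports failure, which
// exercises the text fallback below.
fn parse_proto_vocab(_data: &[u8]) -> Result<HashMap<String, i64>, VocabError> {
    Err(VocabError("not a valid protobuf vocab (stub)".to_string()))
}

// Text path: one token per line, token and score separated by whitespace;
// the line number becomes the token index.
fn parse_text_vocab(data: &str) -> Result<HashMap<String, i64>, VocabError> {
    let mut values = HashMap::new();
    for (index, line) in data.lines().enumerate() {
        let token = line
            .split_whitespace()
            .next()
            .ok_or_else(|| VocabError(format!("empty line {}", index)))?;
        values.insert(token.to_owned(), index as i64);
    }
    Ok(values)
}

// Single entry point: try protobuf first, fall back to text parsing,
// so callers never need to know which format the file uses.
fn from_bytes(data: &[u8]) -> Result<HashMap<String, i64>, VocabError> {
    parse_proto_vocab(data).or_else(|_| {
        let text = std::str::from_utf8(data).map_err(|e| VocabError(e.to_string()))?;
        parse_text_vocab(text)
    })
}

fn main() {
    let vocab_txt = "<unk>\t0\n<s>\t0\n</s>\t0\n\u{2581}\t-2.29038\ns\t-3.10405\n";
    let vocab = from_bytes(vocab_txt.as_bytes()).expect("fallback parse failed");
    assert_eq!(vocab["<unk>"], 0);
    assert_eq!(vocab["s"], 4);
    println!("loaded {} tokens", vocab.len());
}
```

The sketch works on bytes rather than a path so it is easy to test; the real from_file would read the file once and hand the buffer to both parsers.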

Potential for a conversion utility

This is, however, possibly more complex than simply converting the file to a protobuf. Maybe a Python-based utility allowing conversion from text file to proto (and the other way around) would be generally valuable. I believe the community may be interested in such a tool, as it could be more broadly applicable (e.g. to Python users). I haven't tried it, but something along the following lines might work:

import sentencepiece_model_pb2 as model

m = model.ModelProto()

# load_tokens_and_score_from_text is left to the reader: it should yield
# (piece, score) pairs parsed from the tab-separated .vocab file
tokens = load_tokens_and_score_from_text(filepath)

for token, score in tokens:
    # SentencePiece is a nested message type on ModelProto
    new_token = model.ModelProto.SentencePiece()
    new_token.piece = token
    new_token.score = score
    m.pieces.append(new_token)

with open('new.model', 'wb') as f:
    f.write(m.SerializeToString())

Please let me know what you think

guillaume-be, Apr 19 '21

I'm not familiar with the design and use cases, so I unfortunately cannot give any useful input on how to update the traits.

Concerning the additional file support and the 2 ways to implement it, I think that your second suggestion would be very easy to implement and I'd be willing to submit a PR for that as long as you think it is a clean solution that doesn't interfere with your design.

I also agree with you that a conversion tool could be useful, or at least the ability to choose the output format via optional arguments in the current CLI tools.

MikaelCall, Apr 19 '21

How can one create a spiece.model from a vocab.txt (example) now?

failable, Mar 02 '22

Hi @MikaelCall

Is there any chance you could share your parser code? You mentioned you had written it already in your first post. Thanks!

impl SentencePieceVocab {
    ...
    
    pub fn from_vocab_txt_file(path: &str) -> Result<SentencePieceVocab, TokenizerError> { 
        ... 
    }
}

tobygodwin, Aug 31 '22

It's been quite some time since I looked at it. This is what I was able to dig up.

    /// Read Vocab file for sentence piece tokenization
    fn read_vocab_file(path: &str) -> Result<SentencePieceVocab, TokenizerError> {
        let f = File::open(path).map_err(|e| {
            TokenizerError::FileNotFound(format!("{} vocabulary file not found :{}", path, e))
        })?;
        let br = BufReader::new(f);
        let mut values = HashMap::new();

        for (index, line) in br.lines().enumerate() {
            let line = match line {
                Ok(value) => value,
                Err(e) => {
                    return Err(TokenizerError::VocabularyParsingError(e.to_string()));
                }
            };

            let token = line
                .split_whitespace()
                .next()
                .ok_or_else(|| TokenizerError::VocabularyParsingError(line.clone()))?
                .trim();

            // Reject duplicate tokens instead of silently overwriting them
            if values.insert(token.to_owned(), index as i64).is_some() {
                return Err(TokenizerError::VocabularyParsingError(format!(
                    "duplicate token {} at line {}",
                    token, index
                )));
            }
        }

        let mut special_values = HashMap::new();
        let unknown_value = SentencePieceVocab::unknown_value();
        SentencePieceVocab::_register_as_special_value(
            unknown_value,
            &values,
            &mut special_values,
        )?;

        let indices = Self::swap_key_values(&values);
        let special_indices = Self::swap_key_values(&special_values);

        Ok(SentencePieceVocab {
            values,
            indices,
            unknown_value,
            special_values,
            special_indices,
        })
    }

MikaelCall, Aug 31 '22