tokenizers icon indicating copy to clipboard operation
tokenizers copied to clipboard

missing Rust version of `from_pretrained`

Open Zhen-hao opened this issue 5 years ago • 9 comments

Hi, It is not clear from the README files and code what's the Rust version of the following Python code.

from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-cased')

Thanks in advance!

Zhen-hao avatar Aug 08 '20 20:08 Zhen-hao

here is my own attempt to reproduce the Python code in Rust based on my study of the two projects.

use tokenizers::models::wordpiece::WordPiece;
use tokenizers::tokenizer::{Result, Tokenizer};
use tokenizers::decoders::wordpiece::WordPiece as WordPieceDecoder;
use tokenizers::normalizers::bert::BertNormalizer;
use tokenizers::pre_tokenizers::bert::BertPreTokenizer;


fn main() -> Result<()> {
   // download from "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-cased-vocab.txt";
    let input_file = "./bert-base-cased-vocab.txt";

    let wordpiece_builder = WordPiece::from_files(input_file);
    let wordpiece = wordpiece_builder
        .build()?;

    let mut tokenizer = Tokenizer::new(Box::new(wordpiece));

    //tokenizer.with_normalizer(Box::new(BertNormalizer::default()));
    tokenizer.with_normalizer(Box::new(BertNormalizer::new(true, true, None, false)));
    tokenizer.with_pre_tokenizer(Box::new(BertPreTokenizer));
    tokenizer.with_decoder(Box::new(WordPieceDecoder::default()));
    let sep = tokenizer.get_model().token_to_id("[SEP]").unwrap();
    let cls = tokenizer.get_model().token_to_id("[CLS]").unwrap();
    tokenizer.with_post_processor(Box::new(BertProcessing::new(
        (String::from("[SEP]"), sep),
        (String::from("[CLS]"), cls),
    )));

    let encoding = tokenizer.encode("I can feel the magic, can you?", true)?;
    println!("encoding length is {}", encoding.len());
    println!("{:?}", encoding.get_tokens());
    println!("{:?}", encoding.get_ids());  

   Ok(())

output:

encoding length is 11
["[CLS]", "I", "can", "feel", "the", "magic", ",", "can", "you", "?", "[SEP]"]
[101, 146, 1169, 1631, 1103, 3974, 117, 1169, 1128, 136, 102]

It would be great if someone can tell me if this makes sense or not.

Zhen-hao avatar Aug 10 '20 20:08 Zhen-hao

Hi, It is not clear from the README files and code what's the Rust version of the following Python code.

from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-cased')

Thanks in advance!

As far as I am concerned, there is no existence of from_pretrained method, because I didn't get a result when searched around the source code in tokenizers dir.

jacklanda avatar Nov 19 '21 07:11 jacklanda

Hi @jacklanda,

How did you check ? https://github.com/huggingface/tokenizers/search?p=2&q=pretrained

Also, to use this, you need to build from source.

Narsil avatar Nov 19 '21 08:11 Narsil

I use a command find . | xargs grep -ri to search the target pattern I want(here it's from_pretrained). And I do this in this cache dir in the image below: 图片 However, it seems that this directory I do searching is not complete, right? It misses many files which shows in this url: https://github.com/huggingface/tokenizers/search?p=2&q=pretrained If it's true, I make a wrong judgement obviously :(

jacklanda avatar Nov 19 '21 15:11 jacklanda

You are using the last version published on crates, which is not master.

This hasn't been released yet.

Narsil avatar Nov 19 '21 15:11 Narsil

Got it, thanks for correcting my misunderstanding 👍🏻

jacklanda avatar Nov 19 '21 15:11 jacklanda

Support for from_pretrained appears to have been dropped in the latest version of Rust tokenizers (0.14.1) where it had been supported in version 0.13.3. See the below: version_0141 version_0133 I didn't see a note regarding support for from_pretrained being dropped in the release notes so figured I would raise the issue here. I also verified that running from_pretrained calls compile and run successfully when using version 0.13.3 and fail on version 0.14.1 when using tokenizers from crates.io.

bpatter5 avatar Nov 10 '23 11:11 bpatter5

Updating to reflect that after reviewing the Tokenizer implementation on the 0.14.1 branch you need to add the http feature when installing via cargo add: cargo add [email protected] --features http to get support for from_pretrained.

bpatter5 avatar Nov 10 '23 19:11 bpatter5