[Question] Is it possible to create a tokenizer from a config file?
Hey,
Is it possible to create a tokenizer from a config file, similar to the Rust library tokenizers? Example code in Rust:
use tokenizers::Tokenizer;
use clap::Parser;

/// Simple CLI to tokenize input text using a Hugging Face tokenizer.json
#[derive(Parser, Debug)]
#[command(author, version, about, long_about = None)]
struct Args {
    /// Path to tokenizer.json
    #[arg(short, long)]
    tokenizer: String,

    /// Text to tokenize
    #[arg(short, long)]
    text: String,
}

fn main() {
    let args = Args::parse();

    // Load tokenizer
    let tokenizer = Tokenizer::from_file(&args.tokenizer)
        .expect("Failed to load tokenizer.json");

    // Encode text
    let encoding = tokenizer.encode(args.text, true)
        .expect("Failed to encode text");

    // Output token IDs
    println!("{:?}", encoding.get_ids());
}
The config file has all the stages and parameters configured, such as added tokens, normalizers, pre-tokenizers, vocab, merges, and so on.
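For reference, a heavily trimmed tokenizer.json might look like the sketch below (the keys follow the Hugging Face serialization format; the specific normalizers, vocab entries, and merges are illustrative, and real files also carry fields like post_processor, decoder, and the full vocab/merges):

```json
{
  "version": "1.0",
  "added_tokens": [
    { "id": 0, "content": "<unk>", "special": true }
  ],
  "normalizer": {
    "type": "Sequence",
    "normalizers": [
      { "type": "NFC" },
      { "type": "Lowercase" }
    ]
  },
  "pre_tokenizer": {
    "type": "Sequence",
    "pretokenizers": [
      { "type": "ByteLevel", "add_prefix_space": false }
    ]
  },
  "model": {
    "type": "BPE",
    "vocab": { "<unk>": 0, "h": 1, "e": 2, "he": 3 },
    "merges": ["h e"]
  }
}
```

Note the "Sequence" wrappers: this is where a single config can hold an array of normalizers and pre-tokenizers rather than just one of each.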
At first glance this does not seem possible in Microsoft.ML.Tokenizers. It may be possible to build the tokenizer in code with the correct configuration using this Create method; however, it is not clear how, since the config has an array of normalizers and pre-tokenizers while the Create method only expects one of each. It seems I would need to implement my own normalizers and pre-tokenizers if I were to move forward with Microsoft.ML.Tokenizers. I am new to this domain, so I just wanted to confirm with experts whether this is the case, or if there are any suggestions.
Thank you.