Merges cannot handle tokens containing spaces.
This PR fixes that while keeping backward compatibility. We don't want to merge it blindly.
Hi. I found this PR through https://github.com/huggingface/tokenizers/issues/566.
How can I use the PR? I wasn't able to see the Rust files in my tokenizers site-packages, and the README on the repo doesn't give instructions for building from source.
Here are the docs for installing from source:
https://huggingface.co/docs/tokenizers/installation#installation-from-sources
I rebased against main, so this PR has all the features developed since. Please report whether it suits your use case, as that would strengthen the case for merging. I hesitate to merge features nobody uses, so it's nice to know if there are real use cases.
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.
Thanks for the speedy response, Narsil.
To clarify, this PR allows us to have merges in our tokenizer.json file which may contain multiple whitespace characters per merge?
I'll play around with it tomorrow and let you know 👍
Yes exactly, this allows any number of spaces or any other char in the merges.
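For illustration, a minimal sketch of the difference (made-up tokens, and only the merges field shown): the legacy format stores each merge as one space-separated string, which breaks down as soon as a token itself contains a space, while this PR stores each merge as an explicit pair.

# Legacy format: one space-separated string per merge; ambiguous
# if either side of the merge itself contains a space.
legacy_merges = ["Ġ t", "h e"]

# Pair format from this PR: each merge is a length-2 list of tokens,
# so a token containing a space (e.g. "h e") round-trips unambiguously.
pair_merges = [["Ġ", "t"], ["h e", "llo"]]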
Wow that installation process was painless. I guess there is a first time for everything!
Anyway, I'm still facing the error from the original issue: data did not match any variant of untagged enum ModelWrapper at line 24420 column 3. Each merge in tokenizer.json is now a length-2 list of strings instead of a single string, so I'm pretty sure your branch was used successfully.
The error is getting thrown at this line: fast_tokenizer = TokenizerFast.from_file(fast_tokenizer_file)
which is located in transformers/tokenization_utils_fast.py
Could you share your serialized file somehow? I could take a look.
https://pastebin.com/paT6zQhh
Hey man, did you get a chance to take a look?
Thanks for the reminder. I just did; it seems the merges you are referring to in your file are not present in the vocabulary, which makes the merges unsound.
How did you create that file ?
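If it helps, here is a quick, untested sketch for spotting the offending entries in Python (assuming this branch's pair-format merges):

import json

with open("tokenizer.json") as f:
    model = json.load(f)["model"]

vocab = model["vocab"]
for a, b in model["merges"]:
    # Each merge must reference two known tokens whose concatenation is
    # itself in the vocab; otherwise deserialization rejects the model.
    if a not in vocab or b not in vocab or a + b not in vocab:
        print("unsound merge:", (a, b))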
Here is how I'm testing this in Rust to get a better error message:
#[test]
// Ensure `BPE::from_file` works as expected.
fn test_bpe_from_file2() {
    // Make sure we can instantiate a BPE model from the files.
    let data = std::fs::read_to_string("test.json").unwrap();
    let bpe: BPE = serde_json::from_str(&data).unwrap();
}
Just add this to the Rust source code and run cargo test to see how it does (update test.json to point to the "model" part of the JSON).
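For instance, a minimal sketch (assuming the serialized file is named tokenizer.json) to extract just the model section into test.json:

import json

# Pull out just the "model" section so the Rust test above can
# deserialize it directly as a BPE model.
with open("tokenizer.json") as f:
    tokenizer_data = json.load(f)

with open("test.json", "w") as f:
    json.dump(tokenizer_data["model"], f, ensure_ascii=False)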
Making better error messages surface all the way into Python is trickier because of the libs we use and how this library is structured (though it would be really cool to have).
I'm trying to get BPE to learn the ideal character sequence to split on instead of splitting on whitespace first.
Something like this:
import os

from tokenizers import Tokenizer, normalizers
from tokenizers.models import BPE
from tokenizers.normalizers import Lowercase
from tokenizers.trainers import BpeTrainer
from transformers import PreTrainedTokenizerFast

tokenizer = Tokenizer(BPE())  # type: ignore
# data has been preprocessed to add Ġ to the beginning of every word in the corpus
tokenizer.normalizer = normalizers.Sequence([Lowercase(), normalizers.Replace("ġ", "Ġ")])

# Customize training
trainer = BpeTrainer(
    vocab_size=5000,
    min_frequency=2,
    show_progress=True,
    special_tokens=["<pad>", "<s>", "</s>", "<unk>", "<mask>"],
)
# file_list_for_language and language are defined elsewhere
tokenizer.train(files=file_list_for_language, trainer=trainer)

tokenizer_path = f"tokenizers/{language}/"
os.makedirs(tokenizer_path, exist_ok=True)
new_tokenizer = PreTrainedTokenizerFast(tokenizer_object=tokenizer)
new_tokenizer.save_pretrained(tokenizer_path)
I'm also doing some postprocessing which modifies the vocabulary directly. This may be where the discrepancy with the merges is coming from:
import json
from typing import Dict

with open(tokenizer_path + "tokenizer.json", "r") as f:
    data = json.load(f)

model_vocab: Dict[str, int] = data["model"]["vocab"]
postprocessed_vocab = {}
for tok, idx in model_vocab.items():
    # Leave the special tokens (ids 0-5) untouched
    if idx < 6:
        postprocessed_vocab[tok] = idx
        continue
    tok = tok.strip()
    if tok.endswith("Ġ.##"):
        tok = tok[:-4].strip()
    if tok.endswith("Ġ"):
        tok = tok[:-1].strip()
    postprocessed_vocab[tok] = idx
I'm not entirely sure how I can "clean" my vocabulary while making the matching changes in the merges list.
Yes, it's most likely your postprocessing that is causing the issue: each side of a merge, and their concatenation, must be present in the vocab, so rewriting vocab tokens without rewriting the merges makes the model fail to deserialize.
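Something along these lines should keep the two in sync. This is an untested sketch: clean is a hypothetical helper mirroring the vocab rewrite you posted, it assumes this branch's pair-format merges, and two tokens can collide after cleaning, which you would have to resolve yourself.

import json

def clean(tok: str) -> str:
    # Hypothetical helper: the same rewrite applied to the vocab above.
    tok = tok.strip()
    if tok.endswith("Ġ.##"):
        tok = tok[:-4].strip()
    if tok.endswith("Ġ"):
        tok = tok[:-1].strip()
    return tok

with open(tokenizer_path + "tokenizer.json", "r") as f:
    data = json.load(f)

vocab = data["model"]["vocab"]
new_vocab = {(tok if idx < 6 else clean(tok)): idx for tok, idx in vocab.items()}

# Rewrite each merge with the same function, and keep it only if both
# sides and their concatenation still exist in the cleaned vocab.
new_merges = []
for a, b in data["model"]["merges"]:
    a2, b2 = clean(a), clean(b)
    if a2 in new_vocab and b2 in new_vocab and a2 + b2 in new_vocab:
        new_merges.append([a2, b2])

data["model"]["vocab"] = new_vocab
data["model"]["merges"] = new_merges

with open(tokenizer_path + "tokenizer.json", "w") as f:
    json.dump(data, f, ensure_ascii=False)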
Hi, can this please be merged as a different (de)serialization format? Several issues over the years have run into this problem and used this PR to fix it. Keeping it as a separate serialization format would be fully backwards compatible and would explicitly alert people to use the different functions to (de)serialize the tokenizer.