
Merges cannot handle tokens containing spaces.

Narsil opened this pull request 2 years ago • 15 comments

This PR fixes the issue while keeping backward compatibility. We don't want to merge it blindly.

Narsil · Feb 16 '22

Hi. I found this PR through https://github.com/huggingface/tokenizers/issues/566.

How can I use the PR? I wasn't able to see the Rust files in my tokenizers site-packages, and the README on the repo doesn't give instructions for building from source.

nihirv · Nov 22 '22

Here are the docs for installing from source:

https://huggingface.co/docs/tokenizers/installation#installation-from-sources

I rebased against main, so this PR now has all the features developed since. Please report whether it suits your use case; that would strengthen the case for merging. I hesitate to merge unused features, so knowing there are real use cases helps.

Narsil · Nov 23 '22


Thanks for the speedy response Narsil.

To clarify: this PR allows the merges in our tokenizer.json file to contain tokens that themselves include whitespace?

I'll play around with it tomorrow and let you know 👍

nihirv · Nov 24 '22

Yes, exactly. This allows any number of spaces, or any other character, in the merges.
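
For illustration, the difference in the serialized merges looks roughly like this (a sketch only; the sample tokens are invented, not taken from any real tokenizer.json):

    # Sketch of the "model" -> "merges" entries in tokenizer.json.
    legacy_merges = [
        "t he",            # old format: one space-separated string per merge
        "New York City",   # ambiguous: merge "New York" + "City", or "New" + "York City"?
    ]
    new_merges = [
        ["t", "he"],           # new format: a length-2 list of strings,
        ["New York", "City"],  # so either side may contain spaces or any other character
    ]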

Narsil · Nov 24 '22

Wow that installation process was painless. I guess there is a first time for everything!

Anyway, I'm still hitting the error from the original issue: `data did not match any variant of untagged enum ModelWrapper at line 24420 column 3`. The merges in tokenizer.json are now a length-2 list of strings instead of a single string, so I'm pretty sure your branch was used successfully.

The error is thrown at this line: `fast_tokenizer = TokenizerFast.from_file(fast_tokenizer_file)`, which lives in `transformers/tokenization_utils_fast.py`.
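
For reference, the same error can be reproduced outside transformers by loading the file directly with the tokenizers Python binding (a sketch; the path is a placeholder):

    # Loading the serialized tokenizer directly surfaces the same generic
    # serde error when the "model" section cannot be deserialized.
    from tokenizers import Tokenizer

    try:
        Tokenizer.from_file("tokenizers/my_language/tokenizer.json")  # placeholder path
    except Exception as exc:
        print(exc)  # data did not match any variant of untagged enum ModelWrapper ...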

nihirv · Nov 24 '22

Could you share your serialized file somehow? I could take a look.

Narsil · Nov 24 '22

https://pastebin.com/paT6zQhh

nihirv · Nov 24 '22

Hey man, did you get a chance to take a look?

nihirv · Nov 28 '22

Thanks for the reminder. I just took a look; it seems the merges you are referring to in your file are not present in the vocabulary, which makes the merges unsound.

How did you create that file?

Narsil · Nov 28 '22

Here is how I'm testing this in Rust to get a better error message:

    // Add this to the BPE model's test module and run `cargo test`.
    #[test]
    fn test_bpe_from_file2() {
        // Deserialize a BPE model straight from JSON ("test.json" should contain
        // the "model" section of the tokenizer.json); serde reports a much more
        // precise error here than the one surfaced in Python.
        let data = std::fs::read_to_string("test.json").unwrap();
        let _bpe: BPE = serde_json::from_str(&data).unwrap();
    }

Just add this to the Rust source code and run `cargo test` to see how it goes (update `test.json` to point to the "model" part of the JSON).

Making better error messages propagate all the way into Python is trickier because of the libraries we use and how this library is structured (though it would be really cool to have).

Narsil · Nov 28 '22

I'm trying to get BPE to learn the ideal character sequence to split on instead of splitting on whitespace first.

Something like this:

    import os

    from tokenizers import Tokenizer, normalizers
    from tokenizers.models import BPE
    from tokenizers.normalizers import Lowercase
    from tokenizers.trainers import BpeTrainer
    from transformers import PreTrainedTokenizerFast

    # `file_list_for_language` and `language` are defined elsewhere in my script.
    tokenizer = Tokenizer(BPE())  # type: ignore
    # The data has been preprocessed to add Ġ to the beginning of every word in the corpus.
    tokenizer.normalizer = normalizers.Sequence([Lowercase(), normalizers.Replace("ġ", "Ġ")])

    # Customize training.
    trainer = BpeTrainer(
        vocab_size=5000,
        min_frequency=2,
        show_progress=True,
        special_tokens=["<pad>", "<s>", "</s>", "<unk>", "<mask>"],
    )
    tokenizer.train(files=file_list_for_language, trainer=trainer)

    tokenizer_path = f"tokenizers/{language}/"
    os.makedirs(tokenizer_path, exist_ok=True)
    new_tokenizer = PreTrainedTokenizerFast(tokenizer_object=tokenizer)
    new_tokenizer.save_pretrained(tokenizer_path)

I'm also doing some postprocessing which modifies the vocabulary directly. This may be where the discrepancy with the merges is coming from:

    import json
    from typing import Dict

    with open(tokenizer_path + "tokenizer.json", "r") as f:
        data = json.load(f)
        model_vocab: Dict[str, int] = data["model"]["vocab"]
        postprocessed_vocab = {}
        for tok, idx in model_vocab.items():
            # Leave the lowest ids (the special tokens) untouched.
            if idx < 6:
                postprocessed_vocab[tok] = idx
                continue
            # Strip whitespace and trailing markers from the token string.
            tok = tok.strip()
            if tok.endswith("Ġ.##"):
                tok = tok[:-4].strip()
            if tok.endswith("Ġ"):
                tok = tok[:-1].strip()

            postprocessed_vocab[tok] = idx
    # postprocessed_vocab is then written back into tokenizer.json (not shown).

I'm not entirely sure how I can "clean" my vocabulary while making the corresponding changes to the merges?

nihirv · Nov 29 '22

Yes, it's your postprocessing that is most likely causing the issue.
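
One quick way to see the problem is to check the BPE invariant directly: every merge should pair two tokens that exist in the vocab, and the merged token should also be a vocab entry. A rough sketch (not the library's own validation; the path is a placeholder):

    # Check that every merge references tokens that are still in the vocab
    # after post-processing, and that the merged result is in the vocab too.
    import json

    with open("tokenizers/my_language/tokenizer.json") as f:  # placeholder path
        model = json.load(f)["model"]

    vocab = model["vocab"]
    for merge in model["merges"]:
        # legacy format: "a b" (split on the first space); this PR: ["a", "b"]
        a, b = merge if isinstance(merge, list) else merge.split(" ", 1)
        for piece in (a, b, a + b):
            if piece not in vocab:
                print(f"unsound merge {merge!r}: {piece!r} missing from vocab")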

Narsil · Nov 29 '22

Hi, can this please be merged as a different (de)serialization format? Several issues over the years have run into this problem and used this PR to fix it. Making it a separate serialization format would be fully backwards compatible and would explicitly alert people to use the different functions to (de)serialize the tokenizer.

ashutoshbsathe · Apr 08 '24