
Support identity normalizer in SentencePiece model

Open chlorochrule opened this pull request 1 year ago • 6 comments

What does this PR do?

SentencePiece can train a model with a specified normalizer (for example normalization_rule_name="nfkc"). https://github.com/google/sentencepiece/blob/master/doc/normalization.md

However, when normalization_rule_name="identity" is used, no normalization is performed and proto.normalizer_spec.precompiled_charsmap in the SentencePiece model is empty. Loading such a model with AlbertTokenizerFast.from_pretrained raises the following error:

>>> tokenizer = AlbertTokenizerFast.from_pretrained('.')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/nminami/.pyenv/versions/3.10.4/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 1804, in from_pretrained
    return cls._from_pretrained(
  File "/Users/nminami/.pyenv/versions/3.10.4/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 1959, in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
  File "/Users/nminami/.pyenv/versions/3.10.4/lib/python3.10/site-packages/transformers/models/albert/tokenization_albert_fast.py", line 148, in __init__
    super().__init__(
  File "/Users/nminami/.pyenv/versions/3.10.4/lib/python3.10/site-packages/transformers/tokenization_utils_fast.py", line 114, in __init__
    fast_tokenizer = convert_slow_tokenizer(slow_tokenizer)
  File "/Users/nminami/.pyenv/versions/3.10.4/lib/python3.10/site-packages/transformers/convert_slow_tokenizer.py", line 1162, in convert_slow_tokenizer
    return converter_class(transformer_tokenizer).converted()
  File "/Users/nminami/.pyenv/versions/3.10.4/lib/python3.10/site-packages/transformers/convert_slow_tokenizer.py", line 503, in converted
    tokenizer.normalizer = self.normalizer(self.proto)
  File "/Users/nminami/.pyenv/versions/3.10.4/lib/python3.10/site-packages/transformers/convert_slow_tokenizer.py", line 535, in normalizer
    list_normalizers.append(normalizers.Precompiled(precompiled_charsmap))
Exception: Error while attempting to build Precompiled normalizer: Cannot parse precompiled_charsmap

This error is caused by passing empty bytes to normalizers.Precompiled. This PR prevents the problem by checking proto.normalizer_spec.name before passing the empty bytes.
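
A minimal sketch of that guard, assuming the converter's normalizer(proto) method and the list_normalizers variable seen in the traceback above (the exact patch in this PR may differ):

# Hedged sketch: skip the Precompiled normalizer when the model was
# trained with the identity rule, since its charsmap is empty bytes.
spec = proto.normalizer_spec
if spec.name != "identity" and spec.precompiled_charsmap:
    list_normalizers.append(normalizers.Precompiled(spec.precompiled_charsmap))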

How to reproduce this problem

OS/Arch: macOS/Apple Silicon
Python 3.10.4 (main, Jun 26 2022, 22:29:49) [Clang 13.0.0 (clang-1300.0.27.3)] on darwin

protobuf==3.19.0
sentencepiece==0.1.97
transformers==4.28.1

Train and save a SentencePiece model using python/test/botchan.txt (the test corpus in the sentencepiece repository).

import sentencepiece as spm
# Train with the identity normalization rule; this leaves
# normalizer_spec.precompiled_charsmap empty in the resulting model.
spm.SentencePieceTrainer.train(input='python/test/botchan.txt', model_prefix='spiece', vocab_size=1000, normalization_rule_name='identity')
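
An optional sanity check (not part of the original report): the trained model's proto can be inspected with the protobuf module that ships with sentencepiece to confirm the charsmap is empty.

from sentencepiece import sentencepiece_model_pb2 as sp_model

# Parse the freshly trained model and inspect its normalizer spec;
# with the identity rule, the precompiled charsmap is empty bytes.
proto = sp_model.ModelProto()
with open('spiece.model', 'rb') as f:
    proto.ParseFromString(f.read())
print(proto.normalizer_spec.name)                       # identity
print(len(proto.normalizer_spec.precompiled_charsmap))  # 0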

Load the SentencePiece model using AlbertTokenizerFast.from_pretrained.

from transformers import AlbertTokenizerFast
tokenizer = AlbertTokenizerFast.from_pretrained('.')

Before submitting

  • [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • [x] Did you read the contributor guideline, Pull Request section?
  • [ ] Was this discussed/approved via a Github issue or the forum? Please add a link to it if that's the case.
  • [x] Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
    • I think this is a bug fix, so no documentation updates are required.
  • [ ] Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.

chlorochrule avatar Apr 17 '23 16:04 chlorochrule

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.

cc @ArthurZucker

amyeroberts avatar Apr 17 '23 17:04 amyeroberts

cc @Narsil if I am missing something (maybe the normalizers in rust should support identity type)

normalizer: None should do nothing.

Most likely a case not handled by our current code; we probably need to check that the spec is set to identity, and not even attempt to create the precompiled_charsmap (since it's invalid and we already have a mechanism for identity)
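
For illustration (this snippet is not from the thread): in the tokenizers library, leaving the normalizer unset already gives identity behavior, which is the existing mechanism referred to above:

from tokenizers import Tokenizer
from tokenizers.models import Unigram

# A tokenizer built without a normalizer applies no normalization;
# its normalizer attribute simply stays None (identity behavior).
tok = Tokenizer(Unigram())
print(tok.normalizer)  # None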

Narsil avatar May 25 '23 13:05 Narsil

@ArthurZucker Thank you for reviewing! I fixed all issues related to an empty precompiled_charsmap, referring to the following code: https://github.com/huggingface/transformers/blob/dc67da01829090ec92dfc24653242cf3f56d1a01/src/transformers/convert_slow_tokenizer.py#L625-L628
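
With such a guard in place, the reproduction above should load cleanly. A hypothetical check (backend_tokenizer is a real transformers attribute; the expected output is an assumption based on the fix):

from transformers import AlbertTokenizerFast

# After the fix, conversion succeeds; backend_tokenizer exposes the
# underlying tokenizers.Tokenizer, whose normalizer should contain no
# Precompiled step for an identity-normalized model.
tokenizer = AlbertTokenizerFast.from_pretrained('.')
print(tokenizer.backend_tokenizer.normalizer)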

chlorochrule avatar Jun 01 '23 17:06 chlorochrule

The current modification LGTM. I'm not sure why the tests fail; maybe rebase?

Narsil avatar Jun 05 '23 14:06 Narsil

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] avatar Jun 29 '23 15:06 github-actions[bot]

Closing in favor of #24618

ArthurZucker avatar Jul 04 '23 00:07 ArthurZucker