Support identity normalizer in SentencePiece model
What does this PR do?
SentencePiece can train a model with a specified normalizer (for example, `normalization_rule_name="nfkc"`).
https://github.com/google/sentencepiece/blob/master/doc/normalization.md
However, with `normalization_rule_name="identity"` no normalization is performed, and `proto.normalizer_spec.precompiled_charsmap` in the resulting SentencePiece model is empty. Loading such a model with `AlbertTokenizerFast.from_pretrained` raises the following error:
```
>>> tokenizer = AlbertTokenizerFast.from_pretrained('.')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/nminami/.pyenv/versions/3.10.4/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 1804, in from_pretrained
    return cls._from_pretrained(
  File "/Users/nminami/.pyenv/versions/3.10.4/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 1959, in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
  File "/Users/nminami/.pyenv/versions/3.10.4/lib/python3.10/site-packages/transformers/models/albert/tokenization_albert_fast.py", line 148, in __init__
    super().__init__(
  File "/Users/nminami/.pyenv/versions/3.10.4/lib/python3.10/site-packages/transformers/tokenization_utils_fast.py", line 114, in __init__
    fast_tokenizer = convert_slow_tokenizer(slow_tokenizer)
  File "/Users/nminami/.pyenv/versions/3.10.4/lib/python3.10/site-packages/transformers/convert_slow_tokenizer.py", line 1162, in convert_slow_tokenizer
    return converter_class(transformer_tokenizer).converted()
  File "/Users/nminami/.pyenv/versions/3.10.4/lib/python3.10/site-packages/transformers/convert_slow_tokenizer.py", line 503, in converted
    tokenizer.normalizer = self.normalizer(self.proto)
  File "/Users/nminami/.pyenv/versions/3.10.4/lib/python3.10/site-packages/transformers/convert_slow_tokenizer.py", line 535, in normalizer
    list_normalizers.append(normalizers.Precompiled(precompiled_charsmap))
Exception: Error while attempting to build Precompiled normalizer: Cannot parse precompiled_charsmap
```
This error is caused by passing empty bytes to `normalizers.Precompiled`. This PR prevents the problem by checking `proto.normalizer_spec.name` before passing empty bytes.
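For reference, a minimal sketch of the guard, written as a standalone helper (`build_normalizer` is a hypothetical name; the real converter method in `convert_slow_tokenizer.py` builds a longer list of normalizers):

```python
from tokenizers import normalizers

def build_normalizer(proto):
    # Hypothetical helper mirroring the converter's normalizer method.
    list_normalizers = []
    precompiled_charsmap = proto.normalizer_spec.precompiled_charsmap
    # With normalization_rule_name="identity" the charsmap is empty bytes,
    # and normalizers.Precompiled(b"") raises "Cannot parse precompiled_charsmap",
    # so only add the Precompiled normalizer when the charsmap is non-empty.
    if precompiled_charsmap:
        list_normalizers.append(normalizers.Precompiled(precompiled_charsmap))
    return normalizers.Sequence(list_normalizers)
```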
How to reproduce this problem
- OS/Arch: macOS / Apple Silicon
- Python 3.10.4 (main, Jun 26 2022, 22:29:49) [Clang 13.0.0 (clang-1300.0.27.3)] on darwin
- protobuf==3.19.0
- sentencepiece==0.1.97
- transformers==4.28.1
Save a SentencePiece model trained on `python/test/botchan.txt`:

```python
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input='python/test/botchan.txt',
    model_prefix='spiece',
    vocab_size=1000,
    normalization_rule_name='identity',
)
```
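To confirm the trained model really has an empty charsmap, you can inspect the proto (a quick sanity check, not part of the PR, assuming the `sentencepiece_model_pb2` module bundled with the pip package):

```python
from sentencepiece import sentencepiece_model_pb2 as model_pb2

proto = model_pb2.ModelProto()
with open('spiece.model', 'rb') as f:
    proto.ParseFromString(f.read())

print(proto.normalizer_spec.name)                       # 'identity'
print(len(proto.normalizer_spec.precompiled_charsmap))  # 0 -> empty bytes
```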
Load the SentencePiece model using `AlbertTokenizerFast.from_pretrained`:

```python
from transformers import AlbertTokenizerFast

tokenizer = AlbertTokenizerFast.from_pretrained('.')
```
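For comparison, the slow (sentencepiece-backed) tokenizer is expected to load the same files without error, since it never builds a `Precompiled` normalizer; the failure is specific to the fast-tokenizer conversion path (illustrative):

```python
from transformers import AlbertTokenizer

# The slow tokenizer calls into sentencepiece directly and does not go
# through convert_slow_tokenizer, so the identity model loads fine.
slow_tokenizer = AlbertTokenizer.from_pretrained('.')
```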
Before submitting
- [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
- [x] Did you read the contributor guideline, Pull Request section?
- [ ] Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
- [x] Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
  - I think this is a bug fix, so no documentation updates are required.
- [ ] Did you write any new necessary tests?
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.
cc @ArthurZucker
cc @Narsil in case I am missing something (maybe the normalizers in Rust should support an identity type)
`normalizer: None` should do nothing.
Most likely this is a case not handled by our current code; we probably need to check that the spec is set to identity and not even attempt to create the `precompiled_charsmap` (since it's invalid, and we already have a mechanism for identity).
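As an aside, identity normalization at the `tokenizers` level is just the absence of any transform; an empty `normalizers.Sequence` behaves the same way (illustrative, not from the PR):

```python
from tokenizers import normalizers

identity = normalizers.Sequence([])  # no-op: applies no normalization
nfkc = normalizers.NFKC()

text = "ＡＢＣ"                        # full-width characters
print(identity.normalize_str(text))  # 'ＡＢＣ' (unchanged)
print(nfkc.normalize_str(text))      # 'ABC'   (folded by NFKC)
```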
@ArthurZucker Thank you for reviewing!
I fixed all issues related to empty `precompiled_charsmap`, referring to the following code:
https://github.com/huggingface/transformers/blob/dc67da01829090ec92dfc24653242cf3f56d1a01/src/transformers/convert_slow_tokenizer.py#L625-L628
The current modification LGTM. I'm not sure why the tests fail; maybe rebase?
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Closing in favor of #24618