tokenizers
thread '<unnamed>' panicked at /home/runner/work/tokenizers/tokenizers/tokenizers/src/tokenizer/added_vocabulary.rs:428:22: AddedVocabulary bad split
I randomly hit this, and then everything hung. I'm using tokenizers 0.15.1.
thread '<unnamed>' panicked at /home/runner/work/tokenizers/tokenizers/tokenizers/src/tokenizer/added_vocabulary.rs:428:22: AddedVocabulary bad split note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
It happened again; it happens every time I run the h2oGPT test suite.
Can you share a reproducer? I have no idea what h2oGPT is.
@ArthurZucker Here is a repro:
tokenizers==0.15.1 transformers==4.38.1
from transformers import AutoTokenizer
base_model = 'lmsys/fastchat-t5-3b-v1.0'
tokenizer = AutoTokenizer.from_pretrained(base_model)
x = """A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the questions by the human.
### Human: Got any creative birthday ideas for a 10 year old?
### Assistant: Of course! Here are some creative birthday party ideas for a 10-year-old:
1. Treasure Hunt: Organize a treasure hunt in your backyard or nearby park. Create clues and riddles for the kids to solve, leading them to hidden treasures and surprises.
2. Science Party: Plan a science-themed party where kids can engage in fun and interactive experiments. You can set up different stations with activities like making slime, erupting volcanoes, or creating simple chemical reactions.
3. Outdoor Movie Night: Set up a backyard movie night with a projector and a large screen or white sheet. Create a cozy seating area with blankets and pillows, and serve popcorn and snacks while the kids enjoy a favorite movie under the stars.
4. DIY Crafts Party: Arrange a craft party where kids can unleash their creativity. Provide a variety of craft supplies like beads, paints, and fabrics, and let them create their own unique masterpieces to take home as party favors.
5. Sports Olympics: Host a mini Olympics event with various sports and games. Set up different stations for activities like sack races, relay races, basketball shooting, and obstacle courses. Give out medals or certificates to the participants.
6. Cooking Party: Have a cooking-themed party where the kids can prepare their own mini pizzas, cupcakes, or cookies. Provide toppings, frosting, and decorating supplies, and let them get hands-on in the kitchen.
7. Superhero Training Camp: Create a superhero-themed party where the kids can engage in fun training activities. Set up an obstacle course, have them design their own superhero capes or masks, and organize superhero-themed games and challenges.
8. Outdoor Adventure: Plan an outdoor adventure party at a local park or nature reserve. Arrange activities like hiking, nature scavenger hunts, or a picnic with games. Encourage exploration and appreciation for the outdoors.
Remember to tailor the activities to the interests and preferences of the birthday child. Have a great celebration!
### Human: Who are you?
### Assistant:"""
tokenizer.encode(x)
gives:
You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
thread '<unnamed>' panicked at /home/runner/work/tokenizers/tokenizers/tokenizers/src/tokenizer/added_vocabulary.rs:428:22:
AddedVocabulary bad split
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Traceback (most recent call last):
File "/home/jon/h2ogpt/testfast1.py", line 21, in <module>
tokenizer.encode(x)
File "/home/jon/miniconda3/envs/h2ogpt/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2600, in encode
encoded_inputs = self.encode_plus(
File "/home/jon/miniconda3/envs/h2ogpt/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 3008, in encode_plus
return self._encode_plus(
File "/home/jon/miniconda3/envs/h2ogpt/lib/python3.10/site-packages/transformers/tokenization_utils_fast.py", line 576, in _encode_plus
batched_output = self._batch_encode_plus(
File "/home/jon/miniconda3/envs/h2ogpt/lib/python3.10/site-packages/transformers/tokenization_utils_fast.py", line 504, in _batch_encode_plus
encodings = self._tokenizer.encode_batch(
pyo3_runtime.PanicException: AddedVocabulary bad split
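For anyone who wants a test suite to keep running while this is open, here is a minimal sketch of a guard around the encode call. It relies on the fact that PyO3's `PanicException` derives from `BaseException` rather than `Exception`, so a plain `except Exception` will not catch it. The helper name `encode_or_none` is hypothetical and not part of transformers or tokenizers. It only reports the panic; it does not fix the underlying bad split or any hang.

```python
from typing import Callable, List, Optional


def encode_or_none(encode: Callable[[str], List[int]], text: str) -> Optional[List[int]]:
    """Call encode(text) and return None if the Rust side panicked.

    `encode` is any callable such as `tokenizer.encode` (assumption: it
    takes a string and returns a list of token ids).
    """
    try:
        return encode(text)
    except BaseException as exc:  # PanicException does not subclass Exception
        # Match by class name so this sketch does not need to import pyo3_runtime.
        if type(exc).__name__ == "PanicException":
            print(f"tokenizer panicked on input of length {len(text)}: {exc}")
            return None
        # Anything else (KeyboardInterrupt, SystemExit, real errors) is re-raised.
        raise
```

With a loaded tokenizer this would be called as `encode_or_none(tokenizer.encode, x)`.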
This continues to be a problem with huggingface_hub-0.21.4 and tokenizers-0.15.2.
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.
I'm facing this error too
Hey all! Sorry, I'll have a look!
I am pretty sure that the following added token is the issue:
32106: AddedToken(" ", rstrip=False, lstrip=False, single_word=False, normalized=True, special=False),
For example, this call triggers the panic:
tokenizer.encode("hey .")
If I load the tokenizer with AutoTokenizer.from_pretrained("path-to-model", added_tokens_decoder=None), the panic no longer occurs.
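To check whether another model has the same kind of entry, a rough sketch like the one below can list added tokens whose content is empty or whitespace only. The function name `whitespace_only_added_tokens` is hypothetical. It assumes the mapping has the shape of `tokenizer.added_tokens_decoder` (token id to an object with a `.content` attribute, or a plain string), and it only flags suspects; it does not confirm that a given token causes the panic.

```python
from typing import Any, Dict, Mapping


def whitespace_only_added_tokens(added_tokens_decoder: Mapping[int, Any]) -> Dict[int, str]:
    """Return {token_id: content} for added tokens that are empty or only whitespace."""
    suspects: Dict[int, str] = {}
    for token_id, token in added_tokens_decoder.items():
        # AddedToken objects expose `.content`; fall back to the value itself
        # so plain strings also work in this sketch.
        content = str(getattr(token, "content", token))
        if content == "" or content.isspace():
            suspects[token_id] = content
    return suspects
```

With a loaded tokenizer this would be called as `whitespace_only_added_tokens(tokenizer.added_tokens_decoder)`.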
I have solved it, thank you.