run_glue_no_trainer.py script crashes on Mistral model due to tokenizer issue
System Info
transformersversion: 4.36.2- Platform: Linux-5.15.0-91-generic-x86_64-with-glibc2.35
- Python version: 3.10.13
- Huggingface_hub version: 0.19.4
- Safetensors version: 0.4.0
- Accelerate version: 0.25.0
- Accelerate config: - compute_environment: LOCAL_MACHINE - distributed_type: DEEPSPEED - mixed_precision: bf16 - use_cpu: False - debug: False - num_processes: 8 - machine_rank: 0 - num_machines: 1 - rdzv_backend: static - same_network: True - main_training_function: main - deepspeed_config: {'gradient_accumulation_steps': 1, 'offload_optimizer_device': 'none', 'offload_param_device': 'none', 'zero3_init_flag': True, 'zero3_save_16bit_model': False, 'zero_stage': 3} - downcast_bf16: no - tpu_use_cluster: False - tpu_use_sudo: False - tpu_env: []
- PyTorch version (GPU?): 2.1.1+cu121 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): 0.7.5 (cpu)
- Jax version: 0.4.21
- JaxLib version: 0.4.21
- Using GPU in script?: Yes
- Using distributed or parallel set-up in script?: Yes
Who can help?
@ArthurZucker @younesbelkada @pacman100
Information
- [X] The official example scripts
- [ ] My own modified scripts
Tasks
- [X] An officially supported task in the
examplesfolder (such as GLUE/SQuAD, ...) - [ ] My own task or dataset (give details below)
Reproduction
Check out the transformers repo, and run this command (on a large server with appropriately configured accelerate, so it won't OOM):
python run_glue_no_trainer.py --model_name_or_path mistralai/Mistral-7B-v0.1 --task_name sst2 --per_device_train_batch_size 4 --learning_rate 2e-5 --num_train_epochs 3 --output_dir /tmp/sst2
It will crash with this error and stack trace:
You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
Traceback (most recent call last):
File "/scratch/brr/run_glue.py", line 662, in <module>
main()
File "/scratch/brr/run_glue.py", line 545, in main
for step, batch in enumerate(active_dataloader):
File "/scratch/miniconda3/envs/brr/lib/python3.10/site-packages/accelerate/data_loader.py", line 448, in __iter__
current_batch = next(dataloader_iter)
File "/scratch/miniconda3/envs/brr/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 630, in __next__
data = self._next_data()
File "/scratch/miniconda3/envs/brr/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 674, in _next_data
data = self._dataset_fetcher.fetch(index) # may raise StopIteration
File "/scratch/miniconda3/envs/brr/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 54, in fetch
return self.collate_fn(data)
File "/scratch/miniconda3/envs/brr/lib/python3.10/site-packages/transformers/data/data_collator.py", line 249, in __call__
batch = self.tokenizer.pad(
File "/scratch/miniconda3/envs/brr/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 3259, in pad
padding_strategy, _, max_length, _ = self._get_padding_truncation_strategies(
File "/scratch/miniconda3/envs/brr/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2707, in _get_padding_truncation_strategies
raise ValueError(
ValueError: Asking to pad but the tokenizer does not have a padding token. Please select a token to use as `pad_token` `(tokenizer.pad_token = tokenizer.eos_token e.g.)` or add a new pad token via `tokenizer.add_special_tokens({'pad_token': '[PAD]'})`.
/scratch/miniconda3/envs/brr/lib/python3.10/tempfile.py:860: ResourceWarning: Implicitly cleaning up <TemporaryDirectory '/tmp/tmprbynkmzk'>
_warnings.warn(warn_message, ResourceWarning)
Expected behavior
It should train without crashing.
Adding these lines seems to fix it, not sure if this is the best/most general solution though:
tokenizer = AutoTokenizer.from_pretrained(
args.model_name_or_path, use_fast=not args.use_slow_tokenizer, trust_remote_code=args.trust_remote_code
)
tokenizer.pad_token = tokenizer.eos_token
config.pad_token_id = tokenizer.pad_token_id
model = AutoModelForSequenceClassification.from_pretrained(
args.model_name_or_path,
from_tf=bool(".ckpt" in args.model_name_or_path),
config=config,
ignore_mismatched_sizes=args.ignore_mismatched_sizes,
trust_remote_code=args.trust_remote_code,
)
Hi @rosario-purple, thanks for raising this issue!
The proposed fix is the recommended way to address this. Would you like to open a PR to add this to the script? This way you get the github contribution
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
@amyeroberts can i take this up?
@JINO-ROHIT Sure!