transformers run_glue_no_trainer.py script crashes on Mistral model due to tokenizer issue

System Info

transformers version: 4.36.2
Platform: Linux-5.15.0-91-generic-x86_64-with-glibc2.35
Python version: 3.10.13
Huggingface_hub version: 0.19.4
Safetensors version: 0.4.0
Accelerate version: 0.25.0
Accelerate config:
    - compute_environment: LOCAL_MACHINE
    - distributed_type: DEEPSPEED
    - mixed_precision: bf16
    - use_cpu: False
    - debug: False
    - num_processes: 8
    - machine_rank: 0
    - num_machines: 1
    - rdzv_backend: static
    - same_network: True
    - main_training_function: main
    - deepspeed_config: {'gradient_accumulation_steps': 1, 'offload_optimizer_device': 'none', 'offload_param_device': 'none', 'zero3_init_flag': True, 'zero3_save_16bit_model': False, 'zero_stage': 3}
    - downcast_bf16: no
    - tpu_use_cluster: False
    - tpu_use_sudo: False
    - tpu_env: []
PyTorch version (GPU?): 2.1.1+cu121 (True)
Tensorflow version (GPU?): not installed (NA)
Flax version (CPU?/GPU?/TPU?): 0.7.5 (cpu)
Jax version: 0.4.21
JaxLib version: 0.4.21
Using GPU in script?: Yes
Using distributed or parallel set-up in script?: Yes

Who can help?

@ArthurZucker @younesbelkada @pacman100

Information

[X] The official example scripts
[ ] My own modified scripts

Tasks

[X] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
[ ] My own task or dataset (give details below)

Reproduction

Check out the transformers repo, and run this command (on a large server with appropriately configured accelerate, so it won't OOM):

python run_glue_no_trainer.py --model_name_or_path mistralai/Mistral-7B-v0.1 --task_name sst2 --per_device_train_batch_size 4 --learning_rate 2e-5 --num_train_epochs 3 --output_dir /tmp/sst2

It will crash with this error and stack trace:

You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
Traceback (most recent call last):
  File "/scratch/brr/run_glue.py", line 662, in <module>
    main()
  File "/scratch/brr/run_glue.py", line 545, in main
    for step, batch in enumerate(active_dataloader):
  File "/scratch/miniconda3/envs/brr/lib/python3.10/site-packages/accelerate/data_loader.py", line 448, in __iter__
    current_batch = next(dataloader_iter)
  File "/scratch/miniconda3/envs/brr/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 630, in __next__
    data = self._next_data()
  File "/scratch/miniconda3/envs/brr/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 674, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "/scratch/miniconda3/envs/brr/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 54, in fetch
    return self.collate_fn(data)
  File "/scratch/miniconda3/envs/brr/lib/python3.10/site-packages/transformers/data/data_collator.py", line 249, in __call__
    batch = self.tokenizer.pad(
  File "/scratch/miniconda3/envs/brr/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 3259, in pad
    padding_strategy, _, max_length, _ = self._get_padding_truncation_strategies(
  File "/scratch/miniconda3/envs/brr/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2707, in _get_padding_truncation_strategies
    raise ValueError(
ValueError: Asking to pad but the tokenizer does not have a padding token. Please select a token to use as `pad_token` `(tokenizer.pad_token = tokenizer.eos_token e.g.)` or add a new pad token via `tokenizer.add_special_tokens({'pad_token': '[PAD]'})`.
/scratch/miniconda3/envs/brr/lib/python3.10/tempfile.py:860: ResourceWarning: Implicitly cleaning up <TemporaryDirectory '/tmp/tmprbynkmzk'>
  _warnings.warn(warn_message, ResourceWarning)

Expected behavior

It should train without crashing.

Jan 16 '24 14:01 rosario-purple

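Adding the two pad-token lines (`tokenizer.pad_token = ...` and `config.pad_token_id = ...`) between the tokenizer and model loading calls seems to fix it, though I'm not sure this is the best or most general solution: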

    tokenizer = AutoTokenizer.from_pretrained(
        args.model_name_or_path, use_fast=not args.use_slow_tokenizer, trust_remote_code=args.trust_remote_code
    )
    # Added: the tokenizer has no padding token, so reuse the EOS token for padding.
    tokenizer.pad_token = tokenizer.eos_token
    # Added: pass the same padding token ID to the model through its config.
    config.pad_token_id = tokenizer.pad_token_id
    model = AutoModelForSequenceClassification.from_pretrained(
        args.model_name_or_path,
        from_tf=bool(".ckpt" in args.model_name_or_path),
        config=config,
        ignore_mismatched_sizes=args.ignore_mismatched_sizes,
        trust_remote_code=args.trust_remote_code,
    )
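
A more general variant might only fall back to the EOS token when no padding token is set, so models that already define one are left untouched. The following is a minimal sketch of that idea, not what the script does: `ensure_pad_token` is a hypothetical helper, and `FakeTokenizer` is a stand-in that only mimics the `pad_token`, `eos_token` and `pad_token_id` attributes so the snippet runs without downloading a model. The token strings and IDs are illustrative values.

```python
from types import SimpleNamespace


class FakeTokenizer:
    """Stand-in with only the attributes used below (not the real tokenizer class)."""

    def __init__(self, vocab, eos_token, pad_token=None):
        self.vocab = vocab
        self.eos_token = eos_token
        self.pad_token = pad_token

    @property
    def pad_token_id(self):
        # Derive the ID from the current pad token, or None if there is none.
        return None if self.pad_token is None else self.vocab[self.pad_token]


def ensure_pad_token(tokenizer, config):
    """Hypothetical helper: reuse EOS for padding only if no pad token exists."""
    if tokenizer.pad_token is None:
        if tokenizer.eos_token is None:
            raise ValueError("Tokenizer has neither a pad token nor an EOS token.")
        tokenizer.pad_token = tokenizer.eos_token
    # Keep the model config in sync, without overriding an existing value.
    if getattr(config, "pad_token_id", None) is None:
        config.pad_token_id = tokenizer.pad_token_id
    return tokenizer, config


# Illustrative values: a tokenizer without a pad token and a config without a pad ID.
tok = FakeTokenizer({"<s>": 1, "</s>": 2}, eos_token="</s>")
cfg = SimpleNamespace(pad_token_id=None)
ensure_pad_token(tok, cfg)
print(tok.pad_token, cfg.pad_token_id)
```

In the script, the same check would sit between the `AutoTokenizer.from_pretrained` and `AutoModelForSequenceClassification.from_pretrained` calls, operating on the real `tokenizer` and `config` objects.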

Jan 16 '24 14:01 rosario-purple

Hi @rosario-purple, thanks for raising this issue!

The proposed fix is the recommended way to address this. Would you like to open a PR to add this to the script? That way you get the GitHub contribution.

Jan 16 '24 19:01 amyeroberts

@amyeroberts Can I take this up?

Apr 09 '24 11:04 JINO-ROHIT

@JINO-ROHIT Sure!

Apr 09 '24 12:04 amyeroberts