
Error while trying to resume from checkpoint

Open sstoia opened this issue 2 years ago • 7 comments

System Info

  • transformers version: 4.28.1
  • Platform: Linux-3.10.0-1160.95.1.el7.x86_64-x86_64-with-glibc2.17
  • Python version: 3.9.16
  • Huggingface_hub version: 0.19.3
  • Safetensors version: 0.4.0
  • PyTorch version (GPU?): 1.13.1 (False)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: Yes
  • Using distributed or parallel set-up in script?: No

Who can help?

@muellerzr @pacman100

Information

  • [ ] The official example scripts
  • [X] My own modified scripts

Tasks

  • [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • [X] My own task or dataset (give details below)

Reproduction

Some weeks ago I started pre-training a Longformer model for a specific task. The dataset is quite large, so I had to load it with streaming, since it is impossible to load it all at once. Because of this, the evaluation strategy is based on steps instead of epochs.
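
For context, roughly how that streaming setup looks (a minimal sketch; the data files and format here are placeholders, not from my actual script):

    from datasets import load_dataset

    # streaming=True returns an IterableDataset: nothing is materialized up
    # front, so the Trainer cannot know the epoch length. Training therefore
    # has to be bounded by max_steps, with evaluation every N steps rather
    # than per epoch.
    train_dataset = load_dataset(
        "json", data_files="train/*.jsonl.gz", split="train", streaming=True
    )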

I set the max_steps parameter of TrainingArguments to 5e+5, which is equivalent to 1.05 epochs of the dataset. The problem appears when passing the path of the latest checkpoint saved by the Trainer to the resume_from_checkpoint parameter of trainer.train().


trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    data_collator=data_collator,
    preprocess_logits_for_metrics=preprocess_logits_for_metrics,
    compute_metrics=compute_metrics,
    callbacks=[EarlyStoppingCallback(PATIENCE)],
)
trainer.train(resume_from_checkpoint='<path_to_checkpoint>')
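
One detail that may be relevant (my assumption, not something verified in the source): 5e+5 is a Python float, and TrainingArguments does not appear to coerce max_steps to an int, so the float can propagate into the Trainer's epoch arithmetic on resume. Passing an explicit int would sidestep that:

    # Hypothetical fix: 5e+5 is a float literal; int(5e+5) is the 500000
    # the Trainer expects. This assumes the float max_steps is what leaks
    # into epochs_trained / num_train_epochs when resuming.
    training_args = TrainingArguments(
        output_dir='exp_longformer-base_4096',
        max_steps=int(5e+5),
        evaluation_strategy='steps',
    )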


The error obtained is the following:

0%|          | 0/500000.0 [00:00<?, ?it/s]Traceback (most recent call last):
  File "/mnt/beegfs/sstoia/proyectos/experian/run_mlm_restart_from_checkpoint.py", line 154, in <module>
    trainer.train(resume_from_checkpoint = '/mnt/beegfs/sstoia/proyectos/experian/exp_longformer-base_4096/checkpoint-484000')
  File "/mnt/beegfs/sstoia/.conda/envs/poesia/lib/python3.9/site-packages/transformers/trainer.py", line 1662, in train
    return inner_training_loop(
  File "/mnt/beegfs/sstoia/.conda/envs/poesia/lib/python3.9/site-packages/transformers/trainer.py", line 1851, in _inner_training_loop
    for epoch in range(epochs_trained):
TypeError: 'float' object cannot be interpreted as an integer
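
The mechanism is just Python's range() rejecting non-integers; it can be reproduced standalone:

    # Standalone illustration (assumed values): after resuming, the Trainer
    # computed epochs_trained as a float, and range() only accepts ints.
    epochs_trained = 1.0
    try:
        for epoch in range(epochs_trained):
            pass
    except TypeError as e:
        print(e)  # 'float' object cannot be interpreted as an integer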

Expected behavior

The trainer should load the checkpoint and continue the pre-training from it.

sstoia avatar Nov 21 '23 12:11 sstoia

cc @muellerzr

ArthurZucker avatar Nov 21 '23 14:11 ArthurZucker

Gently pinging @muellerzr or @SunMarc if they have time to look into it!

ArthurZucker avatar Jan 16 '24 08:01 ArthurZucker

Gentle ping @muellerzr @SunMarc

amyeroberts avatar Mar 08 '24 11:03 amyeroberts

@sstoia could you provide a full reproducer that we can just run to get the error?

ArthurZucker avatar Mar 25 '24 03:03 ArthurZucker

Same here:

TypeError                                 Traceback (most recent call last)
Cell In[15], line 32
     29 # silence the warnings. re-enable for inference!
     30 model.config.use_cache = False
---> 32 trainer.train()
     33 model.save_pretrained("llama-7b-int4-dolly")

File /usr/local/anaconda3/envs/peft-venv-py310-cu117/lib/python3.10/site-packages/transformers/trainer.py:1539, in Trainer.train(self, resume_from_checkpoint, trial, ignore_keys_for_eval, **kwargs)
   1537         hf_hub_utils.enable_progress_bars()
   1538 else:
-> 1539     return inner_training_loop(
   1540         args=args,
   1541         resume_from_checkpoint=resume_from_checkpoint,
   1542         trial=trial,
   1543         ignore_keys_for_eval=ignore_keys_for_eval,
   1544     )

File /usr/local/anaconda3/envs/peft-venv-py310-cu117/lib/python3.10/site-packages/transformers/trainer.py:1808, in Trainer._inner_training_loop(self, batch_size, args, resume_from_checkpoint, trial, ignore_keys_for_eval)
   1805             _ = list(sampler)
   1807 total_batched_samples = 0
-> 1808 for epoch in range(epochs_trained, num_train_epochs):
   1809     epoch_iterator = train_dataloader
   1810     if hasattr(epoch_iterator, "set_epoch"):

TypeError: 'float' object cannot be interpreted as an integer

imneov avatar Apr 13 '24 14:04 imneov

@imneov a repro would be great for us to work with :)

muellerzr avatar Apr 13 '24 20:04 muellerzr

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] avatar May 08 '24 08:05 github-actions[bot]

Bumping this issue because I encountered it with the latest transformers version (4.43.3) when trying to resume training. The error comes from the line for epoch in range(epochs_trained, num_train_epochs): in trainer.py; for whatever reason, epochs_trained is a float instead of an int.

I manually edited the code in /usr/local/lib/python3.10/dist-packages/transformers/trainer.py to work around the problem:

        print(f'epochs_trained: {epochs_trained}')    # in my case, this printed 1.0
        print(f'num_train_epochs: {num_train_epochs}')
        if not isinstance(epochs_trained, int):
            epochs_trained = int(epochs_trained)      # now it is an int, i.e. 1
        if not isinstance(num_train_epochs, int):
            num_train_epochs = int(num_train_epochs)
        for epoch in range(epochs_trained, num_train_epochs):

Now I can resume training.
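
If you prefer not to edit site-packages, a user-side alternative (a sketch, under the assumption that the float originates from a float-valued max_steps or num_train_epochs in TrainingArguments) is to coerce the arguments to ints before building the Trainer:

    # Hypothetical user-side workaround: make sure the step/epoch arguments
    # are ints before the Trainer derives epochs_trained from them.
    training_args.max_steps = int(training_args.max_steps)
    training_args.num_train_epochs = int(training_args.num_train_epochs)

    trainer = Trainer(model=model, args=training_args, train_dataset=train_dataset)
    trainer.train(resume_from_checkpoint=True)  # True resumes from the last checkpoint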

teddy-f-47 avatar Jul 28 '24 13:07 teddy-f-47

Would you like to open a PR to fix this @teddy-f-47 ?

LysandreJik avatar Jul 29 '24 07:07 LysandreJik