Error while trying to resume from checkpoint
System Info
- `transformers` version: 4.28.1
- Platform: Linux-3.10.0-1160.95.1.el7.x86_64-x86_64-with-glibc2.17
- Python version: 3.9.16
- Huggingface_hub version: 0.19.3
- Safetensors version: 0.4.0
- PyTorch version (GPU?): 1.13.1 (False)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: Yes
- Using distributed or parallel set-up in script?: No
Who can help?
@muellerzr @pacman100
Information
- [ ] The official example scripts
- [X] My own modified scripts
Tasks
- [ ] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [X] My own task or dataset (give details below)
Reproduction
A few weeks ago I started pre-training a Longformer model for a specific task. The dataset is quite large, so I had to load it with streaming, as it is impossible to load it all at once. Because of this, the evaluation strategy is based on steps instead of epochs.
I set the `max_steps` parameter of `TrainingArguments` to `5e+5`, which is equivalent to about 1.05 epochs of the dataset. The problem appears when passing the path to the latest checkpoint saved by the Trainer as the `resume_from_checkpoint` parameter of `Trainer.train`.
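Note that `5e+5` is a Python float literal (`500000.0`), not an int. The training arguments were built roughly like this (a minimal sketch reconstructed from the description above; only `max_steps` and the step-based evaluation are taken from this report, the other values are placeholders):

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="<output_dir>",    # placeholder
    max_steps=5e+5,               # a float literal (500000.0), not the int 500000
    evaluation_strategy="steps",  # streaming dataset, so evaluation is step-based
)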
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    data_collator=data_collator,
    preprocess_logits_for_metrics=preprocess_logits_for_metrics,
    compute_metrics=compute_metrics,
    callbacks=[EarlyStoppingCallback(PATIENCE)],
)
trainer.train(resume_from_checkpoint='<path_to_checkpoint>')
The error obtained is the following:
0%| | 0/500000.0 [00:00<?, ?it/s]Traceback (most recent call last):
File "/mnt/beegfs/sstoia/proyectos/experian/run_mlm_restart_from_checkpoint.py", line 154, in <module>
trainer.train(resume_from_checkpoint = '/mnt/beegfs/sstoia/proyectos/experian/exp_longformer-base_4096/checkpoint-484000')
File "/mnt/beegfs/sstoia/.conda/envs/poesia/lib/python3.9/site-packages/transformers/trainer.py", line 1662, in train
return inner_training_loop(
File "/mnt/beegfs/sstoia/.conda/envs/poesia/lib/python3.9/site-packages/transformers/trainer.py", line 1851, in _inner_training_loop
for epoch in range(epochs_trained):
TypeError: 'float' object cannot be interpreted as an integer
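For context, `range()` only accepts integers, and floor division returns a float as soon as either operand is a float, which is presumably how a float `max_steps` turns the trainer's epoch counters into floats (the progress bar total of `500000.0` above already hints at this). A quick demonstration in a Python shell:

>>> 5e+5 // 476000          # '//' with a float operand returns a float
1.0
>>> range(1.0)              # range() rejects floats
Traceback (most recent call last):
  ...
TypeError: 'float' object cannot be interpreted as an integer
>>> range(int(1.0))         # casting fixes it
range(0, 1)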
Expected behavior
The trainer should load the checkpoint and continue the pre-training from it.
cc @muellerzr
Gently pinging @muellerzr or @SunMarc in case either of you has time to look into it!
Gentle ping @muellerzr @SunMarc
@sstoia could you provide a full reproducer that we can just run to get the error?
Same here:
TypeError                                 Traceback (most recent call last)
Cell In[15], line 32
     29 # silence the warnings. re-enable for inference!
     30 model.config.use_cache = False
---> 32 trainer.train()
     33 model.save_pretrained("llama-7b-int4-dolly")

File /usr/local/anaconda3/envs/peft-venv-py310-cu117/lib/python3.10/site-packages/transformers/trainer.py:1539, in Trainer.train(self, resume_from_checkpoint, trial, ignore_keys_for_eval, **kwargs)
   1537     hf_hub_utils.enable_progress_bars()
   1538 else:
-> 1539     return inner_training_loop(
   1540         args=args,
   1541         resume_from_checkpoint=resume_from_checkpoint,
   1542         trial=trial,
   1543         ignore_keys_for_eval=ignore_keys_for_eval,
   1544     )

File /usr/local/anaconda3/envs/peft-venv-py310-cu117/lib/python3.10/site-packages/transformers/trainer.py:1808, in Trainer._inner_training_loop(self, batch_size, args, resume_from_checkpoint, trial, ignore_keys_for_eval)
   1805     _ = list(sampler)
   1807 total_batched_samples = 0
-> 1808 for epoch in range(epochs_trained, num_train_epochs):
   1809     epoch_iterator = train_dataloader
   1810     if hasattr(epoch_iterator, "set_epoch"):

TypeError: 'float' object cannot be interpreted as an integer
@imneov a reproducer would be great for us to work with :)
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Bumping up this issue because I encountered it with the latest transformers version (4.43.3) when trying to resume training. The error comes from the line `for epoch in range(epochs_trained, num_train_epochs):` in `trainer.py`; for whatever reason, `epochs_trained` is a float instead of an integer.
I manually edited the code in /usr/local/lib/python3.10/dist-packages/transformers/trainer.py to overcome the problem:
print(f'epochs_trained: {epochs_trained}')      # in my case, this printed 1.0
print(f'num_train_epochs: {num_train_epochs}')
if not isinstance(epochs_trained, int):
    epochs_trained = int(epochs_trained)        # now it is an integer, i.e. 1
if not isinstance(num_train_epochs, int):
    num_train_epochs = int(num_train_epochs)
for epoch in range(epochs_trained, num_train_epochs):
Now I can resume training.
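A sketch of what an upstream fix might look like, rather than patching site-packages (this is not the actual patch; `math.ceil` is used for the upper bound here so a partial final epoch is not silently truncated away):

import math

# Force integer loop bounds even if a float max_steps turned these
# counters into floats somewhere upstream.
for epoch in range(int(epochs_trained), math.ceil(num_train_epochs)):
    ...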
Would you like to open a PR to fix this @teddy-f-47 ?