Sharded safetensors are neither loaded nor pushed to the Hub in Trainer
System Info
- `transformers` version: 4.37.2
- Platform: Linux-5.19.17-coreweave-x86_64-with-glibc2.31
- Python version: 3.11.7
- Huggingface_hub version: 0.20.3
- Safetensors version: 0.4.2
- Accelerate version: 0.27.0
- Accelerate config: not found
- PyTorch version (GPU?): 2.2.0+cu121 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: yes
- Using distributed or parallel set-up in script?: no
Who can help?
@muellerzr and @pacman100
Information
- [ ] The official example scripts
- [X] My own modified scripts
Tasks
- [ ] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [X] My own task or dataset (give details below)
Reproduction
This issue occurs when using `load_best_model_at_end` in `Trainer` with models large enough to need sharded checkpoints (e.g. Pythia 2.8b, but not 1.4b).
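For concreteness, a minimal sketch of the kind of script that triggers this (the dataset, hyperparameters, and `output_dir` are illustrative stand-ins for my actual setup, and `push_to_hub=True` assumes you are logged in to the Hub):

```python
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_name = "EleutherAI/pythia-2.8b"  # large enough to be saved as a sharded checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # Pythia has no pad token by default
model = AutoModelForCausalLM.from_pretrained(model_name)

# Any small text dataset works; wikitext is just a convenient stand-in.
dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="train[:1%]")
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
    batched=True,
    remove_columns=dataset.column_names,
).filter(lambda example: len(example["input_ids"]) > 0)

args = TrainingArguments(
    output_dir="pythia-2.8b-repro",  # hypothetical; also becomes the Hub repo name
    per_device_train_batch_size=1,
    evaluation_strategy="steps",
    eval_steps=10,
    save_steps=10,
    max_steps=20,
    save_safetensors=True,        # shards land as model-*.safetensors + model.safetensors.index.json
    load_best_model_at_end=True,  # triggers the failing reload at the end of training
    push_to_hub=True,             # checkpoints are likewise not pushed
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=dataset,
    eval_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```

With 1.4b the checkpoint fits in a single `model.safetensors` file and both paths work; at 2.8b the save is sharded and they do not.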
The check at https://github.com/huggingface/transformers/blob/092f1fdaa4224fdd88c616dc9678e6fcb37bfffd/src/transformers/trainer.py#L2361 should be updated to account not only for the pickle shard index (`pytorch_model.bin.index.json`) but also the safetensors shard index (`model.safetensors.index.json`).
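For illustration, the existence check could be widened along these lines. The helper name `has_sharded_checkpoint` is mine, but `WEIGHTS_INDEX_NAME` and `SAFE_WEIGHTS_INDEX_NAME` are the real constants from `transformers.utils`:

```python
import os

from transformers.utils import SAFE_WEIGHTS_INDEX_NAME, WEIGHTS_INDEX_NAME


def has_sharded_checkpoint(folder: str) -> bool:
    """True if `folder` contains a shard index, whether the shards were
    serialized with pickle (pytorch_model.bin.index.json) or safetensors
    (model.safetensors.index.json). The branch guarding
    `load_sharded_checkpoint` in `Trainer._load_best_model` currently
    tests only the first of these two paths."""
    return any(
        os.path.exists(os.path.join(folder, index_name))
        for index_name in (WEIGHTS_INDEX_NAME, SAFE_WEIGHTS_INDEX_NAME)
    )
```

As far as I can tell, `load_sharded_checkpoint` itself already resolves the safetensors index internally, so widening the guard should be enough on the loading side.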
Expected behavior
I expected the script to push model checkpoints to the Hub, but it does not. Nor does it load the best model at the end, which should be an easy fix at the line linked above.