Sharded safetensors are neither loaded nor pushed to the Hub in Trainer
System Info
- `transformers` version: 4.37.2
- Platform: Linux-5.19.17-coreweave-x86_64-with-glibc2.31
- Python version: 3.11.7
- Huggingface_hub version: 0.20.3
- Safetensors version: 0.4.2
- Accelerate version: 0.27.0
- Accelerate config: not found
- PyTorch version (GPU?): 2.2.0+cu121 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: yes
- Using distributed or parallel set-up in script?: no
Who can help?
@muellerzr and @pacman100
Information
- [ ] The official example scripts
- [X] My own modified scripts
Tasks
- [ ] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [X] My own task or dataset (give details below)
Reproduction
This issue occurs when using `load_best_model_at_end` in `Trainer` with models large enough to need sharded checkpoints (e.g. Pythia 2.8b, but not 1.4b).
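For concreteness, a minimal sketch of the kind of script that triggers this (the dataset, hyperparameters, and `output_dir` are illustrative stand-ins for my actual setup, and `push_to_hub=True` assumes you are logged in to the Hub):

```python
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_name = "EleutherAI/pythia-2.8b"  # large enough to be saved as a sharded checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # Pythia has no pad token by default
model = AutoModelForCausalLM.from_pretrained(model_name)

# Any small text dataset works; wikitext is just a convenient stand-in.
dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="train[:1%]")
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
    batched=True,
    remove_columns=dataset.column_names,
).filter(lambda example: len(example["input_ids"]) > 0)

args = TrainingArguments(
    output_dir="pythia-2.8b-repro",  # hypothetical; also becomes the Hub repo name
    per_device_train_batch_size=1,
    evaluation_strategy="steps",
    eval_steps=10,
    save_steps=10,
    max_steps=20,
    save_safetensors=True,        # shards land as model-*.safetensors + model.safetensors.index.json
    load_best_model_at_end=True,  # triggers the failing reload at the end of training
    push_to_hub=True,             # checkpoints are likewise not pushed
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=dataset,
    eval_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```

With 1.4b the checkpoint fits in a single `model.safetensors` file and both paths work; at 2.8b the save is sharded and they do not.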
The check at https://github.com/huggingface/transformers/blob/092f1fdaa4224fdd88c616dc9678e6fcb37bfffd/src/transformers/trainer.py#L2361 should be updated to account not only for the pickle shard index (`pytorch_model.bin.index.json`) but also the safetensors shard index (`model.safetensors.index.json`).
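For illustration, the existence check could be widened along these lines. The helper name `has_sharded_checkpoint` is mine, but `WEIGHTS_INDEX_NAME` and `SAFE_WEIGHTS_INDEX_NAME` are the real constants from `transformers.utils`:

```python
import os

from transformers.utils import SAFE_WEIGHTS_INDEX_NAME, WEIGHTS_INDEX_NAME


def has_sharded_checkpoint(folder: str) -> bool:
    """True if `folder` contains a shard index, whether the shards were
    serialized with pickle (pytorch_model.bin.index.json) or safetensors
    (model.safetensors.index.json). The branch guarding
    `load_sharded_checkpoint` in `Trainer._load_best_model` currently
    tests only the first of these two paths."""
    return any(
        os.path.exists(os.path.join(folder, index_name))
        for index_name in (WEIGHTS_INDEX_NAME, SAFE_WEIGHTS_INDEX_NAME)
    )
```

As far as I can tell, `load_sharded_checkpoint` itself already resolves the safetensors index internally, so widening the guard should be enough on the loading side.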
Expected behavior
I expected the script to push model checkpoints to the Hub, but it does not. Nor does it load the best model at the end, which should be an easy fix at the line linked above.