
Sharded safetensors checkpoints are neither loaded nor pushed to the Hub in Trainer

Open · AlexTMallen opened this issue 11 months ago · 0 comments

System Info

  • transformers version: 4.37.2
  • Platform: Linux-5.19.17-coreweave-x86_64-with-glibc2.31
  • Python version: 3.11.7
  • Huggingface_hub version: 0.20.3
  • Safetensors version: 0.4.2
  • Accelerate version: 0.27.0
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.2.0+cu121 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: yes
  • Using distributed or parallel set-up in script?: no

Who can help?

@muellerzr and @pacman100

Information

  • [ ] The official example scripts
  • [X] My own modified scripts

Tasks

  • [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • [X] My own task or dataset (give details below)

Reproduction

This issue occurs when using load_best_model_at_end in Trainer with models large enough to need sharded checkpoints (e.g. Pythia 2.8b but not 1.4b).
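A minimal sketch of a setup that triggers this (the model id, dummy dataset, and training arguments below are illustrative assumptions, not the exact script from the report; pushing to the Hub also assumes you are logged in):

```python
# Reproduction sketch: train a model large enough to be saved as a sharded
# safetensors checkpoint, with load_best_model_at_end and push_to_hub enabled.
import torch
from torch.utils.data import Dataset
from transformers import AutoModelForCausalLM, Trainer, TrainingArguments

class DummyDataset(Dataset):
    """Tiny random LM dataset, just enough to run a couple of steps."""
    def __len__(self):
        return 8
    def __getitem__(self, idx):
        ids = torch.randint(0, 1000, (16,))
        return {"input_ids": ids, "labels": ids.clone()}

# Pythia 2.8b is large enough that save_pretrained shards the checkpoint.
model = AutoModelForCausalLM.from_pretrained("EleutherAI/pythia-2.8b")

args = TrainingArguments(
    output_dir="repro-out",
    max_steps=2,
    evaluation_strategy="steps",
    eval_steps=1,
    save_strategy="steps",
    save_steps=1,
    save_safetensors=True,        # the default in v4.37; yields model.safetensors shards + index
    load_best_model_at_end=True,  # the code path with the incomplete index check
    push_to_hub=True,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=DummyDataset(),
    eval_dataset=DummyDataset(),
)
trainer.train()  # best model is not reloaded at the end; checkpoints are not pushed
```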

https://github.com/huggingface/transformers/blob/092f1fdaa4224fdd88c616dc9678e6fcb37bfffd/src/transformers/trainer.py#L2361 should be updated to account not only for the pickle (pytorch_model.bin) shard index but also for the safetensors shard index.
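A sketch of the suggested fix (not a verified patch): the existing check only looks for WEIGHTS_INDEX_NAME ("pytorch_model.bin.index.json"), but a sharded safetensors checkpoint only produces SAFE_WEIGHTS_INDEX_NAME ("model.safetensors.index.json"), so the guard should accept either. The checkpoint path below is a placeholder.

```python
import os
from transformers.utils import SAFE_WEIGHTS_INDEX_NAME, WEIGHTS_INDEX_NAME

best_model_checkpoint = "repro-out/checkpoint-2"  # placeholder path

# Accept either shard index file, instead of only the pickle one.
has_sharded_checkpoint = any(
    os.path.exists(os.path.join(best_model_checkpoint, index_name))
    for index_name in (WEIGHTS_INDEX_NAME, SAFE_WEIGHTS_INDEX_NAME)
)
```

Once this guard passes, load_sharded_checkpoint in transformers.modeling_utils should, as far as I can tell, already handle both checkpoint formats.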

Expected behavior

I expected the script to push model checkpoints to the Hub, but it does not. Nor does it load the best model at the end (which should be an easy fix at the line linked above).

AlexTMallen · Mar 22 '24