
FSDP training not saving/loading the best checkpoint

Open BSharmi opened this issue 1 year ago • 0 comments

Hi there!

I followed the example for training a T5 model with FSDP on SageMaker: https://github.com/huggingface/notebooks/blob/main/sagemaker/25_pytorch_fsdp_model_parallelism/scripts/run_clm.py

I noticed that checkpointing is disabled with save_strategy="no". Is that intentional (line https://github.com/huggingface/notebooks/blob/main/sagemaker/25_pytorch_fsdp_model_parallelism/scripts/run_clm.py#L93)? In my training I changed it to save_strategy="steps" and noticed two issues (rough sketches of my setup follow the list):

  1. The best checkpoint (minimum validation loss) is not saved. For example, if I set the limit to 2, only the last 2 checkpoints are kept.
  2. I was not able to load the trained model from a checkpoint and got the error mentioned in other issues: RuntimeError: Trying to resize storage that is not resizable. This does not happen when I load the final model, but it makes training hard, since I would need to know in advance when to stop training so that the final model is also the one with the minimum loss. I tried different versions:

PyTorch 1.13 + Transformers 4.26

PyTorch 2.0.0 + Transformers 4.28.1

and see the same issue when loading a model from a checkpoint.
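For context, this is roughly how I set the training arguments after changing the save strategy. A minimal sketch only: paths are placeholders, the FSDP values are what I recall from the example script, and load_best_model_at_end / metric_for_best_model are additions I made while trying to keep the best checkpoint.

```python
from transformers import TrainingArguments

# Sketch of the arguments I changed relative to the example run_clm.py;
# everything else was kept as in the script.
training_args = TrainingArguments(
    output_dir="/opt/ml/checkpoints",              # placeholder path
    evaluation_strategy="steps",
    eval_steps=500,
    save_strategy="steps",                         # changed from save_strategy="no"
    save_steps=500,
    save_total_limit=2,                            # only the 2 most recent checkpoints survive
    load_best_model_at_end=True,                   # expected to restore the best checkpoint
    metric_for_best_model="eval_loss",
    greater_is_better=False,
    fsdp="full_shard auto_wrap",
    fsdp_transformer_layer_cls_to_wrap="T5Block",
)
```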
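And this is roughly how I load the model afterwards (again a sketch; the checkpoint directory name and paths are placeholders, and I use AutoModelForSeq2SeqLM since I am training T5):

```python
from transformers import AutoModelForSeq2SeqLM

# Loading an intermediate checkpoint written during training fails for me with
# "RuntimeError: Trying to resize storage that is not resizable"
model = AutoModelForSeq2SeqLM.from_pretrained("/opt/ml/checkpoints/checkpoint-1000")

# Loading the final model saved at the end of training works fine
model = AutoModelForSeq2SeqLM.from_pretrained("/opt/ml/model")
```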

Would appreciate any pointers.

Thank you!

BSharmi · Jan 22 '24 21:01