sagemaker-python-sdk

OOM when resuming training from checkpoint

Open • renziver opened this issue 1 year ago • 0 comments

Describe the bug

I get an OOM error when training with the Hugging Face Estimator API while following the Spot Instance training sample, which requires resuming training from checkpoints.

My training script uses the Trainer API and is passed to the Hugging Face estimator to run in the SageMaker environment:

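from transformers import Trainer
from transformers.trainer_utils import get_last_checkpoint

trainer = Trainer(
    model=model,
    args=train_args,
    data_collator=data_collator,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
)
if train_from_checkpoint:
    # Resume from the most recent checkpoint found in checkpoint_dir.
    last_checkpoint = get_last_checkpoint(checkpoint_dir)
    trainer.train(resume_from_checkpoint=last_checkpoint)
else:
    trainer.train()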
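
For readers reproducing this, the resume branch above can be guarded so that the first spot run (when the checkpoint folder is empty or missing) falls back to training from scratch. The following is only a minimal standard-library sketch: `find_last_checkpoint` is a hypothetical helper, not part of `transformers` or the SageMaker SDK, and it assumes checkpoints are stored in subfolders named `checkpoint-<step>`. It does not address the OOM itself.

```python
import os
import re
from typing import Optional

# Assumption: checkpoints live in subfolders named "checkpoint-<step>".
_CHECKPOINT_RE = re.compile(r"^checkpoint-(\d+)$")


def find_last_checkpoint(checkpoint_dir: str) -> Optional[str]:
    """Hypothetical helper: return the highest-step checkpoint path, or None."""
    # A missing folder means there is nothing to resume from.
    if not os.path.isdir(checkpoint_dir):
        return None
    steps = []
    for name in os.listdir(checkpoint_dir):
        match = _CHECKPOINT_RE.match(name)
        # Only count real directories whose name matches the pattern.
        if match and os.path.isdir(os.path.join(checkpoint_dir, name)):
            steps.append((int(match.group(1)), name))
    if not steps:
        return None
    # Pick the entry with the largest step number.
    return os.path.join(checkpoint_dir, max(steps)[1])
```

With this helper, `trainer.train(resume_from_checkpoint=find_last_checkpoint(checkpoint_dir))` would cover both cases, since passing `None` starts a fresh run.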

I have confirmed that the same training script completes successfully when run without the Hugging Face estimator API. This led me to the following issue on Hugging Face, which reports a memory leak when training is resumed from a checkpoint, something spot instance training relies on heavily.

Screenshots or logs

[Screenshot: Screen Shot 2022-09-13 at 4 33 35 PM]

System information

  • SageMaker Python SDK version: 2.108.0
  • PyTorch version: 1.10.2
  • Hugging Face version: 4.17.0 (maximum supported by the SageMaker Hugging Face estimator)
  • Python version: 3.8
  • GPU: ml.p3.2xlarge (1 V100 GPU)

renziver • Sep 13 '22 08:09