sagemaker-python-sdk
OOM when resuming training from checkpoint
Describe the bug
I'm getting an out-of-memory (OOM) error when using the Hugging Face Estimator API while following the Spot Instance training sample, which requires a way to resume training from checkpoints.
My training script uses the Trainer API and is passed to the Hugging Face estimator to run in the SageMaker environment:
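from transformers import Trainer
from transformers.trainer_utils import get_last_checkpoint

trainer = Trainer(
    model=model,
    args=train_args,
    data_collator=data_collator,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
)

# Resume from the most recent checkpoint when requested; otherwise start fresh.
if train_from_checkpoint:
    last_checkpoint = get_last_checkpoint(checkpoint_dir)
    trainer.train(resume_from_checkpoint=last_checkpoint)
else:
    trainer.train()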
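For anyone reproducing this: on the first spot run the checkpoint directory is usually empty, so the lookup has to cope with "no checkpoint yet". Below is a minimal stdlib-only sketch of that lookup, assuming checkpoints are saved by the Trainer as `checkpoint-<step>` subdirectories. The name `find_last_checkpoint` is hypothetical and is not part of transformers; it only illustrates what I expect `get_last_checkpoint` to hand back.

```python
import os
import re
from typing import Optional

# Assumption: the Trainer writes checkpoints as "checkpoint-<global_step>" directories.
_CHECKPOINT_RE = re.compile(r"^checkpoint-(\d+)$")


def find_last_checkpoint(checkpoint_dir: str) -> Optional[str]:
    """Return the path of the highest-step checkpoint directory, or None if there is none."""
    if not os.path.isdir(checkpoint_dir):
        return None

    best_step = -1
    best_path = None
    for name in os.listdir(checkpoint_dir):
        match = _CHECKPOINT_RE.match(name)
        path = os.path.join(checkpoint_dir, name)
        # Ignore stray files and anything not named like a checkpoint directory.
        if match is None or not os.path.isdir(path):
            continue
        step = int(match.group(1))
        # Compare steps numerically so checkpoint-1000 beats checkpoint-500.
        if step > best_step:
            best_step, best_path = step, path
    return best_path
```

A result of `None` means there is nothing to resume from, in which case the script should fall through to a plain `trainer.train()`.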
I have confirmed that the same training script completes successfully when it is run without the Hugging Face estimator API. This led me to the following issue on Hugging Face, which reports a memory leak in training that is resumed from checkpoints, something spot instance training relies on heavily.
Screenshots or logs
System information
- SageMaker Python SDK version: 2.108.0
- Pytorch version: 1.10.2
- Hugging Face Transformers version: 4.17.0 (the maximum supported by the SageMaker Hugging Face estimator)
- Python version: 3.8
- GPU: ml.p3.2xlarge (1 V100 GPU)
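To make the setup easier to reproduce, here is a rough sketch of the estimator arguments that correspond to the environment above with spot training and checkpointing turned on. The helper `spot_estimator_kwargs`, the time limits, and the `py38` version string are my assumptions for illustration. The keyword names follow the SageMaker estimator parameters as I understand them (`use_spot_instances`, `max_run`, `max_wait`, `checkpoint_s3_uri`, `checkpoint_local_path`).

```python
from typing import Any, Dict


def spot_estimator_kwargs(
    checkpoint_s3_uri: str,
    max_run: int = 3600,   # assumed training time limit, in seconds
    max_wait: int = 7200,  # assumed total limit including waiting for spot capacity
) -> Dict[str, Any]:
    """Build keyword arguments for a spot-instance Hugging Face estimator (hypothetical helper)."""
    if not checkpoint_s3_uri.startswith("s3://"):
        raise ValueError("checkpoint_s3_uri must be an s3:// URI")
    # For spot training, max_wait must be at least as large as max_run.
    if max_wait < max_run:
        raise ValueError("max_wait must be >= max_run for spot training")

    return {
        "instance_type": "ml.p3.2xlarge",
        "instance_count": 1,
        "transformers_version": "4.17.0",
        "pytorch_version": "1.10.2",
        "py_version": "py38",
        "use_spot_instances": True,
        "max_run": max_run,
        "max_wait": max_wait,
        # Checkpoints written to the local path are synced to this S3 location.
        "checkpoint_s3_uri": checkpoint_s3_uri,
        "checkpoint_local_path": "/opt/ml/checkpoints",
    }


# Intended usage (not executed here; needs the sagemaker package, an IAM role,
# and an entry point script, whose name below is a placeholder):
#   from sagemaker.huggingface import HuggingFace
#   estimator = HuggingFace(entry_point="train.py", role=role,
#                           **spot_estimator_kwargs("s3://my-bucket/checkpoints"))
```

With this configuration, the training script would read and write checkpoints under `/opt/ml/checkpoints`, which is the directory passed as `checkpoint_dir` in the script above.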