Multi-GPU Training - SageMaker + Accelerate
I'm trying to use the following code:
from sagemaker.estimator import Estimator
estimator = Estimator(
    image_uri='aaa.dkr.ecr.us-east-1.amazonaws.com/training',
    role='arn:aws:iam::aaa:role/sagemaker-role',
    instance_count=1,
    entry_point='./train.py',
    instance_type='ml.p4d.24xlarge',
)
estimator.fit()
Inside the train.py script, the Accelerator is initialized in the following way:
from accelerate import Accelerator

accelerator = Accelerator(
    gradient_accumulation_steps=args.gradient_accumulation_steps,
    mixed_precision=args.mixed_precision,
    log_with=args.report_to,
    logging_dir=logging_dir,
)
However, when monitoring the job, we can see that the training code utilizes only one GPU.
What is missing? Is there a proper way to run accelerate config inside a custom container for SageMaker?
cc @philschmid
@xenia-kra The Estimator/SageMaker uses python train.py ... to execute your script; that should be visible in your job's logs. That's why you are only leveraging one GPU.
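You can confirm this from inside the training script by logging what Accelerate actually sees. A minimal sketch (the logging itself is illustrative, not part of the original script):

import torch
from accelerate import Accelerator

accelerator = Accelerator()
# With a plain `python train.py` launch, Accelerate runs a single process,
# so num_processes is 1 even though all 8 GPUs of a ml.p4d.24xlarge are visible.
accelerator.print(
    f"num_processes={accelerator.num_processes}, "
    f"visible_gpus={torch.cuda.device_count()}"
)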
You can improve this by using the distribution configuration inside your estimator, but this doesn't work for custom containers.
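For reference, on the framework containers that distribution configuration looks roughly like the sketch below, using the PyTorch estimator (the framework_version, py_version, and other values are illustrative):

from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    entry_point='train.py',
    role='arn:aws:iam::aaa:role/sagemaker-role',
    instance_count=1,
    instance_type='ml.p4d.24xlarge',
    framework_version='2.0',
    py_version='py310',
    # Launches the script via torchrun with one process per GPU.
    distribution={'torch_distributed': {'enabled': True}},
)
estimator.fit()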
@philschmid So it's not possible to use multi-GPU with a custom container at all? Which side is the blocker in this integration: SageMaker or Accelerate?
This is not related to Accelerate. I am not sure whether the distributed integration is entirely unavailable for custom DLCs, but it at least doesn't work with the vanilla Estimator. If you are interested in distributed training on SageMaker, we created an example for FSDP: https://www.philschmid.de/sagemaker-fsdp-gpt
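One workaround sometimes used with custom containers, independent of SageMaker's distribution integration, is to have the container's entry point spawn one process per GPU itself via torchrun. A minimal sketch of such a hypothetical launcher.py (used as the entry point instead of train.py, assuming PyTorch is installed in the image):

import subprocess
import sys

import torch

if __name__ == '__main__':
    # Re-launch train.py through torchrun with one worker per visible GPU;
    # Accelerate then picks up the distributed environment automatically.
    nproc = torch.cuda.device_count()
    subprocess.run(
        [sys.executable, '-m', 'torch.distributed.run',
         f'--nproc_per_node={nproc}', 'train.py', *sys.argv[1:]],
        check=True,
    )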
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.