Multi-GPU Training - sagemaker + accelerate

Open xenia-kra opened this issue 2 years ago • 4 comments

I'm trying to use the following code:

from sagemaker.estimator import Estimator

estimator = Estimator(image_uri='aaa.dkr.ecr.us-east-1.amazonaws.com/training',
                      role='arn:aws:iam::aaa:role/sagemaker-role',
                      instance_count=1,
                      entry_point='./train.py',
                      instance_type='ml.p4d.24xlarge')

estimator.fit()

Inside the train.py script, Accelerator is initialized as follows:

    accelerator = Accelerator(
        gradient_accumulation_steps=args.gradient_accumulation_steps,
        mixed_precision=args.mixed_precision,
        log_with=args.report_to,
        logging_dir=logging_dir,
    )

However, when monitoring the job, we can see that the training code utilizes only one GPU.

What is missing? Is there a proper way to run accelerate config inside a custom container for SageMaker?

xenia-kra avatar May 10 '23 09:05 xenia-kra

cc @philschmid

sgugger avatar May 10 '23 11:05 sgugger

@xenia-kra the Estimator/SageMaker uses python train.py .... to execute your script, which starts a single process. That should be visible in the logs of your job. That's why you are only leveraging one GPU.

You can fix this by using the distribution configuration of your estimator, but this doesn't work for custom containers.
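
For a custom container, one possible workaround is to point entry_point at a small launcher that starts one process per GPU itself. The following is only a sketch under stated assumptions, not something confirmed in this thread: the file name launch.py and the helper build_launch_command are hypothetical, and SM_NUM_GPUS is assumed to be set inside the container by the SageMaker training toolkit.

```python
# launch.py (hypothetical name): re-launch train.py with one process per GPU.
import os
import subprocess
import sys


def build_launch_command(num_gpus, script, script_args):
    """Return the command line used to start the training script."""
    if num_gpus > 1:
        # `accelerate launch --multi_gpu` spawns `num_processes` workers.
        return [
            "accelerate", "launch",
            "--multi_gpu",
            "--num_processes", str(num_gpus),
            script, *script_args,
        ]
    # Single GPU (or CPU): plain `python train.py ...` is enough.
    return [sys.executable, script, *script_args]


def main():
    # Assumption: the container exposes the GPU count as SM_NUM_GPUS.
    num_gpus = int(os.environ.get("SM_NUM_GPUS", "1"))
    # Forward the hyperparameters SageMaker passed on the command line.
    cmd = build_launch_command(num_gpus, "train.py", sys.argv[1:])
    sys.exit(subprocess.call(cmd))
```

To use it, call main() under the usual if __name__ == "__main__": guard at the bottom of the file and set entry_point='./launch.py' in the Estimator. The container then needs accelerate installed and on the PATH.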

philschmid avatar May 10 '23 13:05 philschmid

@philschmid so it's not possible to use multi-GPU with a custom container at all? Which side is the blocker in this integration: SageMaker or Accelerate?

xenia-kra avatar May 10 '23 13:05 xenia-kra

This is not related to Accelerate. I am not sure whether the distributed integration is entirely unavailable for custom DLCs, but it is at least not available for the vanilla Estimator. If you are interested in distributed training on SageMaker, we created an example for FSDP: https://www.philschmid.de/sagemaker-fsdp-gpt
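
For reference, here is a rough sketch of what the distribution configuration looks like on a framework estimator. It is an assumption that this fits your setup: the arguments below would go to a framework estimator such as sagemaker.huggingface.HuggingFace rather than the vanilla Estimator, and the role and framework versions are deliberately left out.

```python
# Ask SageMaker to start the script through torchrun, one process per GPU.
distribution = {"torch_distributed": {"enabled": True}}

# Keyword arguments for a framework estimator (not the vanilla Estimator).
estimator_kwargs = dict(
    entry_point="train.py",
    instance_type="ml.p4d.24xlarge",
    instance_count=1,
    distribution=distribution,
)

# Usage (requires the sagemaker SDK, a role and framework versions):
# from sagemaker.huggingface import HuggingFace
# estimator = HuggingFace(role=..., **estimator_kwargs)
# estimator.fit()
```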

philschmid avatar May 10 '23 15:05 philschmid

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] avatar Jun 09 '23 15:06 github-actions[bot]