
Pass args to training script entrypoint for MPI-based Distributed training

Open ChaiBapchya opened this issue 5 years ago • 3 comments

Describe the feature you'd like
Pass arguments to the training script when using Horovod via MPI for distributed training.

Current Situation
~~Only~~ ProcessRunner supports passing hyperparameters: https://github.com/aws/sagemaker-training-toolkit/blob/c357433d6fdbc43a896b25bd126c46f689ddb73c/src/sagemaker_training/process.py#L105-L109

~~MPIRunner doesn't support it.~~ MPIRunner supports it: https://github.com/aws/sagemaker-training-toolkit/blob/c357433d6fdbc43a896b25bd126c46f689ddb73c/src/sagemaker_training/mpi.py#L41-L45
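As a rough illustration of what "passing args" to an MPI-launched script amounts to, the sketch below flattens a hyperparameter dict into command-line flags and appends them after the entry point in an `mpirun`-style command. The helper names `hyperparameters_to_args` and `build_mpi_command` are hypothetical and are not the toolkit's API; the real conversion and `mpirun` invocation in `sagemaker_training` carry more options and may differ in detail.

```python
# Illustrative sketch only: both helpers are hypothetical, not part of sagemaker_training.


def hyperparameters_to_args(hyperparameters):
    """Flatten a dict into ['--key', 'value', ...], sorted by key for a stable order."""
    args = []
    for key in sorted(hyperparameters):
        args.extend(["--" + key, str(hyperparameters[key])])
    return args


def build_mpi_command(entry_point, hyperparameters, num_processes):
    """Assemble a simplified mpirun command with the script args appended last."""
    return (
        ["mpirun", "-np", str(num_processes)]
        + ["python", entry_point]
        + hyperparameters_to_args(hyperparameters)
    )


# 2 hosts x 4 processes per host = 8 processes in total.
command = build_mpi_command("hvd_resnet_mx.py", {"num-epochs": 5}, num_processes=8)
print(" ".join(command))
```

Under these assumptions, every MPI process starts the same script with the same flags, so the script itself only needs ordinary argument parsing.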

How would this feature be used? Please describe.
An example API would be:

mpi_options = '-verbose -x orte_base_help_aggregate=0'
estimator = MXNet(
    entry_point='hvd_resnet_mx.sh',
    role=role,
    train_instance_type='ml.p3.8xlarge',
    train_instance_count=2,
    image_name=image,
    framework_version='1.6.0',
    py_version='py3',
    hyperparameters={'sagemaker_mpi_enabled': True,
                     'sagemaker_mpi_custom_mpi_options': mpi_options,
                     'sagemaker_mpi_num_of_processes_per_host': 4},
    sagemaker_session=sagemaker_session)

Where the entry-point script is hvd_resnet_mx.sh:

! pygmentize hvd_resnet_launcher.sh
./hvd_resnet_mx.py --num-epochs 5
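For the training-script side, here is a minimal sketch of how a script such as `hvd_resnet_mx.py` could read the `--num-epochs` flag shown above. The `parse_args` function and the default value are assumptions for illustration; the actual script in this issue is not shown and may parse its arguments differently.

```python
import argparse


def parse_args(argv=None):
    """Parse training flags; argv=None falls back to sys.argv[1:]."""
    parser = argparse.ArgumentParser(description="Horovod training script (sketch)")
    # Hypothetical default; the real script's default is not given in the issue.
    parser.add_argument("--num-epochs", type=int, default=1)
    # parse_known_args ignores extra flags a launcher might append.
    args, _unknown = parser.parse_known_args(argv)
    return args


# An explicit list stands in for the command line here.
args = parse_args(["--num-epochs", "5"])
print(args.num_epochs)
```

In a real script, calling `parse_args()` with no argument would read the flags from the command line that the launcher built.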

Describe alternatives you've considered
~~Right now, one has to use ProcessRunner instead of MPIRunner to pass a bash script for training~~

estimator = MXNet(
    entry_point='hvd_resnet_launcher.sh',
    role=role,
    train_instance_type='ml.p3.8xlarge',
    train_instance_count=2,
    image_name=image,
    framework_version='1.6.0',
    py_version='py3',
    hyperparameters={'sagemaker_parameter_server_enabled': True},
    sagemaker_session=sagemaker_session)

ChaiBapchya • Jul 01 '20 20:07

@ChuyangDeng @laurenyu

ChaiBapchya • Jul 01 '20 20:07

Discussed with @ChaiBapchya offline; https://github.com/aws/sagemaker-training-toolkit/blob/master/src/sagemaker_training/mpi.py#L43 might be what he needs.

However, it's difficult to find an example of how args is used.

chuyang-deng • Jul 01 '20 20:07

The documentation for this is missing. Can someone help add it? Maybe whoever is on call?

ChaiBapchya • Jul 02 '20 01:07