Pass args to training script entrypoint for MPI-based Distributed training
Describe the feature you'd like
Pass arguments to the training script when using Horovod via MPI for distributed training.
Current Situation
~~Only~~ ProcessRunner supports passing hyperparameters: https://github.com/aws/sagemaker-training-toolkit/blob/c357433d6fdbc43a896b25bd126c46f689ddb73c/src/sagemaker_training/process.py#L105-L109
~~MPIRunner doesn't support it.~~ MPIRunner supports it: https://github.com/aws/sagemaker-training-toolkit/blob/c357433d6fdbc43a896b25bd126c46f689ddb73c/src/sagemaker_training/mpi.py#L41-L45
How would this feature be used? Please describe.
An example API would be:
mpi_options = '-verbose -x orte_base_help_aggregate=0'
estimator = MXNet(
    entry_point='hvd_resnet_mx.sh',
    role=role,
    train_instance_type='ml.p3.8xlarge',
    train_instance_count=2,
    image_name=image,
    framework_version='1.6.0',
    py_version='py3',
    hyperparameters={'sagemaker_mpi_enabled': True,
                     'sagemaker_mpi_custom_mpi_options': mpi_options,
                     'sagemaker_mpi_num_of_processes_per_host': 4},
    sagemaker_session=sagemaker_session)
Where the entry-point script is:
hvd_resnet_mx.sh
! pygmentize hvd_resnet_launcher.sh
./hvd_resnet_mx.py --num-epochs 5
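For context, here is a minimal sketch of how a script like `hvd_resnet_mx.py` could read the `--num-epochs` flag shown above. The parser, its default value, and the `parse_args` name are assumptions for illustration and are not taken from the actual script.

```python
import argparse


def parse_args(argv=None):
    # Hypothetical parser for a training script such as hvd_resnet_mx.py.
    parser = argparse.ArgumentParser(description="Horovod MXNet training (sketch)")
    parser.add_argument("--num-epochs", type=int, default=1,
                        help="number of training epochs")
    # parse_known_args ignores extra arguments a launcher might append,
    # instead of exiting with an error.
    args, _unknown = parser.parse_known_args(argv)
    return args


# Equivalent to invoking: ./hvd_resnet_mx.py --num-epochs 5
args = parse_args(["--num-epochs", "5"])
print(f"training for {args.num_epochs} epochs")
```

Using `parse_known_args` rather than `parse_args` is a defensive choice here, since the exact set of arguments forwarded by the runner is what this issue is asking about.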
Describe alternatives you've considered
~~Right now, one has to use ProcessRunner instead of MPIRunner to pass a bash script for training~~
estimator = MXNet(
    entry_point='hvd_resnet_launcher.sh',
    role=role,
    train_instance_type='ml.p3.8xlarge',
    train_instance_count=2,
    image_name=image,
    framework_version='1.6.0',
    py_version='py3',
    hyperparameters={'sagemaker_parameter_server_enabled': True},
    sagemaker_session=sagemaker_session)
@ChuyangDeng @laurenyu
Discussed with @ChaiBapchya offline; https://github.com/aws/sagemaker-training-toolkit/blob/master/src/sagemaker_training/mpi.py#L43 might be what he needs.
However, it's difficult to find an example showing how args is used.
The documentation for this is missing. Can someone help add it? Maybe the on-call?
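Until that documentation exists, here is a rough sketch of the general idea: user hyperparameters are flattened into `--name value` command-line arguments for the entry point. The helper `hyperparameters_to_args` is hypothetical and is not the toolkit's implementation; the rule that `sagemaker_`-prefixed keys are filtered out, and the `num-epochs` key, are assumptions made for illustration.

```python
def hyperparameters_to_args(hyperparameters):
    """Hypothetical helper: flatten a hyperparameter dict into CLI arguments."""
    args = []
    for name in sorted(hyperparameters):
        # Assumption: keys prefixed with "sagemaker_" configure the platform
        # and are not forwarded to the user script.
        if name.startswith("sagemaker_"):
            continue
        args.extend([f"--{name}", str(hyperparameters[name])])
    return args


hyperparameters = {
    "sagemaker_mpi_enabled": True,
    "sagemaker_mpi_num_of_processes_per_host": 4,
    "num-epochs": 5,  # illustrative user hyperparameter
}
print(hyperparameters_to_args(hyperparameters))
# prints ['--num-epochs', '5']
```

If the runner behaves along these lines, passing `'num-epochs': 5` in `hyperparameters` would remove the need for a wrapper shell script, but that should be confirmed against the linked `mpi.py` source.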