Chaitanya Prakash Bapat
Upon running on latest master ``` 61f01e3c (HEAD -> master, upstream/master, origin/master, origin/HEAD) add chapter author photos ``` Command ``` d2l-en bapac$ d2lbook build html ``` It fails with `IndexError:...
I haven't committed a single line of code to the tensorflow repository. But according to sourcerer.io, I have committed 90 lines of code [which is incorrect] https://sourcerer.io/chaibapchya 0 lines...
Hey, I found this toolkit to be really interesting. I thought it would be great to add Apache MXNet as a potential backend alternative too. Resources: - [MXNet Website](https://mxnet.incubator.apache.org/) -...
**Describe the bug** The custom_mpi_options flag in the sagemaker training toolkit doesn't override the MPI command; instead, it just appends the flags. Logic https://github.com/aws/sagemaker-training-toolkit/blob/c357433d6fdbc43a896b25bd126c46f689ddb73c/src/sagemaker_training/mpi.py#L185-L188 **To reproduce** ``` mpi_options = '-verbose -x...
**Describe the bug** Currently, the functional test for MPI is skipped https://github.com/aws/sagemaker-training-toolkit/blob/96a941c938eba2e5350c662e4f9575f32cd0caf2/test/functional/test_mpi.py#L33-L36 However, the PR mentioned there was merged long ago https://github.com/aws/sagemaker-python-sdk/pull/559
**Describe the feature you'd like** Pass arguments to the training script while using Horovod via MPI for Distributed training. **Current Situation** ~~Only~~ ProcessRunner supports passing hyperparameters https://github.com/aws/sagemaker-training-toolkit/blob/c357433d6fdbc43a896b25bd126c46f689ddb73c/src/sagemaker_training/process.py#L105-L109 ~~MPIRunner doesn't support...
The Dockerfiles and tests have been migrated to https://github.com/aws/deep-learning-containers so we should clean up the redundant [and hence confusing] files such as - Dockerfiles - test/
The current README gives the wrong impression that, to build an image, one has to use the Dockerfiles present in this repo. However, it looks like the Dockerfiles & tests folder have...
Test integration ``` pytest test/integration/sagemaker/test_horovod.py --docker-base-name sm-tf-horovod-integration --tag latest --framework-version 1.15.0 --processor gpu ``` Error stacktrace: ``` sagemaker.exceptions.UnexpectedStatusException: Error for Training job test-tf-horovod-1591768266-74da: Failed. Reason: AlgorithmError: ExecuteUserScriptError: E Command...
Checklist - [x] I've prepended issue tag with type of change: [bug] - [ ] (If applicable) I've attached the script to reproduce the bug - [ ] (If applicable)...