Getting an error on Distributed Data Parallel training on multiple ml.p4.24xlarge instances
Getting an error while trying to run Distributed Data Parallel training on 2 ml.p4.24xlarge instances. It works fine on a single instance with multiple GPUs, but fails with multiple instances and multiple GPUs on ml.p4.24xlarge.
Warning: Permanently added 'algo-2,10.0.201.125' (ECDSA) to the list of known hosts.#015
[1,7]<stderr>:Environment variable SAGEMAKER_INSTANCE_TYPE is not set
[1,0]<stderr>:Environment variable SAGEMAKER_INSTANCE_TYPE is not set
[1,1]<stderr>:Environment variable SAGEMAKER_INSTANCE_TYPE is not set
[1,2]<stderr>:Environment variable SAGEMAKER_INSTANCE_TYPE is not set
[1,3]<stderr>:Environment variable SAGEMAKER_INSTANCE_TYPE is not set
[1,4]<stderr>:Environment variable SAGEMAKER_INSTANCE_TYPE is not set
[1,5]<stderr>:Environment variable SAGEMAKER_INSTANCE_TYPE is not set
[1,6]<stderr>:Environment variable SAGEMAKER_INSTANCE_TYPE is not set
[1,15]<stderr>:Environment variable SAGEMAKER_INSTANCE_TYPE is not set