Error during Distributed Data Parallel training on multiple ml.p4.24xlarge instances

Open · mani-aiml opened this issue 3 years ago · 1 comment
I am getting an error when running Distributed Data Parallel training on 2 ml.p4.24xlarge instances. Training works fine on a single instance with multiple GPUs, but fails on multiple ml.p4.24xlarge instances with multiple GPUs each. The log shows:

Warning: Permanently added 'algo-2,10.0.201.125' (ECDSA) to the list of known hosts.
[1,7]<stderr>:Environment variable SAGEMAKER_INSTANCE_TYPE is not set
[1,0]<stderr>:Environment variable SAGEMAKER_INSTANCE_TYPE is not set
[1,1]<stderr>:Environment variable SAGEMAKER_INSTANCE_TYPE is not set
[1,2]<stderr>:Environment variable SAGEMAKER_INSTANCE_TYPE is not set
[1,3]<stderr>:Environment variable SAGEMAKER_INSTANCE_TYPE is not set
[1,4]<stderr>:Environment variable SAGEMAKER_INSTANCE_TYPE is not set
[1,5]<stderr>:Environment variable SAGEMAKER_INSTANCE_TYPE is not set
[1,6]<stderr>:Environment variable SAGEMAKER_INSTANCE_TYPE is not set
[1,15]<stderr>:Environment variable SAGEMAKER_INSTANCE_TYPE is not set
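
To narrow this down, it may help to print which environment variables each rank actually sees at the start of the training script. The snippet below is only a diagnostic sketch: `missing_env_vars` is a hypothetical helper, `SAGEMAKER_INSTANCE_TYPE` is the name from the log above, and the `OMPI_COMM_WORLD_*` names are Open MPI variables that assume the job is launched through `mpirun`. Whether the missing variable is the cause of the failure is not established by this log.

```python
import os


def missing_env_vars(names, environ=None):
    """Return the names that are unset or empty in the given environment.

    `environ` defaults to os.environ; passing a plain dict makes the
    helper easy to check without touching the real process environment.
    """
    env = os.environ if environ is None else environ
    return [name for name in names if not env.get(name)]


if __name__ == "__main__":
    # SAGEMAKER_INSTANCE_TYPE is taken from the log message in this issue.
    # The OMPI_* names are an assumption that Open MPI launched the ranks.
    names = [
        "SAGEMAKER_INSTANCE_TYPE",
        "OMPI_COMM_WORLD_RANK",
        "OMPI_COMM_WORLD_SIZE",
        "OMPI_COMM_WORLD_LOCAL_RANK",
    ]
    rank = os.environ.get("OMPI_COMM_WORLD_RANK", "unknown")
    print(f"rank {rank}: unset variables: {missing_env_vars(names)}", flush=True)
```

Running this on both the single-instance and the two-instance job and comparing the output per rank would show whether the variable is missing only in the multi-instance case.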

mani-aiml · Oct 27 '22 19:10