amazon-sagemaker-examples

Error for training job failed. reason: algorithmerror: exit code: 127

Open katreparitosh opened this issue 3 years ago • 2 comments

Hello,

Same as #969

I was training a DistilBERT model on a SageMaker instance using fast-bert, with an ml.p2.xlarge instance for GPU processing.

While fit() is downloading the training image from ECR, I receive the error "/usr/bin/env: ‘python\r’: No such file or directory". See below:

image

At the end of the stack trace, I received the following: error for training job failed. reason: algorithmerror: exit code: 127

image
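A note on the first message: a trailing `\r` after `python` usually means the script's shebang line ends in a Windows-style CRLF, so `env` looks for an interpreter literally named `python\r` and fails. Exit code 127 is the conventional "command not found" status, which fits that picture. A minimal sketch for checking and normalising line endings before rebuilding the image follows. It assumes the entry-point scripts live somewhere under a local directory; the helper names (`has_crlf_shebang`, `to_lf`, `fix_scripts`) are hypothetical and not part of fast-bert or the SageMaker SDK.

```python
from pathlib import Path


def has_crlf_shebang(data: bytes) -> bool:
    """Return True if the first line is a shebang that ends in a carriage return."""
    first_line = data.split(b"\n", 1)[0]
    return first_line.startswith(b"#!") and first_line.endswith(b"\r")


def to_lf(data: bytes) -> bytes:
    """Convert Windows (CRLF) line endings to Unix (LF)."""
    return data.replace(b"\r\n", b"\n")


def fix_scripts(root: str) -> list:
    """Rewrite files under `root` whose shebang ends in CRLF.

    Returns the paths of the files that were rewritten.
    `root` is an assumed local checkout directory, not a fixed fast-bert path.
    """
    fixed = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        data = path.read_bytes()
        if has_crlf_shebang(data):
            path.write_bytes(to_lf(data))
            fixed.append(str(path))
    return fixed
```

If this turns out to be the cause, running `fix_scripts` on the directory that holds the container's scripts, then rebuilding and pushing the image, should remove the `\r`. Files checked out on Windows with Git's automatic line-ending conversion are a common source of such endings, though that is a guess about this setup.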

Tech Stack:

- fast-bert Docker image
- SageMaker NB Instance - ml.t2.medium
- GPU Compute - ml.p2.xlarge

What could be the reason for this error? My IAM role has all the required permissions.

Kindly help.

katreparitosh, Oct 03 '20 12:10