sagemaker-tensorflow-training-toolkit

Model deployment is failing with the error "The primary container for production variant AllTraffic did not pass the ping health check."

Open vishwath96 opened this issue 4 years ago • 5 comments

I'm trying to deploy a custom Word2Vec model that I trained offline as a SageMaker endpoint. I followed the documentation at https://github.com/awslabs/amazon-sagemaker-examples/tree/master/advanced_functionality/scikit_bring_your_own to create the Dockerfile and the rest of the container setup.

I've added the following to the Dockerfile: # ENTRYPOINT ["python3", "/usr/local/bin/predictor.py"]

Looking at the logs, I can see that this code runs and that the model loads, but the model doesn't get deployed and the deployment fails with the error "The primary container for production variant AllTraffic did not pass the ping health check."

Any help?

vishwath96 avatar Sep 30 '20 10:09 vishwath96
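
For context on the health check named in the error: a SageMaker hosting container is expected to listen on port 8080 and answer GET /ping with HTTP 200, and to accept inference requests on POST /invocations. Loading the model successfully is not enough if nothing answers /ping. The sketch below is a minimal illustration of that contract using only the Python standard library. It is not the predictor.py from this issue; the class name, the `model` attribute, and the echo-style response are hypothetical.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


class InferenceHandler(BaseHTTPRequestHandler):
    """Hypothetical handler illustrating the /ping and /invocations routes."""

    model = None  # assumption: set to the loaded model object before serving

    def do_GET(self):
        if self.path == "/ping":
            # Report healthy (200) only once a model is loaded; otherwise 503.
            self._reply(200 if self.model is not None else 503, b"")
        else:
            self._reply(404, b"")

    def do_POST(self):
        if self.path == "/invocations":
            length = int(self.headers.get("Content-Length", 0))
            payload = self.rfile.read(length)
            # Placeholder response; a real handler would call the model here.
            body = json.dumps({"received_bytes": len(payload)}).encode()
            self._reply(200, body, "application/json")
        else:
            self._reply(404, b"")

    def _reply(self, status, body, content_type="text/plain"):
        self.send_response(status)
        self.send_header("Content-Type", content_type)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def make_server(host="0.0.0.0", port=8080):
    """Build a server bound to the port SageMaker probes (8080)."""
    return HTTPServer((host, port), InferenceHandler)


def main():
    InferenceHandler.model = object()  # stand-in for the real Word2Vec model
    make_server().serve_forever()
```

Calling `main()` starts the server. Running `curl -i http://localhost:8080/ping` against the container locally is a quick way to see whether the health check would pass before deploying.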

Hi @vishwath96, are you able to share your logs and the full stack trace?

ajaykarpur avatar Oct 07 '20 21:10 ajaykarpur

Hi, I am having the same error. I am deploying my own dlib model. The CloudWatch log shows the line below. What does it mean?

2022/06/15 21:08:37 [error] 19#19: *1 js: failed ping{ "error": "Servable not found for request: Latest(persona-id)" }

Kindly help. Thank you.

jocelynbaduria avatar Jun 15 '22 21:06 jocelynbaduria
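
The "Servable not found" message comes from TensorFlow Serving, which suggests the endpoint is using a TensorFlow Serving based container. TensorFlow Serving only loads a model when it finds a numeric version directory containing a saved_model.pb file (for example persona-id/1/saved_model.pb), so a dlib model that is not exported as a TensorFlow SavedModel would likely need a custom container instead. The sketch below checks a local, unpacked model directory for that layout. The function name and the example path are hypothetical.

```python
from pathlib import Path


def servable_versions(model_dir):
    """Return the sorted numeric versions under model_dir that contain saved_model.pb."""
    root = Path(model_dir)
    if not root.is_dir():
        return []
    versions = []
    for child in root.iterdir():
        # TensorFlow Serving treats only numerically named subdirectories as versions.
        if child.is_dir() and child.name.isdigit() and (child / "saved_model.pb").is_file():
            versions.append(int(child.name))
    return sorted(versions)
```

For example, `servable_versions("extracted/persona-id")` returning an empty list would be consistent with the ping failure in the log, assuming "extracted" is where model.tar.gz was unpacked.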

@ajaykarpur I followed your notebook, which was helpful, but it fails at deployment too. Here's my stack trace. All help will be appreciated; I've been blocked on this for a while now. As for the error itself, I don't understand how the model directory can be read-only when training dumps the .pkl file to S3 perfectly fine. But when I try to deploy it with

from sagemaker.predictor import csv_serializer
predictor = tree.deploy(1, "ml.m4.xlarge", serializer=csv_serializer)

I run into this error.

Starting the training.
Traceback (most recent call last):
  File "/opt/ml/train", line 55, in train
    with open(os.path.join(model_path, 'decision-trees.pkl'), 'wb') as out:

OSError: [Errno 30] Read-only file system: '/opt/ml/model/decision-trees.pkl'

During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/opt/ml/train", line 72, in <module>
    train()
  File "/opt/ml/train", line 64, in train
    with open(os.path.join(output_path, 'failure'), 'w') as s:

FileNotFoundError: [Errno 2] No such file or directory: '/opt/ml/output/failure'

priyakhokher avatar Jul 28 '22 23:07 priyakhokher
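
One possible reading of this trace: SageMaker starts the container with the argument `train` for training jobs and `serve` for hosting, and gives the hosting container read-only access to /opt/ml/model. A log that begins with "Starting the training." during deployment suggests the image may be running its training script when it is asked to serve. The sketch below shows an entrypoint that dispatches on that argument. The function names are hypothetical and the bodies are placeholders for the real train and serve logic.

```python
def run_training():
    # Placeholder: the real training code writes artifacts to /opt/ml/model.
    print("Starting the training.")
    return "train"


def run_serving():
    # Placeholder: the real serving code starts a web server on port 8080
    # and only reads from /opt/ml/model.
    print("Starting the inference server.")
    return "serve"


COMMANDS = {"train": run_training, "serve": run_serving}


def dispatch(argv):
    """Run the handler named by argv[1], or exit with a usage message."""
    if len(argv) < 2 or argv[1] not in COMMANDS:
        program = argv[0] if argv else "entrypoint"
        raise SystemExit(f"usage: {program} [train|serve]")
    return COMMANDS[argv[1]]()
```

A script would call `dispatch(sys.argv)` after `import sys`. Running `docker run <image> serve` locally and checking which message is printed shows which code path the image takes at deploy time.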

Hi, to resolve this: inside the Docker container, the "/opt/ml/output/" directory should contain a file named failure. This error is occurring because the training is failing for some reason.

ankitvirla avatar Aug 02 '22 10:08 ankitvirla

@birla8319 the error is this statement: OSError: [Errno 30] Read-only file system: '/opt/ml/model/decision-trees.pkl', and it is this puzzling message that I see in my CloudWatch logs. The model pickle files are dumped in S3, and I don't see any /opt/ml/output/failure results dumped in S3 either.

priyakhokher avatar Aug 03 '22 16:08 priyakhokher
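
On the second exception in the trace: the FileNotFoundError is raised while the script tries to record the first failure, because /opt/ml/output does not exist in that environment, and it partly hides the original OSError. A minimal sketch of a more defensive failure report is shown below. The function name is hypothetical, and the default path is taken from the traceback above.

```python
import os
import sys
import traceback


def report_failure(exc, output_path="/opt/ml/output"):
    """Write the error to <output_path>/failure if possible and always log it to stderr."""
    trace = "".join(traceback.format_exception(type(exc), exc, exc.__traceback__))
    message = f"Exception during training: {exc}\n{trace}"
    try:
        # Create the directory if it is missing instead of assuming it exists.
        os.makedirs(output_path, exist_ok=True)
        with open(os.path.join(output_path, "failure"), "w") as handle:
            handle.write(message)
    except OSError as write_error:
        # Do not let a reporting problem mask the original error.
        print(f"Could not write failure file: {write_error}", file=sys.stderr)
    # stderr ends up in the CloudWatch logs, so the original error stays visible.
    print(message, file=sys.stderr)
    return message
```

Called from an `except Exception as exc:` block around the training code, this keeps the original "Read-only file system" error as the last thing in the log instead of a secondary FileNotFoundError.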