sagemaker-tensorflow-training-toolkit
Model deployment is failing with the error "The primary container for production variant AllTraffic did not pass the ping health check."
I'm trying to deploy a custom Word2Vec model that I trained offline as a SageMaker endpoint. I followed the example at https://github.com/awslabs/amazon-sagemaker-examples/tree/master/advanced_functionality/scikit_bring_your_own to create the Dockerfile and the rest of the container.
I've added the following to the Dockerfile: # ENTRYPOINT ["python3", "/usr/local/bin/predictor.py"]
Looking at the logs, I can see that this code runs and the model loads, but the endpoint never comes up and deployment fails with the error "The primary container for production variant AllTraffic did not pass the ping health check."
Any help?
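For context on what the ping health check expects: a SageMaker hosting container has to listen on port 8080, answer `GET /ping` with HTTP 200, and accept `POST /invocations`. A model that loads successfully but never starts such a server would fail in the way described. The following is a minimal sketch using only the Python standard library; `InferenceHandler` and `make_server` are hypothetical names, the response payload is a placeholder, and a real `predictor.py` would run the Word2Vec model instead.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


class InferenceHandler(BaseHTTPRequestHandler):
    """Hypothetical handler; a real predictor would call the loaded model."""

    def do_GET(self):
        # SageMaker probes GET /ping; a 200 response marks the container healthy.
        if self.path == "/ping":
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", "2")
            self.end_headers()
            self.wfile.write(b"{}")
        else:
            self.send_error(404)

    def do_POST(self):
        # Inference requests arrive as POST /invocations.
        if self.path == "/invocations":
            length = int(self.headers.get("Content-Length", 0))
            payload = self.rfile.read(length)
            # Placeholder response: echo the payload size instead of a prediction.
            body = json.dumps({"received_bytes": len(payload)}).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)


def make_server(port=8080):
    # Bind to all interfaces so the health check can reach the server.
    return HTTPServer(("0.0.0.0", port), InferenceHandler)


# The container's serving script would end with something like:
# make_server().serve_forever()
```

The scikit_bring_your_own example serves through nginx, gunicorn, and Flask instead; the plain `http.server` version here is only meant to show the two routes.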
Hi @vishwath96, are you able to share your logs and the full stack trace?
Hi, I am having the same error. I am deploying my own dlib model. The CloudWatch log shows the following. What does it mean?
2022/06/15 21:08:37 [error] 19#19: *1 js: failed ping{ "error": "Servable not found for request: Latest(persona-id)" }
Kindly help. Thank you.
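A note on that log line: "Servable not found" is a TensorFlow Serving message, and it usually indicates that no model named `persona-id` was loaded. TensorFlow Serving looks for numeric version subdirectories under the model's base path, each holding a SavedModel (`saved_model.pb`). A quick way to check an extracted model archive might look like the sketch below; `find_servable_versions` is a hypothetical helper, and the exact layout expected inside `model.tar.gz` is an assumption to verify against the serving container's documentation. A dlib model is not a SavedModel, so it may simply not be servable by this container.

```python
import os


def find_servable_versions(model_base_path):
    """Return the numeric version directories that contain a saved_model.pb."""
    versions = []
    if not os.path.isdir(model_base_path):
        return versions
    for entry in os.listdir(model_base_path):
        candidate = os.path.join(model_base_path, entry)
        # TensorFlow Serving only considers directories whose names are integers.
        if entry.isdigit() and os.path.isfile(os.path.join(candidate, "saved_model.pb")):
            versions.append(int(entry))
    return sorted(versions)
```

An empty list for the `persona-id` directory would be consistent with the failed ping shown above.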
@ajaykarpur I followed your notebook, which was helpful, but it fails at deployment too. Here's my stack trace. All help will be appreciated; I've been blocked on this for a while now. I don't understand how the model directory can be read-only when the .pkl file is dumped to S3 perfectly fine. But when I try to deploy it with
from sagemaker.predictor import csv_serializer
predictor = tree.deploy(1, "ml.m4.xlarge", serializer=csv_serializer)
I run into this error.
Starting the training.
Traceback (most recent call last):
File "/opt/ml/train", line 55, in train
with open(os.path.join(model_path, 'decision-trees.pkl'), 'wb') as out:
Traceback (most recent call last): File "/opt/ml/train", line 55, in train with open(os.path.join(model_path, 'decision-trees.pkl'), 'wb') as out:
OSError: [Errno 30] Read-only file system: '/opt/ml/model/decision-trees.pkl'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/opt/ml/train", line 72, in <module>
train()
File "/opt/ml/train", line 64, in train
with open(os.path.join(output_path, 'failure'), 'w') as s:
FileNotFoundError: [Errno 2] No such file or directory: '/opt/ml/output/failure'
Hi. To resolve it: inside the Docker container, the "/opt/ml/output/" directory should contain a file named failure. This error occurs because the training is failing for some reason.
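The second exception in the trace (`FileNotFoundError` on `/opt/ml/output/failure`) suggests the output directory did not exist when the script tried to record the failure, which hides the original error. A minimal sketch of a failure writer that creates the directory first is shown below; `write_failure` is a hypothetical helper, not part of the example repository, and it assumes the output location is writable.

```python
import os
import traceback


def write_failure(exc, output_path="/opt/ml/output"):
    """Record a training failure, creating the output directory if needed."""
    # Creating the directory first avoids a second exception masking the first.
    os.makedirs(output_path, exist_ok=True)
    failure_file = os.path.join(output_path, "failure")
    with open(failure_file, "w") as f:
        f.write("Exception during training: " + str(exc) + "\n")
        f.write(traceback.format_exc())
    return failure_file
```

It would be called from the `except` block of the training function, so that `traceback.format_exc()` captures the active exception.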
@birla8319 the error is this statement: OSError: [Errno 30] Read-only file system: '/opt/ml/model/decision-trees.pkl'
That is what puzzles me in my CloudWatch logs. The model pickle files are dumped in S3, and I don't see any /opt/ml/output/failure results dumped in S3 either.
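One possible reading of the trace, offered as a hypothesis and not a confirmed diagnosis: SageMaker starts the container with the argument `train` for training jobs and `serve` for endpoints, and in hosting the container only has read-only access to `/opt/ml/model`. If "Starting the training." shows up during `deploy`, the container may be running the training code regardless of the argument, which would explain both the read-only error and the missing `/opt/ml/output` directory. A custom entrypoint could dispatch on the argument, roughly as sketched here; `select_mode` and `run` are hypothetical names.

```python
import sys


def select_mode(argv):
    """Return 'train' or 'serve' based on the first container argument."""
    if len(argv) > 1 and argv[1] in ("train", "serve"):
        return argv[1]
    raise SystemExit(f"expected 'train' or 'serve', got {argv[1:]}")


def run(argv, train_fn, serve_fn):
    """Call the training or serving function that matches the requested mode."""
    mode = select_mode(argv)
    # Training writes to /opt/ml/model; serving should only read from it.
    return train_fn() if mode == "train" else serve_fn()


# An entrypoint script would call, for example:
# run(sys.argv, train_fn=train, serve_fn=start_server)
```

In the scikit_bring_your_own layout the same effect comes from having separate `train` and `serve` executables on the PATH and no fixed ENTRYPOINT.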