amazon-sagemaker-examples
[Please Help!!] Error when hosting tensorflow endpoint in script mode
I tried to build my own TensorFlow algorithm in train.py by adapting mnist-2.py (available in amazon-sagemaker-examples/sagemaker-python-sdk/tensorflow_script_mode_training_and_serving/) and passed it as the entry_point, using a pre-built deep learning image. The training job completed with a warning, and deploying the model fails with an error.
Below is the main function in train.py:

```python
if __name__ == "__main__":
    args, unknown = _parse_args()

    train_data, train_labels = _load_training_data(args.train)
    eval_data, eval_labels = _load_testing_data(args.train)

    print('Training model for {} epochs and {} batch size..\n\n'.format(args.epochs, args.batch_size))
    classifier = model(train_data, train_labels, eval_data, eval_labels,
                       epochs=args.epochs, batch_size=args.batch_size)

    if args.current_host == args.hosts[0]:
        # save model in Keras h5 format
        classifier.save(os.path.join(args.sm_model_dir, "nn_model.h5"))
        # save model in SavedModel format
        classifier.save(os.path.join(args.sm_model_dir, "nn_classifier"))
```
The estimator created in the notebook instance is as follows:

```python
estimator = TensorFlow(entry_point='train.py',
                       role=sagemaker.get_execution_role(),
                       distribution={"parameter_server": {"enabled": True}},
                       image_uri='763104351884.dkr.ecr.us-west-2.amazonaws.com/tensorflow-training:2.6.0-cpu-py38-ubuntu20.04',
                       training_steps=100,
                       evaluation_steps=100,
                       instance_count=2,
                       instance_type='ml.m5.4xlarge',
                       hyperparameters={
                           'epochs': EPOCHS,
                           'batch_size': BATCH_SIZE
                       })
```
Here are the problems I encountered:
- Warning: no model artifact is saved under path /opt/ml/model. However, I did find a model.tar.gz in the output folder of this training job in S3.

```
2021-10-29 23:48:58 Uploading - Uploading generated training model
2021-10-29 23:48:58 Completed - Training job completed
2021-10-29 23:48:48,662 sagemaker_tensorflow_container.training INFO     master algo-1 is down, stopping parameter server
2021-10-29 23:48:48,663 sagemaker_tensorflow_container.training WARNING  No model artifact is saved under path /opt/ml/model. Your training job will not save any model files to S3. For details of how to construct your training script see: https://sagemaker.readthedocs.io/en/stable/using_tf.html#adapting-your-local-tensorflow-script
2021-10-29 23:48:48,663 sagemaker-training-toolkit INFO     Reporting training SUCCESS
```
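For context, script mode passes the output directory to the training script as the `SM_MODEL_DIR` environment variable (which points to /opt/ml/model inside the container), and the TensorFlow Serving container looks for a SavedModel under a numeric version subdirectory of it, e.g. /opt/ml/model/1/. A minimal sketch of wiring that up — `export_path` and the `--sm-model-dir` argument name here are illustrative, not part of the original script:

```python
import argparse
import os

# Illustrative sketch: SageMaker uploads the contents of SM_MODEL_DIR
# (/opt/ml/model in the container) as model.tar.gz. TensorFlow Serving
# then expects a SavedModel inside a numeric version subdirectory,
# e.g. /opt/ml/model/1/.
def export_path(model_dir: str, version: int = 1) -> str:
    return os.path.join(model_dir, str(version))

parser = argparse.ArgumentParser()
parser.add_argument("--sm-model-dir", type=str,
                    default=os.environ.get("SM_MODEL_DIR", "/opt/ml/model"))
args, _ = parser.parse_known_args([])  # empty argv for the sketch

print(export_path(args.sm_model_dir))
# classifier.save(export_path(args.sm_model_dir))  # extensionless path => SavedModel format
```

In Keras, saving to a path ending in `.h5` writes an HDF5 file, while an extensionless path writes SavedModel format, which is the layout the serving container reads.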
- When deploying the estimator after `.fit`, by running

```python
predictor = estimator.deploy(initial_instance_count=1, instance_type='ml.m5.xlarge')
```
- It first prints the message

```
update_endpoint is a no-op in sagemaker>=2. See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.
```

I don't understand how `update_endpoint()` could cause an issue given the endpoint is still at the creating stage, nor how to implement the `sagemaker.predictor.Predictor.update_endpoint()` mentioned in the reference link in my `train.py` script or the notebook.
- After half an hour of creating, the endpoint fails with

```
UnexpectedStatusException: Error hosting endpoint tensorflow-training-2021-10-29-23-53-53-781: Failed. Reason: The primary container for production variant AllTraffic did not pass the ping health check. Please check CloudWatch logs for this endpoint.
```

So I also checked the CloudWatch logs for this endpoint, and they all show the same event:

```
exec: "serve": executable file not found in $PATH
```
Can anyone help me with these problems? Many thanks in advance!
Having the same issue here, did you manage to solve it?
Nope, I haven't solved it yet...
Hi guys! I'm having the same problem.
Hi! I come back with the solution, after playing around and reading a lot. I decided to try another image_uri and it works. For training I used the following AWS SageMaker image: `763104351884.dkr.ecr.us-east-1.amazonaws.com/tensorflow-training:2.2.0-gpu-py37-cu102-ubuntu18.04`
It worked and I saved the model artifacts, then I loaded them:

```python
from sagemaker.tensorflow import TensorFlowModel

model = TensorFlowModel(model_data='s3://...../output/model.tar.gz',
                        role=role,
                        image_uri='763104351884.dkr.ecr.us-east-1.amazonaws.com/tensorflow-inference:2.2.0-gpu-py37-cu102-ubuntu18.04')
```
As you can see, I changed the image_uri from training to inference. After that I was able to create the endpoint. I hope this helps you in your projects!
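The substitution above keeps the same account, region, and framework tag and only changes the repository name in the Deep Learning Container URI, so it can be sketched as a one-line string replacement (`training_to_inference_uri` is a hypothetical helper name, not a SageMaker API):

```python
# Hypothetical helper: swap the "tensorflow-training" repository for
# "tensorflow-inference" in a DLC image URI, keeping everything else intact.
def training_to_inference_uri(training_uri: str) -> str:
    return training_uri.replace("tensorflow-training", "tensorflow-inference")

print(training_to_inference_uri(
    "763104351884.dkr.ecr.us-east-1.amazonaws.com/tensorflow-training:2.2.0-gpu-py37-cu102-ubuntu18.04"
))
# -> 763104351884.dkr.ecr.us-east-1.amazonaws.com/tensorflow-inference:2.2.0-gpu-py37-cu102-ubuntu18.04
```

The inference repository ships the `serve` entrypoint that the endpoint's ping health check needs, which the training image lacks — hence the `exec: "serve": executable file not found in $PATH` log above.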
@Janelle-He Hi, I ran into the same error (today!) and fixed it like this. In my case, I was building the .tar.gz file the wrong way.
```python
import tarfile

# before (error: mode "w" writes an uncompressed tar, despite the .tar.gz name)
with tarfile.open(f"api_version/{api_version}_{postfix}.tar.gz", "w") as f:
    ...

# after (fix: "w:gz" actually gzip-compresses the archive)
with tarfile.open(f"api_version/{api_version}_{postfix}.tar.gz", "w:gz", format=tarfile.GNU_FORMAT) as f:
    ...
```