amazon-sagemaker-examples
[Bug Report] ModelError when calling the InvokeEndpoint operation
Link to the notebook https://github.com/aws/amazon-sagemaker-examples/blob/master/introduction_to_amazon_algorithms/imageclassification_caltech/Image-classification-fulltraining-elastic-inference.ipynb
Describe the bug The endpoint normally works without any problems, but occasionally it suddenly stops responding, and every request to it fails with the following error:
An error occurred (ModelError) when calling the InvokeEndpoint operation: Received server error (0) from model with message "Amazon SageMaker could not get a response from the [ENDPOINT NAME] endpoint. This can occur when CPU or memory utilization is high. To check your utilization, see Amazon CloudWatch. To fix this problem, use an instance type with more CPU capacity or memory.". See https://eu-central-1.console.aws.amazon.com/cloudwatch/home?region=eu-central-1#logEventViewer:group=/aws/sagemaker/Endpoints/[ENDPOINT NAME] in account XXXXXXXXXXXX for more information.
The message suggests that the instance is running out of memory or CPU, but I have stress-tested the endpoint and it handles many requests without any problem. I monitored memory and CPU during the stress tests and neither went above 10%. The endpoint logs show that the workers are "abnormally terminated" and that new workers are started, but these are terminated as well. (full logs attached)
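For anyone who wants to check the same utilization figures programmatically rather than in the console, here is a minimal sketch using boto3. It assumes the instance metrics are published under the `/aws/sagemaker/Endpoints` namespace with `EndpointName` and `VariantName` dimensions, and that the variant is named `AllTraffic`; adjust both if your endpoint differs. The helper names `utilization_query` and `fetch_peak` are made up for this example.

```python
from datetime import datetime, timedelta, timezone


def utilization_query(endpoint_name, metric_name, variant_name="AllTraffic",
                      hours=3, now=None):
    """Build keyword arguments for CloudWatch get_metric_statistics.

    metric_name is e.g. "CPUUtilization" or "MemoryUtilization".
    variant_name="AllTraffic" is an assumption; check your endpoint config.
    """
    end = now or datetime.now(timezone.utc)
    return {
        "Namespace": "/aws/sagemaker/Endpoints",
        "MetricName": metric_name,
        "Dimensions": [
            {"Name": "EndpointName", "Value": endpoint_name},
            {"Name": "VariantName", "Value": variant_name},
        ],
        "StartTime": end - timedelta(hours=hours),
        "EndTime": end,
        "Period": 60,               # one datapoint per minute
        "Statistics": ["Maximum"],  # peaks matter more than averages here
    }


def fetch_peak(endpoint_name, metric_name):
    """Return the highest per-minute value in the window, or None if no data."""
    import boto3  # imported lazily so the query builder works without AWS access

    cloudwatch = boto3.client("cloudwatch")
    result = cloudwatch.get_metric_statistics(
        **utilization_query(endpoint_name, metric_name)
    )
    return max((p["Maximum"] for p in result["Datapoints"]), default=None)
```

Calling `fetch_peak("[ENDPOINT NAME]", "MemoryUtilization")` around the time the workers start dying would show whether a short spike is hidden behind a low average.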
I want to be able to handle this error, understand why it is happening, and know how to prevent it.
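On the client side, the error can at least be caught and retried. The following is a minimal sketch, not a fix for the underlying worker crashes: it assumes the failure surfaces as a botocore `ClientError` whose error code is `ModelError` (as in the message above), and the helper names `is_model_error`, `invoke_with_retry` and `make_invoker`, as well as the `application/x-image` content type, are assumptions made for this example.

```python
import time


def is_model_error(exc):
    """True if exc looks like a botocore ClientError with code 'ModelError'."""
    response = getattr(exc, "response", None) or {}
    return response.get("Error", {}).get("Code") == "ModelError"


def invoke_with_retry(invoke, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call invoke(), retrying with exponential backoff on ModelError only."""
    for attempt in range(1, max_attempts + 1):
        try:
            return invoke()
        except Exception as exc:
            # Re-raise anything that is not a ModelError, and the final failure.
            if not is_model_error(exc) or attempt == max_attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))


def make_invoker(endpoint_name, payload, content_type="application/x-image"):
    """Return a zero-argument callable that sends payload to the endpoint."""
    import boto3  # imported lazily so the retry helper works without AWS access

    runtime = boto3.client("sagemaker-runtime")

    def invoke():
        response = runtime.invoke_endpoint(
            EndpointName=endpoint_name,
            ContentType=content_type,
            Body=payload,
        )
        return response["Body"].read()

    return invoke
```

Usage would look like `invoke_with_retry(make_invoker("[ENDPOINT NAME]", image_bytes))`. If the replacement workers keep being terminated, as in the logs described above, retries will eventually give up and re-raise the original error, so this only helps with transient failures.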
To reproduce I have not found a way to reproduce this error; it occurs without warning on the endpoint instance.
Logs Endpoint logs
Also experiencing this error with <10% memory/vCPU usage on the ML instances.
Same here. Could anyone look into it?