sagemaker-python-sdk

SageMaker endpoint vanishing without a trace

Open danielcavalli opened this issue 1 year ago • 2 comments

Describe the bug I'm using SageMaker to host a custom ML model deployed to two accounts, homologation (testing) and production. Both endpoints use the same entry point code and were deployed on the same day. The homologation endpoint suddenly disappeared on June 28th, leaving no trace other than its last health check ping in CloudWatch. I searched the CloudTrail logs to see what could have happened and found nothing out of the ordinary: there was the entry for deploying the endpoint and nothing else. No delete command from anywhere. I assumed it was a one-off bug and redeployed the model on June 29th, expecting it not to happen again. On July 3rd the endpoint vanished again in the same way: no delete, no update, and no rename in CloudTrail. The only proof that it was ever running is the CloudWatch logs and the CreateEndpoint entry in CloudTrail.
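For reference, the CloudTrail search described above can be sketched roughly as follows. This is a minimal sketch, not the exact query I ran: the helper names and the endpoint name in the usage note are hypothetical, and it assumes that the raw CloudTrail record stores the endpoint name under `requestParameters.endpointName`. Note that the CloudTrail event history lookup only covers the last 90 days of management events.

```python
import json
from datetime import datetime, timedelta, timezone

# Event names that create, replace, or remove an endpoint.
SUSPECT_EVENTS = ("CreateEndpoint", "UpdateEndpoint", "DeleteEndpoint")


def mentions_endpoint(event, endpoint_name):
    """Return True if a CloudTrail event refers to the given endpoint.

    Assumption: the raw record (the JSON string under 'CloudTrailEvent')
    carries the name in requestParameters.endpointName.
    """
    raw = json.loads(event.get("CloudTrailEvent") or "{}")
    params = raw.get("requestParameters") or {}
    return params.get("endpointName") == endpoint_name


def find_endpoint_events(endpoint_name, days=14, region_name=None):
    """Collect create/update/delete events for one endpoint, oldest first."""
    import boto3  # imported lazily so mentions_endpoint works without AWS access

    client = boto3.client("cloudtrail", region_name=region_name)
    paginator = client.get_paginator("lookup_events")
    end = datetime.now(timezone.utc)
    start = end - timedelta(days=days)

    hits = []
    for name in SUSPECT_EVENTS:
        pages = paginator.paginate(
            LookupAttributes=[{"AttributeKey": "EventName", "AttributeValue": name}],
            StartTime=start,
            EndTime=end,
        )
        for page in pages:
            hits.extend(e for e in page["Events"] if mentions_endpoint(e, endpoint_name))
    return sorted(hits, key=lambda e: e["EventTime"])
```

Calling `find_endpoint_events("my-homolog-endpoint")` (a placeholder name) with credentials for the affected account should list every matching event in the window. In my case the equivalent search only turned up the CreateEndpoint entry.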

To reproduce I couldn't reproduce the bug deliberately, and I couldn't gather any evidence that points to its cause.

Expected behavior The endpoint should stay in service until someone explicitly deletes it.

Screenshots or logs

System information A description of your system. Please provide:

  • SageMaker Python SDK version: 2.72.1
  • Framework name (eg. PyTorch) or algorithm (eg. KMeans): SKLearn
  • Framework version: 0.23
  • Python version: 3.6
  • CPU or GPU: CPU (ml.t2.medium)
  • Custom Docker image (Y/N): N

Additional context Some more information:

  • The homologation (testing) endpoint wasn't called very often and there were long gaps between calls, since requests only happened when we were testing something.
  • It was deployed from an AWS SageMaker notebook using the SageMaker SDK for Python.
  • The first time, the gap between the last request and the endpoint going offline was 8 hours; the second time it was 48 hours.

danielcavalli, Jul 19 '22 15:07

Would love some help on this (and I'm happy to help fix it, if it turns out to be a bug)!

danielcavalli, Jul 19 '22 15:07

up

danielcavalli, Jul 29 '22 13:07