amazon-sagemaker-examples

sagemaker endpoint can't be deleted if it is stuck in creating state due to resource limit

Open rbavery opened this issue 2 years ago • 39 comments

Please address this issue. If a SageMaker endpoint deployment hits a resource limit, it gets stuck forever and there is no option to delete it: https://stackoverflow.com/questions/65678237/sagemaker-endpoint-stuck-at-creating

rbavery avatar May 27 '22 19:05 rbavery

Hi! This is happening to me right now

Rebecasarai avatar Jul 13 '22 11:07 Rebecasarai

Currently experiencing a version of this too.

In my case, it is because the Docker host platform is wrong and the Python server cannot start properly.

jonasdebeukelaer avatar Jul 15 '22 14:07 jonasdebeukelaer

I was trying to create a SageMaker endpoint using the Terraform CLI and pressed Ctrl+C halfway through, while the endpoint was being created, because I found some errors in my Terraform script. Now my endpoint is stuck in Creating status and I have no way of deleting it.

Has anyone found a solution to this?

chiayiffg avatar Sep 19 '22 08:09 chiayiffg

Hi, I am facing the same issue. Has anyone found a solution to this? Thanks.

jainalphin99 avatar Oct 04 '22 12:10 jainalphin99

Hi @jainalphin99, I went back to my AWS UI after 1 day and the status of the endpoint had automatically changed from Creating to Failed, after which I could delete the endpoint. Maybe try deleting it tomorrow and work with a new endpoint for now.

chiayiffg avatar Oct 05 '22 04:10 chiayiffg

This is so annoying when it happens. It kills 20 min or more just for cases where you misconfigured the Docker host.

asosnovsky-sumologic avatar Dec 08 '22 22:12 asosnovsky-sumologic

For us it happened because of missing model or endpoint configuration, which caused it to get stuck on Creating, failing only after 1 hour! There needs to be some way to force delete endpoints, because it can't be that misconfigurations cause a blocker for such a long time.

We also saw this happen when using serverless endpoints with correct configurations and everything. Not sure what's going on, but it stops us from using SageMaker in our CD to start endpoints on demand.

zionsofer avatar Feb 15 '23 10:02 zionsofer

I'm seeing the same thing. In the logs, it keeps trying and failing to install the same package over and over.

david-waterworth avatar Feb 21 '23 00:02 david-waterworth

I experienced the same as @asosnovsky-sumologic and @zionsofer: whenever I accidentally forget to install a certain package, or misconfigure an AWS role my code needs, I have to wait close to 20 mins for the status to become Failed before I can redeploy.

@david-waterworth From what I observe, when you first deploy the endpoint, SageMaker tries to call your /ping endpoint. Even if something is failing, the /ping health check continues for 20 mins (in my experience) before the endpoint gives up and returns Failed status.

To unblock yourself, I suggest working with a new endpoint under a different name first, because waiting 20 mins to redeploy each change is quite time-consuming and demotivating.
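For cleaning up the old endpoint once it finally leaves Creating, here is a minimal sketch that polls the status and deletes as soon as deletion is allowed. It assumes `client` is a boto3 SageMaker client (for example `boto3.client("sagemaker")`); the function name, polling interval and timeout are hypothetical choices for illustration, not anything prescribed by AWS.

```python
import time


def delete_when_no_longer_creating(client, endpoint_name, poll_seconds=30,
                                   max_wait_seconds=3600, sleep=time.sleep):
    """Poll until the endpoint leaves Creating, then delete it.

    `client` is assumed to be a boto3 SageMaker client.
    Returns the status seen just before deletion, or None if the
    endpoint was still Creating when the time budget ran out.
    """
    waited = 0
    while waited <= max_wait_seconds:
        status = client.describe_endpoint(EndpointName=endpoint_name)["EndpointStatus"]
        if status != "Creating":
            # In this thread, deletion started working once the status became Failed.
            client.delete_endpoint(EndpointName=endpoint_name)
            return status
        sleep(poll_seconds)
        waited += poll_seconds
    return None
```

This does not shorten the wait. It only removes the need to keep checking the console while you carry on with a differently named endpoint.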

chiayiffg avatar Feb 21 '23 08:02 chiayiffg

Agreed. Why can't a signal be sent to the server to terminate the running processes and put it into a failed state? It's currently trying the same thing over and over again.

rbp15 avatar Feb 23 '23 19:02 rbp15

@chiayiffg That is what I do, but when you iterate like this, you end up creating something like 10 endpoints that you then have to remember to delete after some time.
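One way to avoid having to remember is to give the throwaway endpoints a common name fragment and sweep them up later. A rough sketch, assuming `client` is a boto3 SageMaker client and that the endpoints share a substring such as `dev-try` (a hypothetical naming convention); only Failed endpoints are deleted unless other statuses are passed explicitly.

```python
def delete_leftover_endpoints(client, name_contains, statuses=("Failed",)):
    """Delete endpoints whose name contains `name_contains` and whose
    status is one of `statuses`. Returns the names that were deleted.

    `client` is assumed to be a boto3 SageMaker client.
    """
    names = []
    for status in statuses:
        kwargs = {"NameContains": name_contains, "StatusEquals": status}
        while True:
            page = client.list_endpoints(**kwargs)
            names.extend(ep["EndpointName"] for ep in page["Endpoints"])
            token = page.get("NextToken")
            if not token:
                break
            kwargs["NextToken"] = token  # fetch the next page of results

    # Delete only after listing is complete, so pagination is not disturbed.
    for name in names:
        client.delete_endpoint(EndpointName=name)
    return names
```

Endpoints still in Creating are not matched by the default filter, so they have to be swept up on a later run.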

asosnovsky-sumologic avatar Feb 23 '23 19:02 asosnovsky-sumologic

@chiayiffg I think that happens for certain kinds of failures. For me, because it gets stuck in an infinite loop building the container, I don't think it ever does the health ping. I think there's another timeout that's longer than 20 mins (maybe an hour). I've logged an issue for my specific problem, so hopefully it'll be addressed one day.

david-waterworth avatar Feb 23 '23 21:02 david-waterworth

+1, because it can get stuck when the endpoint uses Graviton2 without an arm64 Docker image.

grraffe avatar Mar 15 '23 11:03 grraffe

+1 no way to interrupt creating an endpoint is really a waste of time. Feels like it is a way to get more money from users waiting for timeout to finish.

ylhsieh avatar Apr 16 '23 07:04 ylhsieh

+1 currently stuck at Creating for 20+ minutes because it can't find a pip package (due to a typo). Can't stop, can't delete... The more I use SageMaker, the more I'm shocked at how much obvious basic functionality is missing.

cceyda avatar Apr 24 '23 19:04 cceyda

It is quite annoying, and I have to delete the endpoint config in order to make it fail before the 20 mins are up.

em-eman avatar Apr 27 '23 14:04 em-eman

Just got in the same situation.. I think it is a critical feature to implement!!

Rhuax avatar May 09 '23 23:05 Rhuax

Would really love for this to be added. Currently stuck in Creating status right now. :(

bjmrevilla avatar May 17 '23 06:05 bjmrevilla

+1

danb27 avatar Jun 16 '23 07:06 danb27

I tried creating a SageMaker endpoint from the notebook instance. It was stuck for ~30 mins, so I sent a keyboard interrupt in the notebook cell to stop the process. Now it is stuck in Creating status.

RahulJana avatar Jun 22 '23 06:06 RahulJana

Still the same issue. Couldn't agree more with @ylhsieh: this is probably an attempt at scraping an extra penny from the customer. Otherwise I don't see why there cannot be a kill switch; this doesn't depend on anything except the model and the modelConfig, which are both static resources.

nisalupendra avatar Jul 27 '23 14:07 nisalupendra

Yup same here! really sucks -- I assume the timeout on the creating => failed is in part based on the parameter "container_startup_health_check_timeout" (at least in my case since I'm trying to deploy a fine-tuned LLaMA as a HuggingFacePredictor) ... still they should do something about it! been waiting here for about 45mins + smh
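For anyone creating the endpoint config through boto3 rather than the SageMaker Python SDK, the corresponding production variant field is `ContainerStartupHealthCheckTimeoutInSeconds`. Below is a small sketch that builds the arguments for `create_endpoint_config`. The helper name, the variant name and the default of 600 seconds are hypothetical, and whether a lower value actually shortens the time spent in Creating is only my assumption from what I observed above.

```python
def build_endpoint_config(config_name, model_name, instance_type,
                          startup_timeout_seconds=600):
    """Build keyword arguments for a boto3 create_endpoint_config call.

    The 60-3600 second range is what the API reference lists for this
    field at the time of writing; check the current docs before relying on it.
    """
    if not 60 <= startup_timeout_seconds <= 3600:
        raise ValueError("startup_timeout_seconds must be between 60 and 3600")
    return {
        "EndpointConfigName": config_name,
        "ProductionVariants": [
            {
                "VariantName": "AllTraffic",  # hypothetical variant name
                "ModelName": model_name,
                "InstanceType": instance_type,
                "InitialInstanceCount": 1,
                # Time the container is given to start passing health checks.
                "ContainerStartupHealthCheckTimeoutInSeconds": startup_timeout_seconds,
            }
        ],
    }
```

It would be used as `client.create_endpoint_config(**build_endpoint_config(...))`, with `client` being a boto3 SageMaker client.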

bcarsley avatar Aug 17 '23 22:08 bcarsley

It does seem like deleting the model that the deployment is based on frees up the resources slightly faster than waiting for "Failed": deleting the model that a deployment in the "Creating" status is working from throws an ARN error, which then lets you delete the deployment (at least in my case)... a very risky and probably inadvisable workaround!

Update: the key is actually to remove both the CloudWatch logs and the model itself ... still a hacky workaround, but it does successfully cut short failing runs (AWS seems to monitor health via CloudWatch log streams so deleting that seems to help expedite the whole process) ... still they need to fix this!

bcarsley avatar Aug 17 '23 22:08 bcarsley

It doesn't seem like sagemaker is made for production systems honestly.

davidshhh avatar Oct 25 '23 06:10 davidshhh

Same issue. Can the SageMaker team fix this? It's really annoying

orcaman avatar Oct 25 '23 10:10 orcaman

+1

I created a model, endpoint config and endpoint via terraform apply, and since the endpoint was taking forever to create I simply ran terraform destroy on everything. However, it is impossible to delete the endpoint (removing the model and endpoint config went through).

This happened to me multiple times in the past.

jankrepl avatar Nov 15 '23 12:11 jankrepl

How did you solve this problem?

I have a SageMaker endpoint that is "InService" with an InferenceComponent that is stuck in the "Creating" state. In order to delete the endpoint (to avoid charges for the underlying instance) I need to delete the InferenceComponent, but I can't while it is in the "Creating" state. It's been like that for almost 2 days now, so it is clearly bugged. In the meantime I'm charged almost 2 USD per hour.

I can't delete it via CLI, nor via AWS Console, nor via SageMaker Studio... -.-

See below

aws sagemaker list-endpoints

{
    "Endpoints": [
        {
            "EndpointName": "llama2-endpoint",
            "EndpointArn": "arn:aws:sagemaker:eu-central-1:592532275118:endpoint/llama2-endpoint",
            "CreationTime": "2023-12-19T16:03:34.976000+01:00",
            "LastModifiedTime": "2023-12-19T16:05:22.443000+01:00",
            "EndpointStatus": "InService"
        }
    ]
}

aws sagemaker list-inference-components

{
    "InferenceComponents": [
        {
            "CreationTime": "2023-12-19T16:22:21.345000+01:00",
            "InferenceComponentArn": "arn:aws:sagemaker:eu-central-1:592532275118:inference-component/llama2-7b-20231219-152221",
            "InferenceComponentName": "llama2-7b-20231219-152221",
            "EndpointArn": "arn:aws:sagemaker:eu-central-1:592532275118:endpoint/llama2-endpoint",
            "EndpointName": "llama2-endpoint",
            "VariantName": "variant-1",
            "InferenceComponentStatus": "Creating",
            "LastModifiedTime": "2023-12-19T16:22:22.333000+01:00"
        }
    ]
}

If I try to delete it:

aws sagemaker delete-inference-component --inference-component-name "llama2-7b-20231219-152221"

the error returned is:

An error occurred (ValidationException) when calling the DeleteInferenceComponent operation: Cannot delete inference component "arn:aws:sagemaker:eu-central-1:592532275118:inference-component/llama2-7b-20231219-152221" while it is in state "CREATE_IN_PROGRESS".
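To put a number on how long a component has been stuck (useful when opening a support case), here is a small sketch that reads the JSON printed by `aws sagemaker list-inference-components` and reports components still in Creating after a threshold. The function name and the one-hour default are hypothetical; it relies only on the fields visible in the output above.

```python
import json
from datetime import datetime, timezone


def stuck_components(cli_json, now=None, min_age_hours=1.0):
    """Return (name, age_in_hours) for components still in Creating.

    `cli_json` is the JSON text printed by
    `aws sagemaker list-inference-components`.
    """
    now = now or datetime.now(timezone.utc)
    stuck = []
    for comp in json.loads(cli_json)["InferenceComponents"]:
        if comp["InferenceComponentStatus"] != "Creating":
            continue
        # CreationTime carries a UTC offset, so the subtraction is timezone-aware.
        created = datetime.fromisoformat(comp["CreationTime"])
        age_hours = (now - created).total_seconds() / 3600
        if age_hours >= min_age_hours:
            stuck.append((comp["InferenceComponentName"], round(age_hours, 1)))
    return stuck
```

The JSON can be piped in from the CLI or pasted from a saved file. This only reports on stuck components, it does not unblock the deletion.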

MarkoMilos avatar Dec 20 '23 15:12 MarkoMilos

Same issue for me as well. Unable to delete it due to status CREATE_IN_PROGRESS. Is there any option to set the inference component status to Failed, or to delete it from the CLI?

Any solutions, please suggest.

sagar1001 avatar Dec 22 '23 06:12 sagar1001

+1 no way to interrupt creating an endpoint is really a waste of time. Feels like it is a way to get more money from users waiting for timeout to finish.

You are right. A year later, AWS still hasn't fixed this issue.

Seven2Nine avatar Dec 26 '23 09:12 Seven2Nine

I had a similar issue:

  • The inference endpoint has the status of "InService".
  • There is no Model associated.
  • There is no endpoint configuration associated with the "InService" endpoint. Maybe it was removed when I clicked delete on the model.

However, it still cost me $18 after 1 day and may continue costing me the next day. I cannot delete the endpoint even with the CLI. The error message is: "An error occurred (ValidationException) when calling the DeleteEndpoint operation: Cannot delete endpoint with inference component associated. Please delete inference component and try it again."
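The order implied by that error message (inference components first, then the endpoint) can be sketched as a single step that is safe to run repeatedly. This assumes `client` is a boto3 SageMaker client; the function name and the returned dictionary are hypothetical. Components in Creating are left alone, since the API rejects deleting them, and component deletion is asynchronous, so the endpoint is only deleted on a later call once no components remain.

```python
def teardown_step(client, endpoint_name):
    """Run one teardown pass for an endpoint that hosts inference components.

    Call repeatedly (for example once a minute) until the result
    reports endpoint_deleted=True.
    """
    components = []
    kwargs = {"EndpointNameEquals": endpoint_name}
    while True:
        page = client.list_inference_components(**kwargs)
        components.extend(page["InferenceComponents"])
        token = page.get("NextToken")
        if not token:
            break
        kwargs["NextToken"] = token

    if not components:
        # Nothing is attached any more, so the endpoint itself can go.
        client.delete_endpoint(EndpointName=endpoint_name)
        return {"endpoint_deleted": True, "deleted": [], "pending": []}

    deleted, pending = [], []
    for comp in components:
        name = comp["InferenceComponentName"]
        if comp["InferenceComponentStatus"] in ("InService", "Failed"):
            client.delete_inference_component(InferenceComponentName=name)
            deleted.append(name)
        else:
            # Creating, Updating or Deleting: nothing to do but wait.
            pending.append(name)
    return {"endpoint_deleted": False, "deleted": deleted, "pending": pending}
```

If a component never leaves Creating, as reported earlier in this thread, this step will keep listing it under `pending` and the endpoint will stay up.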

zdev24 avatar Dec 27 '23 08:12 zdev24