
Failed to pull model from storage due to error: unable to download objects in bucket ... context canceled

Open yc2984 opened this issue 1 year ago • 4 comments

Hi, I'm getting this error: Message: Failed to pull model from storage due to error: unable to download objects in bucket '<bucket_name>': failed to write data to file(/models/random-forest-model-170mb-72__isvc-740db5a676/random_forest_model.joblib): from object(MODELS/performance_test_models/random_forest_model.joblib) in bucket(<bucket_name>): context canceled

This error seems to be resolved when I re-deploy the entire helm release, which triggers the re-deployment of the models, but it starts to appear again after running a few performance tests.

At first I thought it was due to the large number of models deployed, ~5000 models of 170 MB each. But the model mentioned in the error message, random-forest-model-170mb-72, is not even the one that was invoked.

I suspected it was due to insufficient volume space, since the models volume is configured to be 1.5 times the memory limit: https://github.com/kserve/modelmesh-serving/blob/3f018571f0dd79bb9905dde6855642c73478c048/controllers/modelmesh/runtime.go#L115 However, I could reproduce it even with a very small number of models, ~100. I have 4 replicas of MLServer, each with a 4Gi memory request and an 8Gi memory limit. This is a much lower load than what I have tested before: https://github.com/kserve/modelmesh-serving/issues/341#issuecomment-1460577628
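For reference, a rough back-of-the-envelope check of that volume sizing (a sketch only; it treats MB as roughly MiB, assumes the emptyDir is 1.5x the 8Gi container memory limit per the linked runtime.go line, and ignores how ModelMesh actually distributes model copies across replicas):

```python
# Rough capacity check for the models emptyDir volume (numbers taken from this issue).
GIB = 1024 ** 3
MIB = 1024 ** 2

memory_limit = 8 * GIB                    # per-replica MLServer memory limit
empty_dir_capacity = 1.5 * memory_limit   # sizing rule referenced above (runtime.go#L115)
model_size = 170 * MIB                    # treating 170 MB as ~170 MiB
num_models = 100                          # the smaller reproduction case
replicas = 4

worst_case = num_models * model_size      # every model cached on a single replica
spread_evenly = worst_case / replicas     # idealized even spread, one copy per model

print(f"emptyDir capacity per replica : {empty_dir_capacity / GIB:.1f} GiB")
print(f"all 100 models on one replica : {worst_case / GIB:.1f} GiB")
print(f"spread over {replicas} replicas        : {spread_evenly / GIB:.1f} GiB")
# ~12 GiB of emptyDir vs ~16.6 GiB worst case and ~4.2 GiB spread evenly, so disk
# exhaustion only looks plausible if one replica ends up caching nearly everything.
```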

Could it be due to the upgrade? We upgraded from v0.9 to v0.10. Any suggestions?

yc2984 commented May 08 '23 19:05

Hello, thanks for including details in your issue report!

If the empty dir's space was being exceeded, I would expect the pod to crash with an error making that clear (such as was observed in https://github.com/kserve/modelmesh-runtime-adapter/issues/48). The context canceled error sounds like it is more likely a timeout being hit. There is a model load timeout for an InferenceService that is set to 90s (REF).

To check if the 90s timeout is being hit, you could look at the logs in the mlserver-adapter container and look at the timestamps for when the model starts loading compared to when the download fails.
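If it helps, a minimal sketch of that log check (it assumes the adapter log lines start with an ISO-8601 timestamp and carry a "modelId" field; the start/failure message markers below are assumptions to adjust against the real log text):

```python
# Sketch: estimate per-model load duration from saved mlserver-adapter logs.
from datetime import datetime
import re

LOAD_START = "Load model"           # assumed marker for the start of a model load
LOAD_FAIL = "Failed to pull model"  # marker seen in the error above

start_times = {}
with open("mlserver-adapter.log") as f:
    for line in f:
        ts_match = re.match(r"(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})", line)
        id_match = re.search(r'"modelId":\s*"([^"]+)"', line)
        if not ts_match or not id_match:
            continue
        ts = datetime.fromisoformat(ts_match.group(1))
        model_id = id_match.group(1)
        if LOAD_START in line:
            start_times[model_id] = ts
        elif LOAD_FAIL in line and model_id in start_times:
            elapsed = (ts - start_times[model_id]).total_seconds()
            print(f"{model_id}: failed after {elapsed:.0f}s"
                  + ("  <-- near the 90s timeout" if elapsed >= 85 else ""))
```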

It sounds like this issue is only observed under heavy load from performance testing?

tjohnson31415 commented May 08 '23 23:05

@tjohnson31415 thanks for replying.

To check if the 90s timeout is being hit, you could look at the logs in the mlserver-adapter container and look at the timestamps for when the model starts loading compared to when the download fails.

I don't think it's being hit, since our gRPC client also has a 90-second timeout, and since the error is reproducible by invoking the model manually when there is no load at all, in which case it fails instantly.

It happens even under moderate load. Below are the results of calling 100 models, 100 times each. I attached the complete log here; please search for "ERROR". The first calls after a new helm release deployment go fine, but once these errors start, performance deteriorates fairly quickly.

{
    "start_time": "2023-05-09 07:34:49",
    "end_time": "2023-05-09 07:36:16",
    "parameters": {
        "number_of_requests": 10000,
        "actual_number_of_requests": 10020,
        "calls_per_model": 100,
        "concurrency": 30,
        "rate_limit_per_worker": 30,
        "model_base_name": "random-forest-model-170mb",
        "feature_size": 673
    },
    "test_results": {
        "successful_requests": 6984,
        "error_requests": 2016,
        "success_rate": 0.6970059880239521,
        "error_types": {
            "StatusCode.INTERNAL": 2016
        },
        "actual_rps": 114.84352311773985,
        "total_duration_seconds": 87.07500195503235,
        "success_latencies": {
            "average": 0.29126213945412793,
            "min": 0.024263282772153616,
            "max": 27.292053163982928,
            "percentile_50": 0.21680533594917506,
            "percentile_90": 0.39713908629491923,
            "percentile_95": 0.48546986506553347,
            "percentile_99": 0.8819075477402659
        },
        "failure_latencies": {
            "average": 0.06883279083431551,
            "min": 0.001514866016805172,
            "max": 30.191851013107225,
            "percentile_50": 0.004282854497432709,
            "percentile_90": 0.01343291497323662,
            "percentile_95": 0.0265065876301378,
            "percentile_99": 0.21781385053181995
        }
    }
}

And this error does not seem to be recoverable, i.e. once it happens to a model, the model stays unavailable forever (or at least for quite a long time) with the exact same error. I can reproduce it by calling the model with grpcurl.
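For illustration, a Python equivalent of that manual call might look roughly like this (a sketch only: it assumes stubs generated from KServe's grpc_predict_v2.proto, a port-forward to the modelmesh-serving gRPC port on localhost:8033, and an FP32 input tensor named "input-0" with the 673-feature shape from the test parameters; the real tensor name and shape depend on the model):

```python
# Sketch: single inference call, roughly equivalent to the grpcurl reproduction.
# Assumes stubs generated with, e.g.:
#   python -m grpc_tools.protoc -I. --python_out=. --grpc_python_out=. grpc_predict_v2.proto
import grpc
import grpc_predict_v2_pb2 as pb
import grpc_predict_v2_pb2_grpc as pb_grpc

channel = grpc.insecure_channel("localhost:8033")   # assumed port-forward to modelmesh-serving
stub = pb_grpc.GRPCInferenceServiceStub(channel)

request = pb.ModelInferRequest(
    model_name="random-forest-model-170mb-0",        # the InferenceService name
    inputs=[
        pb.ModelInferRequest.InferInputTensor(
            name="input-0",                           # assumed tensor name; depends on the model
            datatype="FP32",
            shape=[1, 673],                           # feature_size from the test parameters
            contents=pb.InferTensorContents(fp32_contents=[0.0] * 673),
        )
    ],
)

try:
    response = stub.ModelInfer(request, timeout=90)
    print(response)
except grpc.RpcError as e:
    # The load failure surfaces here as an INTERNAL error with the message above.
    print(e.code(), e.details())
```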

Also, notice that the error is not for the model that was invoked: random-forest-model-170mb-2 vs. random-forest-model-170mb-0.

ERROR MLServer Adapter.MLServer Adapter Server.LoadModel Failed to pull model from storage {"modelId": "random-forest-model-170mb-2__isvc-740db5a676", "error": "rpc error: code = Unknown desc = Failed to pull model from storage due to error: unable to download objects in bucket '<my_bucket>': failed to write data to file(/models/random-forest-model-170mb-0__isvc-740db5a676/random_forest_model.joblib): from object(MODELS/performance_test_models/random_forest_model.joblib) in bucket(<my_bucket>): context canceled"}

I'm also getting this error on the mm container from time to time: Warning Unhealthy 92s (x13 over 3m32s) kubelet Readiness probe failed: HTTP probe failed with statuscode: 503

In the meantime, I'm going to test with an older version of modelmesh and do the same test.

yc2984 commented May 09 '23 08:05

It seems that this error is related to the persistent volume we added to the etcd deployment. When I revert to the version without the persistent volume, performance is back to normal and much more stable, with only some of the usual DEADLINE_EXCEEDED and "8 retry iterations exhausted" errors, same as here: https://github.com/kserve/modelmesh-serving/issues/341#issuecomment-1460577628

Might there be something wrong with our etcd PVC setup? Here I link the etcd deployment, PVC, and PV.
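One way to sanity-check the etcd/PVC theory would be to measure etcd write latency directly and compare the PVC-backed deployment against the ephemeral one (a sketch only; it assumes the python-etcd3 client, a local port-forward to etcd on localhost:2379, and no TLS or auth, which will likely differ in a real setup):

```python
# Sketch: measure etcd put latency to compare PVC-backed vs ephemeral etcd.
# Assumes `pip install etcd3` and something like: kubectl port-forward svc/etcd 2379:2379
import time
import etcd3

client = etcd3.client(host="localhost", port=2379)

samples = []
for i in range(100):
    start = time.perf_counter()
    client.put(f"/latency-test/key-{i}", "x" * 256)  # small write, similar in size to registry updates
    samples.append(time.perf_counter() - start)

samples.sort()
print(f"p50 put latency: {samples[49] * 1000:.1f} ms")
print(f"p99 put latency: {samples[98] * 1000:.1f} ms")
```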

yc2984 commented May 09 '23 20:05

Possibly related issue https://github.com/kserve/modelmesh-runtime-adapter/issues/54 and fix https://github.com/kserve/modelmesh-runtime-adapter/pull/55

We need to re-assess whether this is still a problem.

ckadner commented Jan 19 '24 20:01