seldon-core
503 under heavy inference load during model replica change (scaling up model replicas)
If there is a relatively high inference load on the system and the replica count of the model is increased during that workload, some requests can fail with a 503. This happens with a Triton server and the tfsimple model. k6 output for one of the failing requests:
```
http_req_connecting,1668874179,0.000000,,,1503,false,,POST,http://172.18.255.1:80/v2/models/tfsimple/infer,HTTP/1.1,constant_request_rate,,503,,,http://172.18.255.1:80/v2/models/tfsimple/infer,,
http_req_tls_handshaking,1668874179,0.000000,,,1503,false,,POST,http://172.18.255.1:80/v2/models/tfsimple/infer,HTTP/1.1,constant_request_rate,,503,,,http://172.18.255.1:80/v2/models/tfsimple/infer,,
http_req_sending,1668874179,0.071701,,,1503,false,,POST,http://172.18.255.1:80/v2/models/tfsimple/infer,HTTP/1.1,constant_request_rate,,503,,,http://172.18.255.1:80/v2/models/tfsimple/infer,,
http_req_waiting,1668874179,1.489771,,,1503,false,,POST,http://172.18.255.1:80/v2/models/tfsimple/infer,HTTP/1.1,constant_request_rate,,503,,,http://172.18.255.1:80/v2/models/tfsimple/infer,,
http_req_receiving,1668874179,0.073131,,,1503,false,,POST,http://172.18.255.1:80/v2/models/tfsimple/infer,HTTP/1.1,constant_request_rate,,503,,,http://172.18.255.1:80/v2/models/tfsimple/infer,,
http_req_failed,1668874179,1.000000,,,1503,false,,POST,http://172.18.255.1:80/v2/models/tfsimple/infer,HTTP/1.1,constant_request_rate,,503,,,http://172.18.255.1:80/v2/models/tfsimple/infer,,
http_reqs,1668874179,1.000000,,,1503,false,,POST,http://172.18.255.1:80/v2/models/tfsimple/infer,HTTP/1.1,constant_request_rate,,503,,,http://172.18.255.1:80/v2/models/tfsimple/infer,,
```
Steps to reproduce:
- run a Triton server with 2 replicas
- deploy the tfsimple model with 1 replica
- drive HTTP inference load at ~2000 req/sec (a minimal load sketch is shown after this list)
- change the tfsimple replica count from 1 to 2
- observe 503 responses
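A minimal sketch of the load side of the reproduction, in Python rather than the k6 script actually used. It assumes the standard tfsimple tensor metadata (INPUT0/INPUT1, 16 INT32 values each) and the `Seldon-Model` routing header; treat the URL, header, and tensor shapes as assumptions to adjust to your deployment.

```python
# Hedged sketch: POST V2 inference requests to the mesh endpoint and count
# non-200 responses while the model replica count is changed out of band.
# URL, header, tensor names, and shapes are assumptions taken from the logs
# above and the common tfsimple example; adjust them to your deployment.
import time
import requests

URL = "http://172.18.255.1:80/v2/models/tfsimple/infer"
HEADERS = {"Content-Type": "application/json", "Seldon-Model": "tfsimple"}
BODY = {
    "inputs": [
        {"name": "INPUT0", "datatype": "INT32", "shape": [1, 16], "data": list(range(16))},
        {"name": "INPUT1", "datatype": "INT32", "shape": [1, 16], "data": list(range(16))},
    ]
}

failures = 0
for i in range(10_000):  # keep load running while the replica change is applied
    resp = requests.post(URL, json=BODY, headers=HEADERS, timeout=5)
    if resp.status_code != 200:
        failures += 1
        print(f"request {i}: HTTP {resp.status_code}")
    time.sleep(0.0005)
print(f"failed requests: {failures}")
```

A single-threaded loop like this will not reach ~2000 req/sec; k6 with the `constant_request_rate` executor, as in the metrics above, is the realistic way to drive that rate.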
Envoy logs:
```
[2022-11-19T16:09:38.503Z] "POST /v2/models/tfsimple/infer HTTP/1.1" 200 - 210 258 0 0 "-" "k6/0.41.0 (https://k6.io/)" "-" "tfsimple" "10.244.0.10:9001"
[2022-11-19T16:09:39.503Z] "POST /v2/models/tfsimple/infer HTTP/1.1" 200 - 210 258 0 0 "-" "k6/0.41.0 (https://k6.io/)" "-" "tfsimple" "10.244.0.10:9001"
[2022-11-19T16:09:39.503Z] "POST /v2/models/tfsimple/infer HTTP/1.1" 200 - 210 258 0 0 "-" "k6/0.41.0 (https://k6.io/)" "-" "tfsimple" "10.244.0.10:9001"
[2022-11-19T16:09:39.733Z] "POST /v2/models/tfsimple/infer HTTP/1.1" 503 NC 0 0 0 - "-" "k6/0.41.0 (https://k6.io/)" "-" "tfsimple" "-"
[2022-11-19T16:09:39.734Z] "POST /v2/models/tfsimple/infer HTTP/1.1" 503 NC 0 0 0 - "-" "k6/0.41.0 (https://k6.io/)" "-" "tfsimple" "-"
[2022-11-19T16:09:39.734Z] "POST /v2/models/tfsimple/infer HTTP/1.1" 503 NC 0 0 0 - "-" "k6/0.41.0 (https://k6.io/)" "-" "tfsimple" "-"
[2022-11-19T16:09:39.734Z] "POST /v2/models/tfsimple/infer HTTP/1.1" 503 NC 0 0 0 - "-" "k6/0.41.0 (https://k6.io/)" "-" "tfsimple" "-"
[2022-11-19T16:09:40.503Z] "POST /v2/models/tfsimple/infer HTTP/1.1" 200 - 210 258 0 0 "-" "k6/0.41.0 (https://k6.io/)" "-" "tfsimple" "10.244.0.18:9001"
[2022-11-19T16:09:40.503Z] "POST /v2/models/tfsimple/infer HTTP/1.1" 200 - 210 258 0 0 "-" "k6/0.41.0 (https://k6.io/)" "-" "tfsimple" "10.244.0.10:9001"
[2022-11-19T16:09:41.503Z] "POST /v2/models/tfsimple/infer HTTP/1.1" 200 - 210 258 0 0 "-" "k6/0.41.0 (https://k6.io/)" "-" "tfsimple" "10.244.0.10:9001"
```
Note: the `NC` response flag in the Envoy access log means "Upstream cluster not found" (see the Envoy access log response flags documentation).
This can also happen when two models (an inference model and its explainer) are deployed on one server instance: if that instance dies and both models get rescheduled, there is a race condition while they load. If the explainer loads first and calls the inference model before it is ready, we can get 404 / 503 responses.
Observed with MLServer 1.2.0; log excerpt:
```
To access the new field, you can either update the `settings.json` file, or update the `MLSERVER_PARALLEL_WORKERS` environment variable. The current value of the server-level's `parallel_workers` field is '1'.
2023-01-27 16:34:24,507 [mlserver.grpc] ERROR - Predictor failed to be called on x=['Hello world']. Check that `predictor` works with inputs of type List[str].
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/alibi/explainers/anchors/anchor_text.py", line 519, in _transform_predictor
    prediction = predictor(x)
  File "/opt/conda/lib/python3.8/site-packages/mlserver_alibi_explain/explainers/black_box_runtime.py", line 75, in _infer_impl
    self.infer_metadata = remote_metadata(
  File "/opt/conda/lib/python3.8/site-packages/mlserver_alibi_explain/common.py", line 76, in remote_metadata
    raise RemoteInferenceError(response_raw.status_code, response_raw.reason)
mlserver_alibi_explain.errors.RemoteInferenceError: Remote inference call failed with 503, Service Unavailable
```
This specific race condition between the model and its explainer has since been fixed by adding retries.
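Purely as an illustration of that retry pattern (this is not the actual patch in `mlserver_alibi_explain`), a sketch of retrying transient 404/503 responses could look like this; the function and parameter names are hypothetical.

```python
# Illustrative sketch only: retry transient 404/503 responses from the
# inference model so the caller (e.g. an explainer) tolerates the window
# in which the model is not yet routable.
import time
import requests

def infer_with_retry(url: str, payload: dict, attempts: int = 5, backoff: float = 0.5) -> dict:
    """POST to a V2 infer endpoint, retrying on 404/503 with linear backoff."""
    for attempt in range(1, attempts + 1):
        resp = requests.post(url, json=payload, timeout=5)
        if resp.status_code not in (404, 503):
            resp.raise_for_status()  # raise on any other error status
            return resp.json()
        if attempt == attempts:
            resp.raise_for_status()  # out of retries: surface the 404/503
        time.sleep(backoff * attempt)  # back off before the next attempt
```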