seldon-core
503 under heavy inference load during model replica change (scaling up model replicas)
If there is a relatively high inference load on the system and the replica count of the model is increased during that workload, some requests can fail with a 503. This happens with a Triton server and the tfsimple model. k6 output for one of the failing requests:
```
http_req_connecting,1668874179,0.000000,,,1503,false,,POST,http://172.18.255.1:80/v2/models/tfsimple/infer,HTTP/1.1,constant_request_rate,,503,,,http://172.18.255.1:80/v2/models/tfsimple/infer,,
http_req_tls_handshaking,1668874179,0.000000,,,1503,false,,POST,http://172.18.255.1:80/v2/models/tfsimple/infer,HTTP/1.1,constant_request_rate,,503,,,http://172.18.255.1:80/v2/models/tfsimple/infer,,
http_req_sending,1668874179,0.071701,,,1503,false,,POST,http://172.18.255.1:80/v2/models/tfsimple/infer,HTTP/1.1,constant_request_rate,,503,,,http://172.18.255.1:80/v2/models/tfsimple/infer,,
http_req_waiting,1668874179,1.489771,,,1503,false,,POST,http://172.18.255.1:80/v2/models/tfsimple/infer,HTTP/1.1,constant_request_rate,,503,,,http://172.18.255.1:80/v2/models/tfsimple/infer,,
http_req_receiving,1668874179,0.073131,,,1503,false,,POST,http://172.18.255.1:80/v2/models/tfsimple/infer,HTTP/1.1,constant_request_rate,,503,,,http://172.18.255.1:80/v2/models/tfsimple/infer,,
http_req_failed,1668874179,1.000000,,,1503,false,,POST,http://172.18.255.1:80/v2/models/tfsimple/infer,HTTP/1.1,constant_request_rate,,503,,,http://172.18.255.1:80/v2/models/tfsimple/infer,,
http_reqs,1668874179,1.000000,,,1503,false,,POST,http://172.18.255.1:80/v2/models/tfsimple/infer,HTTP/1.1,constant_request_rate,,503,,,http://172.18.255.1:80/v2/models/tfsimple/infer,,
```
Steps to reproduce:
- run a Triton server with 2 replicas
- deploy the tfsimple model with 1 replica
- drive HTTP inference load at ~2000 req/sec (a minimal load sketch is shown after this list)
- change the tfsimple replica count from 1 to 2
- observe 503 responses
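A minimal sketch of the load side of the reproduction, in Python rather than the k6 script actually used. It assumes the standard tfsimple tensor metadata (INPUT0/INPUT1, 16 INT32 values each) and the `Seldon-Model` routing header; treat the URL, header, and tensor shapes as assumptions to adjust to your deployment.

```python
# Hedged sketch: POST V2 inference requests to the mesh endpoint and count
# non-200 responses while the model replica count is changed out of band.
# URL, header, tensor names, and shapes are assumptions taken from the logs
# above and the common tfsimple example; adjust them to your deployment.
import time
import requests

URL = "http://172.18.255.1:80/v2/models/tfsimple/infer"
HEADERS = {"Content-Type": "application/json", "Seldon-Model": "tfsimple"}
BODY = {
    "inputs": [
        {"name": "INPUT0", "datatype": "INT32", "shape": [1, 16], "data": list(range(16))},
        {"name": "INPUT1", "datatype": "INT32", "shape": [1, 16], "data": list(range(16))},
    ]
}

failures = 0
for i in range(10_000):  # keep load running while the replica change is applied
    resp = requests.post(URL, json=BODY, headers=HEADERS, timeout=5)
    if resp.status_code != 200:
        failures += 1
        print(f"request {i}: HTTP {resp.status_code}")
    time.sleep(0.0005)
print(f"failed requests: {failures}")
```

A single-threaded loop like this will not reach ~2000 req/sec; k6 with the `constant_request_rate` executor, as in the metrics above, is the realistic way to drive that rate.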
Envoy logs:
```
[2022-11-19T16:09:38.503Z] "POST /v2/models/tfsimple/infer HTTP/1.1" 200 - 210 258 0 0 "-" "k6/0.41.0 (https://k6.io/)" "-" "tfsimple" "10.244.0.10:9001"
[2022-11-19T16:09:39.503Z] "POST /v2/models/tfsimple/infer HTTP/1.1" 200 - 210 258 0 0 "-" "k6/0.41.0 (https://k6.io/)" "-" "tfsimple" "10.244.0.10:9001"
[2022-11-19T16:09:39.503Z] "POST /v2/models/tfsimple/infer HTTP/1.1" 200 - 210 258 0 0 "-" "k6/0.41.0 (https://k6.io/)" "-" "tfsimple" "10.244.0.10:9001"
[2022-11-19T16:09:39.733Z] "POST /v2/models/tfsimple/infer HTTP/1.1" 503 NC 0 0 0 - "-" "k6/0.41.0 (https://k6.io/)" "-" "tfsimple" "-"
[2022-11-19T16:09:39.734Z] "POST /v2/models/tfsimple/infer HTTP/1.1" 503 NC 0 0 0 - "-" "k6/0.41.0 (https://k6.io/)" "-" "tfsimple" "-"
[2022-11-19T16:09:39.734Z] "POST /v2/models/tfsimple/infer HTTP/1.1" 503 NC 0 0 0 - "-" "k6/0.41.0 (https://k6.io/)" "-" "tfsimple" "-"
[2022-11-19T16:09:39.734Z] "POST /v2/models/tfsimple/infer HTTP/1.1" 503 NC 0 0 0 - "-" "k6/0.41.0 (https://k6.io/)" "-" "tfsimple" "-"
[2022-11-19T16:09:40.503Z] "POST /v2/models/tfsimple/infer HTTP/1.1" 200 - 210 258 0 0 "-" "k6/0.41.0 (https://k6.io/)" "-" "tfsimple" "10.244.0.18:9001"
[2022-11-19T16:09:40.503Z] "POST /v2/models/tfsimple/infer HTTP/1.1" 200 - 210 258 0 0 "-" "k6/0.41.0 (https://k6.io/)" "-" "tfsimple" "10.244.0.10:9001"
[2022-11-19T16:09:41.503Z] "POST /v2/models/tfsimple/infer HTTP/1.1" 200 - 210 258 0 0 "-" "k6/0.41.0 (https://k6.io/)" "-" "tfsimple" "10.244.0.10:9001"
```
Note: the `NC` response flag in the Envoy access log means "Upstream cluster not found" (see the Envoy access log response flags documentation).
This can also happen when two models (an inference model and its explainer) are deployed on one server instance: if that instance dies and both models get rescheduled, there is a race condition while they load. If the explainer loads first and calls the inference model before it is ready, we can get 404 / 503 responses.
Observed with MLServer 1.2.0; log excerpt:
```
To access the new field, you can either update the `settings.json` file, or update the `MLSERVER_PARALLEL_WORKERS` environment variable. The current value of the server-level's `parallel_workers` field is '1'.
2023-01-27 16:34:24,507 [mlserver.grpc] ERROR - Predictor failed to be called on x=['Hello world']. Check that `predictor` works with inputs of type List[str].
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/alibi/explainers/anchors/anchor_text.py", line 519, in _transform_predictor
    prediction = predictor(x)
  File "/opt/conda/lib/python3.8/site-packages/mlserver_alibi_explain/explainers/black_box_runtime.py", line 75, in _infer_impl
    self.infer_metadata = remote_metadata(
  File "/opt/conda/lib/python3.8/site-packages/mlserver_alibi_explain/common.py", line 76, in remote_metadata
    raise RemoteInferenceError(response_raw.status_code, response_raw.reason)
mlserver_alibi_explain.errors.RemoteInferenceError: Remote inference call failed with 503, Service Unavailable
```
This specific race condition between the model and its explainer has since been fixed by adding retries.
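Purely as an illustration of that retry pattern (this is not the actual patch in `mlserver_alibi_explain`), a sketch of retrying transient 404/503 responses could look like this; the function and parameter names are hypothetical.

```python
# Illustrative sketch only: retry transient 404/503 responses from the
# inference model so the caller (e.g. an explainer) tolerates the window
# in which the model is not yet routable.
import time
import requests

def infer_with_retry(url: str, payload: dict, attempts: int = 5, backoff: float = 0.5) -> dict:
    """POST to a V2 infer endpoint, retrying on 404/503 with linear backoff."""
    for attempt in range(1, attempts + 1):
        resp = requests.post(url, json=payload, timeout=5)
        if resp.status_code not in (404, 503):
            resp.raise_for_status()  # raise on any other error status
            return resp.json()
        if attempt == attempts:
            resp.raise_for_status()  # out of retries: surface the 404/503
        time.sleep(backoff * attempt)  # back off before the next attempt
```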