
[FEATURE] Replace blocking httpclient with async httpclient in remote inference

Open zane-neo opened this issue 1 year ago • 7 comments

Is your feature request related to a problem? A community user brought up a performance issue here, which reveals a bottleneck in the HttpClient used for remote inference. The prediction flow is illustrated in the attached diagram (blocking-httpclient drawio).

There are two major issues here:

  1. The HttpClient connection pool size is 20 by default, which can cause timeouts while waiting for a connection, as described in https://github.com/opensearch-project/ml-commons/issues/1537.
  2. The blocking HttpClient itself is a bottleneck. The predict thread pool size defaults to 2 * number of vCPUs, which is reasonable for local model prediction (a CPU-bound operation), but remote inference is I/O bound, so this pool is relatively small.

For issue 1, we can allow users to raise the max_connections configuration to handle more parallel predict requests. For issue 2, we could increase the predict thread pool size to raise parallelism, but that is not optimal: more threads mean more context switching, which degrades overall system performance (see the sketch below).
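
To make the bottleneck concrete, here is a minimal, hypothetical sketch of the blocking pattern (illustrative only, not the actual ml-commons code; the pool sizing mirrors the 2 * vCPU default described above, and the endpoint URL and payload are made up):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class BlockingPredictSketch {
    public static void main(String[] args) throws Exception {
        // Pool sized like the default predict thread pool described above: 2 * vCPUs.
        ExecutorService predictPool =
                Executors.newFixedThreadPool(2 * Runtime.getRuntime().availableProcessors());
        HttpClient blockingClient = HttpClient.newHttpClient();
        String endpointUrl = "https://example.com/invocations"; // hypothetical remote endpoint

        Future<String> result = predictPool.submit(() -> {
            HttpRequest request = HttpRequest.newBuilder(URI.create(endpointUrl))
                    .header("Content-Type", "application/json")
                    .POST(HttpRequest.BodyPublishers.ofString("{\"inputs\": \"hello\"}"))
                    .build();
            // The worker thread blocks here for the whole remote round trip, so at most
            // poolSize remote calls can be in flight even though the threads are idle on I/O.
            HttpResponse<String> response =
                    blockingClient.send(request, HttpResponse.BodyHandlers.ofString());
            return response.body();
        });

        System.out.println(result.get());
        predictPool.shutdown();
    }
}
```

Because send() parks the worker thread for the entire round trip, the number of concurrent remote calls is capped by the pool size even though those threads are doing no CPU work.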

What solution would you like? Replace the blocking HttpClient with an async HttpClient. With an async HttpClient, both issues above can be handled well: there's no connection pool in async HttpClient, and we don't need to change the default predict thread pool size, since the async HttpClient performs well with only a few threads. AWS async HttpClient: https://docs.aws.amazon.com/sdk-for-java/latest/developer-guide/http-configuration-crt.html
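
For illustration, here is a rough sketch of what the async direction could look like with the AWS SDK v2 CRT-based async client (a hedged example, not the actual ml-commons connector code; the SageMaker endpoint name, region, and payload are hypothetical):

```java
import java.time.Duration;
import java.util.concurrent.CompletableFuture;

import software.amazon.awssdk.core.SdkBytes;
import software.amazon.awssdk.http.async.SdkAsyncHttpClient;
import software.amazon.awssdk.http.crt.AwsCrtAsyncHttpClient;
import software.amazon.awssdk.regions.Region;
import software.amazon.awssdk.services.sagemakerruntime.SageMakerRuntimeAsyncClient;
import software.amazon.awssdk.services.sagemakerruntime.model.InvokeEndpointRequest;

public class AsyncPredictSketch {
    public static void main(String[] args) {
        // CRT-based async HTTP client; maxConcurrency caps in-flight requests/connections.
        SdkAsyncHttpClient crtClient = AwsCrtAsyncHttpClient.builder()
                .maxConcurrency(100)
                .connectionTimeout(Duration.ofSeconds(10))
                .build();

        // Region and endpoint name below are hypothetical, for illustration only.
        SageMakerRuntimeAsyncClient sageMaker = SageMakerRuntimeAsyncClient.builder()
                .region(Region.US_EAST_1)
                .httpClient(crtClient)
                .build();

        InvokeEndpointRequest request = InvokeEndpointRequest.builder()
                .endpointName("my-embedding-endpoint")
                .contentType("application/json")
                .body(SdkBytes.fromUtf8String("{\"inputs\": \"hello\"}"))
                .build();

        // Returns immediately; the future completes on the SDK's event-loop threads,
        // so no predict-pool thread is blocked while the remote call is in flight.
        CompletableFuture<String> body = sageMaker.invokeEndpoint(request)
                .thenApply(resp -> resp.body().asUtf8String());

        System.out.println(body.join());

        sageMaker.close();
        crtClient.close();
    }
}
```

The invokeEndpoint call returns a CompletableFuture right away, so many requests can be in flight at once without holding predict-pool threads, and maxConcurrency bounds that parallelism.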

What alternatives have you considered? Increasing the predict thread pool size and making it a system setting that users can configure.

Do you have any additional context? NA

zane-neo avatar Jan 05 '24 03:01 zane-neo

@model-collapse @ylwu-amzn @dhrubo-os @austintlee Please chime in.

zane-neo avatar Jan 05 '24 03:01 zane-neo

> there's no connection pool in async HttpClient

Could you please explain why? In the link you provided, I can see maxConcurrency(100) was set for both the async and sync clients.

dhrubo-os avatar Jan 17 '24 11:01 dhrubo-os

A mistake on my part: the async httpclient also has connection pools; the 100 is just an example value.

zane-neo avatar Jan 19 '24 06:01 zane-neo

Benchmark results of replacing the sync httpclient with the async httpclient

Test settings

Common settings

  • Benchmark doc count: 100k
  • One SageMaker endpoint with node type ml.r5.4xlarge. This node has 16 vCPUs, so full CPU utilization corresponds to 1600%.

Sync/async httpclient cluster

  • One data node of type m5.xlarge

Results

Sync httpclient benchmark result

bulk size: 200

Profile result

{
    "models": {
        "DjD5U40BcDj4M4xaapQ-": {
            "target_worker_nodes": [
                "47xFKefyT_yT4ruLRNVysQ"
            ],
            "worker_nodes": [
                "47xFKefyT_yT4ruLRNVysQ"
            ],
            "nodes": {
                "47xFKefyT_yT4ruLRNVysQ": {
                    "model_state": "DEPLOYED",
                    "predictor": "org.opensearch.ml.engine.algorithms.remote.RemoteModel@267e2248",
                    "model_inference_stats": {
                        "count": 100001,
                        "max": 110.231529,
                        "min": 13.642701,
                        "average": 31.23418597099029,
                        "p50": 29.00999,
                        "p90": 41.798205,
                        "p99": 58.968204
                    },
                    "predict_request_stats": {
                        "count": 100001,
                        "max": 6783.721085,
                        "min": 69.897231,
                        "average": 5924.886211584014,
                        "p50": 5986.277747,
                        "p90": 6370.810034,
                        "p99": 6612.146908
                    }
                }
            }
        }
    }
}

Benchmark result

|                                                          Metric |   Task |       Value |   Unit |
|                                                  Segment count |        |         162 |        |
|                                                 Min Throughput |   bulk |       69.28 | docs/s |
|                                                Mean Throughput |   bulk |      258.52 | docs/s |
|                                              Median Throughput |   bulk |      255.26 | docs/s |
|                                                 Max Throughput |   bulk |      353.37 | docs/s |
|                                        50th percentile latency |   bulk |     6462.37 |     ms |
|                                        90th percentile latency |   bulk |      6686.9 |     ms |
|                                        99th percentile latency |   bulk |     6815.03 |     ms |
|                                       100th percentile latency |   bulk |      6845.9 |     ms |
|                                   50th percentile service time |   bulk |     6462.37 |     ms |
|                                   90th percentile service time |   bulk |      6686.9 |     ms |
|                                   99th percentile service time |   bulk |     6815.03 |     ms |
|                                  100th percentile service time |   bulk |      6845.9 |     ms |

bulk size: 800

Profile result

{
    "models": {
        "DjD5U40BcDj4M4xaapQ-": {
            "target_worker_nodes": [
                "47xFKefyT_yT4ruLRNVysQ"
            ],
            "worker_nodes": [
                "47xFKefyT_yT4ruLRNVysQ"
            ],
            "nodes": {
                "47xFKefyT_yT4ruLRNVysQ": {
                    "model_state": "DEPLOYED",
                    "predictor": "org.opensearch.ml.engine.algorithms.remote.RemoteModel@63aef70",
                    "model_inference_stats": {
                        "count": 100000,
                        "max": 120.595407,
                        "min": 13.983231,
                        "average": 31.37104606054,
                        "p50": 29.143390500000002,
                        "p90": 41.9421519,
                        "p99": 58.93499283
                    },
                    "predict_request_stats": {
                        "count": 100000,
                        "max": 26867.224684,
                        "min": 72.765313,
                        "average": 23015.738118801088,
                        "p50": 23926.2854905,
                        "p90": 25538.6665085,
                        "p99": 26391.06232507
                    }
                }
            }
        }
    }
}

Benchmark result

|                                                          Metric |   Task |       Value |   Unit |
|                                                  Segment count |        |         162 |        |
|                                                 Min Throughput |   bulk |        58.4 | docs/s |
|                                                Mean Throughput |   bulk |      267.75 | docs/s |
|                                              Median Throughput |   bulk |      263.69 | docs/s |
|                                                 Max Throughput |   bulk |      320.74 | docs/s |
|                                        50th percentile latency |   bulk |     25708.2 |     ms |
|                                        90th percentile latency |   bulk |     26744.4 |     ms |
|                                        99th percentile latency |   bulk |     27030.1 |     ms |
|                                       100th percentile latency |   bulk |     27085.8 |     ms |
|                                   50th percentile service time |   bulk |     25708.2 |     ms |
|                                   90th percentile service time |   bulk |     26744.4 |     ms |
|                                   99th percentile service time |   bulk |     27030.1 |     ms |
|                                  100th percentile service time |   bulk |     27085.8 |     ms |

Takeaways

Even with a higher bulk size, prediction throughput does not change, but latency increases, which means more queuing is happening. This is confirmed by the profile results' predict_request_stats p90/p99: model inference time stays around 42/59 ms in both runs, while predict request time grows from roughly 6.4/6.6 s at bulk size 200 to roughly 25.5/26.4 s at bulk size 800, so nearly all of the request time is spent waiting in the predict thread pool queue.

Async httpclient benchmark result

bulk size: 200

Profile result

{
    "models": {
        "z7cLVI0BDnAEuuAYMD8k": {
            "target_worker_nodes": [
                "66XjHiy0TluuW0WSu3EsPg"
            ],
            "worker_nodes": [
                "66XjHiy0TluuW0WSu3EsPg"
            ],
            "nodes": {
                "66XjHiy0TluuW0WSu3EsPg": {
                    "model_state": "DEPLOYED",
                    "predictor": "org.opensearch.ml.engine.algorithms.remote.RemoteModel@72193f1d",
                    "model_inference_stats": {
                        "count": 100000,
                        "max": 4298.986295,
                        "min": 927.558941,
                        "average": 3686.23973879843,
                        "p50": 3732.266947,
                        "p90": 3942.6228318999997,
                        "p99": 4051.5752754
                    },
                    "predict_request_stats": {
                        "count": 100000,
                        "max": 4349.452111,
                        "min": 1238.687734,
                        "average": 3689.50676293382,
                        "p50": 3733.923404,
                        "p90": 3943.9578763,
                        "p99": 4053.8236341399997
                    }
                }
            }
        }
    }
}

Benchmark result

|                                                          Metric |   Task |       Value |   Unit |
|                                                  Segment count |        |         177 |        |
|                                                 Min Throughput |   bulk |          44 | docs/s |
|                                                Mean Throughput |   bulk |      372.27 | docs/s |
|                                              Median Throughput |   bulk |      380.18 | docs/s |
|                                                 Max Throughput |   bulk |      383.27 | docs/s |
|                                        50th percentile latency |   bulk |     4123.04 |     ms |
|                                        90th percentile latency |   bulk |     4213.35 |     ms |
|                                        99th percentile latency |   bulk |     4642.52 |     ms |
|                                       100th percentile latency |   bulk |     6369.56 |     ms |
|                                   50th percentile service time |   bulk |     4123.04 |     ms |
|                                   90th percentile service time |   bulk |     4213.35 |     ms |
|                                   99th percentile service time |   bulk |     4642.52 |     ms |
|                                  100th percentile service time |   bulk |     6369.56 |     ms |

SageMaker CPU usage

(Screenshot: SageMaker CPU utilization, 2024-01-29 16:14.) With the async httpclient and a bulk size of 200, SageMaker CPU utilization reaches 1600%, which means the endpoint's CPUs are fully utilized.

Latency comparison

E2E latency also drops by 37% at the same bulk size with the async httpclient ((6686.9 - 4213.35) / 6686.9 ≈ 37%). The reason is that predict tasks no longer wait in the ml-commons predict_thread_pool queue, so that waiting time is eliminated.

sync httpclient

  • bulk size 200 has 90%ile e2e latency: 6686.9 ms

Async httpclient

  • bulk size 200 has 90%ile e2e latency: 4213.35 ms

zane-neo avatar Jan 29 '24 08:01 zane-neo

Looking forward to this improvement! The expected improvements are very promising.

juntezhang avatar Mar 07 '24 14:03 juntezhang

@zane-neo Can you help test whether fine-tuning the thread pool size helps the sync client?

ylwu-amzn avatar Mar 14 '24 03:03 ylwu-amzn

@ylwu-amzn, fine-tuning the thread pool size can definitely improve the sync httpclient's performance, but it is not optimal: threads consume system resources, and more threads increase context-switch overhead, so eventually we would just hit a new performance bottleneck. The async httpclient keeps thread and resource consumption low while sustaining very high throughput, so I think we should go this way.

zane-neo avatar Mar 14 '24 04:03 zane-neo