ml-commons
[FEATURE] Replace blocking httpclient with async httpclient in remote inference
Is your feature request related to a problem?
A community user brought up a performance issue here, which points to a problem in the HttpClient used for remote inference. The prediction flow can be illustrated as below:
There are two major issues here:
- The connection pool size of the HttpClient is 20 by default, which can cause a "timeout waiting for connection" error, as described here: https://github.com/opensearch-project/ml-commons/issues/1537.
- The blocking HttpClient is a bottleneck since the predict thread pool size is 2 * number of vCPUs by default. This is not a big value, which is fine for local model prediction (a CPU-bound operation), but remote inference is IO-bound, so the thread pool size is relatively small for it.

For issue 1, we can let users update the max_connections configuration to handle more parallel predict requests. For issue 2, we can increase the predict thread pool size to raise parallelism, but this is not optimal because more threads cause more context switching and degrade overall system performance.
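To make the bottleneck concrete, here is a minimal, hypothetical sketch (not the actual ml-commons code) of the blocking flow: a fixed predict thread pool of 2 * vCPUs threads, each of which blocks on a pooled sync HTTP call, so in-flight remote calls are capped at min(pool size, max connections). The endpoint URL and payload are placeholders.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import org.apache.http.client.methods.HttpPost;
import org.apache.http.entity.StringEntity;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

public class BlockingPredictSketch {

    public static void main(String[] args) {
        // Thread pool sized like the default predict pool described above: 2 * vCPUs.
        int vcpus = Runtime.getRuntime().availableProcessors();
        ExecutorService predictPool = Executors.newFixedThreadPool(2 * vcpus);

        // Sync client with a small connection pool (20 total), mirroring the
        // default pool size mentioned in issue #1537.
        CloseableHttpClient syncClient = HttpClients.custom()
                .setMaxConnTotal(20)
                .setMaxConnPerRoute(20)
                .build();

        for (int i = 0; i < 1000; i++) {
            predictPool.submit(() -> {
                try {
                    // Placeholder endpoint and payload.
                    HttpPost post = new HttpPost("https://example.com/invocations");
                    post.setEntity(new StringEntity("{\"inputs\":\"hello\"}"));
                    // The worker thread blocks here for the whole round trip.
                    String body = EntityUtils.toString(syncClient.execute(post).getEntity());
                    System.out.println(body.length());
                } catch (Exception e) {
                    e.printStackTrace();
                }
            });
        }
        predictPool.shutdown();
    }
}
```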
What solution would you like? Replace the blocking HttpClient with an async HttpClient. With an async HttpClient, both issues above can be handled well: there's no connection pool in the async HttpClient, and we don't need to change the default predict thread pool size since the async HttpClient has better performance with only a few threads. AWS async HttpClient: https://docs.aws.amazon.com/sdk-for-java/latest/developer-guide/http-configuration-crt.html
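For reference, here is a minimal sketch of what wiring the AWS CRT async HTTP client into a SageMaker invoke call could look like, based on the linked AWS docs. This is not the actual ml-commons implementation; the endpoint name, region, and payload are placeholders, and running it would require valid AWS credentials.

```java
import java.nio.charset.StandardCharsets;
import java.util.concurrent.CompletableFuture;

import software.amazon.awssdk.core.SdkBytes;
import software.amazon.awssdk.http.async.SdkAsyncHttpClient;
import software.amazon.awssdk.http.crt.AwsCrtAsyncHttpClient;
import software.amazon.awssdk.regions.Region;
import software.amazon.awssdk.services.sagemakerruntime.SageMakerRuntimeAsyncClient;
import software.amazon.awssdk.services.sagemakerruntime.model.InvokeEndpointResponse;

public class AsyncPredictSketch {

    public static void main(String[] args) {
        // CRT-based async HTTP client; maxConcurrency bounds in-flight requests,
        // but no worker thread is blocked while a request is in flight.
        SdkAsyncHttpClient crtClient = AwsCrtAsyncHttpClient.builder()
                .maxConcurrency(100)
                .build();

        SageMakerRuntimeAsyncClient sageMaker = SageMakerRuntimeAsyncClient.builder()
                .httpClient(crtClient)
                .region(Region.US_EAST_1) // placeholder region
                .build();

        // "my-embedding-endpoint" is a placeholder endpoint name.
        CompletableFuture<InvokeEndpointResponse> future = sageMaker.invokeEndpoint(r -> r
                .endpointName("my-embedding-endpoint")
                .contentType("application/json")
                .body(SdkBytes.fromString("{\"inputs\":\"hello\"}", StandardCharsets.UTF_8)));

        // The calling thread is free immediately; the callback runs on completion.
        future.whenComplete((resp, err) -> {
            if (err != null) {
                err.printStackTrace();
            } else {
                System.out.println(resp.body().asUtf8String());
            }
        });

        future.join();
        sageMaker.close();
        crtClient.close();
    }
}
```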
What alternatives have you considered? Increasing the predict thread pool size and making it a system setting that is configurable by the user.
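If we went with that alternative, it could look roughly like the sketch below: exposing the predict thread pool size as a dynamic node setting. The setting key, default, and bounds here are hypothetical, not existing ml-commons settings.

```java
import org.opensearch.common.settings.Setting;

public class PredictThreadPoolSettings {

    // Hypothetical setting key and bounds; ml-commons may name and scope this differently.
    public static final Setting<Integer> PREDICT_THREAD_POOL_SIZE = Setting.intSetting(
            "plugins.ml_commons.predict_thread_pool_size",
            2 * Runtime.getRuntime().availableProcessors(), // current default: 2 * vCPUs
            1,                                              // minimum value
            Setting.Property.NodeScope,
            Setting.Property.Dynamic);
}
```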
Do you have any additional context? NA
@model-collapse @ylwu-amzn @dhrubo-os @austintlee Please chime in.
> there's no connection pool in async HttpClient

Could you please explain why? In the link you provided, I can see maxConcurrency(100) was set for both the async and sync clients.
That was a mistake on my part: the async httpclient also has a connection pool; the 100 is just an example value.
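To make the clarification concrete, here is a small illustrative sketch (the values are examples only) showing that both the AWS SDK sync and async clients bound their parallel connections, just through differently named knobs:

```java
import software.amazon.awssdk.http.SdkHttpClient;
import software.amazon.awssdk.http.apache.ApacheHttpClient;
import software.amazon.awssdk.http.async.SdkAsyncHttpClient;
import software.amazon.awssdk.http.crt.AwsCrtAsyncHttpClient;

public class ConnectionLimitExamples {

    public static void main(String[] args) {
        // Sync client: maxConnections caps the connection pool.
        SdkHttpClient sync = ApacheHttpClient.builder()
                .maxConnections(100)
                .build();

        // Async client: maxConcurrency plays the same role, so it also
        // has a bounded connection pool; 100 is just an example value.
        SdkAsyncHttpClient async = AwsCrtAsyncHttpClient.builder()
                .maxConcurrency(100)
                .build();

        sync.close();
        async.close();
    }
}
```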
Benchmark results of replacing the sync httpclient with the async httpclient
Test settings
Common settings
- Benchmark doc count: 100k
- One SageMaker endpoint with node type: ml.r5.4xlarge. This node has 16 vCPUs so the full CPU utilization should be 1600%.
Sync/async httpclient cluster
- One data node of type m5.xlarge
Results
Sync httpclient benchmark result
bulk size: 200
Profile result
{
"models": {
"DjD5U40BcDj4M4xaapQ-": {
"target_worker_nodes": [
"47xFKefyT_yT4ruLRNVysQ"
],
"worker_nodes": [
"47xFKefyT_yT4ruLRNVysQ"
],
"nodes": {
"47xFKefyT_yT4ruLRNVysQ": {
"model_state": "DEPLOYED",
"predictor": "org.opensearch.ml.engine.algorithms.remote.RemoteModel@267e2248",
"model_inference_stats": {
"count": 100001,
"max": 110.231529,
"min": 13.642701,
"average": 31.23418597099029,
"p50": 29.00999,
"p90": 41.798205,
"p99": 58.968204
},
"predict_request_stats": {
"count": 100001,
"max": 6783.721085,
"min": 69.897231,
"average": 5924.886211584014,
"p50": 5986.277747,
"p90": 6370.810034,
"p99": 6612.146908
}
}
}
}
}
}
Benchmark result
| Metric | Task | Value | Unit |
| --- | --- | --- | --- |
| Segment count | | 162 | |
| Min Throughput | bulk | 69.28 | docs/s |
| Mean Throughput | bulk | 258.52 | docs/s |
| Median Throughput | bulk | 255.26 | docs/s |
| Max Throughput | bulk | 353.37 | docs/s |
| 50th percentile latency | bulk | 6462.37 | ms |
| 90th percentile latency | bulk | 6686.9 | ms |
| 99th percentile latency | bulk | 6815.03 | ms |
| 100th percentile latency | bulk | 6845.9 | ms |
| 50th percentile service time | bulk | 6462.37 | ms |
| 90th percentile service time | bulk | 6686.9 | ms |
| 99th percentile service time | bulk | 6815.03 | ms |
| 100th percentile service time | bulk | 6845.9 | ms |
bulk size: 800
Profile result
{
"models": {
"DjD5U40BcDj4M4xaapQ-": {
"target_worker_nodes": [
"47xFKefyT_yT4ruLRNVysQ"
],
"worker_nodes": [
"47xFKefyT_yT4ruLRNVysQ"
],
"nodes": {
"47xFKefyT_yT4ruLRNVysQ": {
"model_state": "DEPLOYED",
"predictor": "org.opensearch.ml.engine.algorithms.remote.RemoteModel@63aef70",
"model_inference_stats": {
"count": 100000,
"max": 120.595407,
"min": 13.983231,
"average": 31.37104606054,
"p50": 29.143390500000002,
"p90": 41.9421519,
"p99": 58.93499283
},
"predict_request_stats": {
"count": 100000,
"max": 26867.224684,
"min": 72.765313,
"average": 23015.738118801088,
"p50": 23926.2854905,
"p90": 25538.6665085,
"p99": 26391.06232507
}
}
}
}
}
}
Benchmark result
| Metric | Task | Value | Unit |
| --- | --- | --- | --- |
| Segment count | | 162 | |
| Min Throughput | bulk | 58.4 | docs/s |
| Mean Throughput | bulk | 267.75 | docs/s |
| Median Throughput | bulk | 263.69 | docs/s |
| Max Throughput | bulk | 320.74 | docs/s |
| 50th percentile latency | bulk | 25708.2 | ms |
| 90th percentile latency | bulk | 26744.4 | ms |
| 99th percentile latency | bulk | 27030.1 | ms |
| 100th percentile latency | bulk | 27085.8 | ms |
| 50th percentile service time | bulk | 25708.2 | ms |
| 90th percentile service time | bulk | 26744.4 | ms |
| 99th percentile service time | bulk | 27030.1 | ms |
| 100th percentile service time | bulk | 27085.8 | ms |
Takeaways
With an even higher bulk size, prediction throughput does not change, but latency increases, which means more queuing is happening. This is confirmed by the predict_request_stats p90/p99 in the profile result.
Async httpclient benchmark result
bulk size: 200
Profile result
{
"models": {
"z7cLVI0BDnAEuuAYMD8k": {
"target_worker_nodes": [
"66XjHiy0TluuW0WSu3EsPg"
],
"worker_nodes": [
"66XjHiy0TluuW0WSu3EsPg"
],
"nodes": {
"66XjHiy0TluuW0WSu3EsPg": {
"model_state": "DEPLOYED",
"predictor": "org.opensearch.ml.engine.algorithms.remote.RemoteModel@72193f1d",
"model_inference_stats": {
"count": 100000,
"max": 4298.986295,
"min": 927.558941,
"average": 3686.23973879843,
"p50": 3732.266947,
"p90": 3942.6228318999997,
"p99": 4051.5752754
},
"predict_request_stats": {
"count": 100000,
"max": 4349.452111,
"min": 1238.687734,
"average": 3689.50676293382,
"p50": 3733.923404,
"p90": 3943.9578763,
"p99": 4053.8236341399997
}
}
}
}
}
}
Benchmark result
| Metric | Task | Value | Unit |
| --- | --- | --- | --- |
| Segment count | | 177 | |
| Min Throughput | bulk | 44 | docs/s |
| Mean Throughput | bulk | 372.27 | docs/s |
| Median Throughput | bulk | 380.18 | docs/s |
| Max Throughput | bulk | 383.27 | docs/s |
| 50th percentile latency | bulk | 4123.04 | ms |
| 90th percentile latency | bulk | 4213.35 | ms |
| 99th percentile latency | bulk | 4642.52 | ms |
| 100th percentile latency | bulk | 6369.56 | ms |
| 50th percentile service time | bulk | 4123.04 | ms |
| 90th percentile service time | bulk | 4213.35 | ms |
| 99th percentile service time | bulk | 4642.52 | ms |
| 100th percentile service time | bulk | 6369.56 | ms |
SageMaker CPU usage
Latency comparison
E2E latency also drops by 37% with the same bulk size when using the async httpclient. The reason is that the predict task no longer waits in the ml-commons predict thread pool queue, so the queuing time is eliminated.

Sync httpclient
- bulk size 200: 90th percentile E2E latency 6686.9 ms

Async httpclient
- bulk size 200: 90th percentile E2E latency 4213.35 ms
Looking forward to this improvement! The expected gains are very promising.
@zane-neo Can you help test whether fine-tuning the thread pool size can help for the sync client?
@ylwu-amzn, fine-tuning the thread pool size can definitely improve the sync http client's performance, but it is not optimal: threads consume system resources, and more threads increase context-switching overhead, so eventually we would just hit a new performance bottleneck. The async httpclient keeps resource consumption low and can sustain very high throughput with only a few threads, so I think we should go this way.