chproxy
[BUG] CPU usage increased a lot since 1.17.0
Describe the bug
Cf. issue https://github.com/ContentSquare/chproxy/issues/222: CPU usage has increased since version 1.17.0.
With the new version we see noticeably higher CPU usage (around 60% per core on average, whereas before it was close to 0), but it's still acceptable.
This new behavior might be normal, since version 1.17.0 processes queries much faster.
The idea of this task is to investigate whether 1.17.0 introduced unnecessary CPU overhead.
Hey, I'm wondering if anything came of this. We've been running some benchmarks and I have some concerns, because we're seeing a consistent 60%-70% utilization on a 16-core machine, processing about 1,300 QPS at peak.
The queries themselves are relatively lightweight and don't require a ton of resources; memory usage stays around 35 MiB. Is this the expected behavior?
Here's the memory utilization:
The speed is good, but if we project this utilization out over our actual ClickHouse cluster, we'd be supplying more resources to chproxy than to the Python service connecting to it.
The debug logs don't offer a ton of insight here. I did try disabling any cache to see if perhaps cycles were being wasted trying to purge cache directories, which could account for the utilization, but I didn't see a change there. This seems worthy of a pprof profile; perhaps it's just a classic GC issue that might be easily resolved?
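For reference, here is a generic sketch of what capturing such a profile could look like in any Go service. Whether chproxy already exposes a pprof endpoint (and on which port) is an assumption here, not something confirmed in this thread; port 6060 is just the conventional example.

```go
// Generic sketch: expose Go's pprof handlers on a side port so a CPU profile
// can be captured while the benchmark is running.
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* handlers on http.DefaultServeMux
)

func main() {
	// Serve pprof on a port separate from query traffic.
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()

	// ... the proxy's normal work would run here ...
	select {}
}
```

With that in place, running `go tool pprof 'http://localhost:6060/debug/pprof/profile?seconds=30'` during the load test would show whether the time is going to the garbage collector (e.g. `runtime.gcBgMarkWorker` near the top) or somewhere else.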
I was able to improve performance slightly by setting GOMAXPROCS. This is deployed onto Kubernetes, and the node was reporting 32 CPUs available even though the pod is only allocated 30.
https://github.com/uber-go/automaxprocs might be a good option here.
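As a rough illustration of that suggestion (wiring it into chproxy is an assumption here, not something that exists today), the integration would be a single blank import:

```go
// Minimal sketch of the automaxprocs approach: the blank import adjusts
// GOMAXPROCS at startup to the container's CPU quota rather than the node's
// core count, removing the need to set it by hand in the pod spec.
package main

import (
	"fmt"
	"runtime"

	_ "go.uber.org/automaxprocs" // sets GOMAXPROCS from the cgroup CPU limit at init
)

func main() {
	// By the time main runs, GOMAXPROCS already reflects the pod's CPU allocation.
	fmt.Println("GOMAXPROCS =", runtime.GOMAXPROCS(0))
}
```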
Regardless, the same characteristic GC pauses (I think) still exist, and CPU utilization is very high.
At this point, I think I might actually need a bigger machine to test this further. It seems we're actually maxing out CPU utilization on the 30 available cores; I'm not sure how far it'll grow 👀
Can you try the same benchmark with a smaller machine? Since you're doing up to 1,300 QPS, it could be the garbage collector using as many resources as it can get.
FYI, in a previous issue regarding performance (which has since been fixed), the reporter was able to get a much better QPS/CPU ratio (2 CPU cores for an average QPS of 800-850): https://github.com/ContentSquare/chproxy/issues/264
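One rough way to test the GC hypothesis from inside any Go process is to log the fraction of CPU time the runtime attributes to garbage collection. This is a generic sketch (it would have to be compiled into the proxy or a patched build, which is an assumption here); roughly the same information is also available externally by running the process with `GODEBUG=gctrace=1`.

```go
// Generic sketch: periodically log how much of the process's CPU time the Go
// runtime attributes to garbage collection. The 10-second interval is arbitrary.
package main

import (
	"log"
	"runtime"
	"time"
)

func main() {
	var ms runtime.MemStats
	for range time.Tick(10 * time.Second) {
		runtime.ReadMemStats(&ms)
		// GCCPUFraction: fraction of this program's available CPU time used by the GC since start.
		log.Printf("GC CPU fraction: %.1f%%  GC cycles: %d", ms.GCCPUFraction*100, ms.NumGC)
	}
}
```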
So, an interesting discovery here. Please note we scaled the ClickHouse cluster since that last round of benchmarks, so this set of results isn't quite comparable to yesterday's. I went ahead and re-ran the benchmarks with 30, 16, 8, 6, and 4 cores allocated.
30 CPU allocated
| Type | Name | Request Count | Failure Count | Median Response Time (ms) | Average Response Time (ms) | Min Response Time (ms) | Max Response Time (ms) | Average Content Size | Requests/s | Failures/s | 50% | 66% | 75% | 80% | 90% | 95% | 98% | 99% | 99.9% | 99.99% | 100% |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| grpc | query_metrics | 465473 | 0 | 45 | 51.22 | 21.59 | 298.79 | 2.93 | 1536.19 | 0.0 | 45 | 54 | 61 | 66 | 82 | 110 | 130 | 150 | 200 | 240 | 300 |
|  | Aggregated | 465473 | 0 | 45 | 51.22 | 21.59 | 298.79 | 2.93 | 1536.19 | 0.0 | 45 | 54 | 61 | 66 | 82 | 110 | 130 | 150 | 200 | 240 | 300 |
16 CPU allocated
| Type | Name | Request Count | Failure Count | Median Response Time (ms) | Average Response Time (ms) | Min Response Time (ms) | Max Response Time (ms) | Average Content Size | Requests/s | Failures/s | 50% | 66% | 75% | 80% | 90% | 95% | 98% | 99% | 99.9% | 99.99% | 100% |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| grpc | query_metrics | 465740 | 0 | 45 | 51.19 | 21.68 | 296.40 | 2.94 | 1537.09 | 0.0 | 45 | 54 | 61 | 66 | 82 | 110 | 130 | 150 | 190 | 240 | 300 |
|  | Aggregated | 465740 | 0 | 45 | 51.19 | 21.68 | 296.40 | 2.94 | 1537.09 | 0.0 | 45 | 54 | 61 | 66 | 82 | 110 | 130 | 150 | 190 | 240 | 300 |
8 CPU allocated
| Type | Name | Request Count | Failure Count | Median Response Time (ms) | Average Response Time (ms) | Min Response Time (ms) | Max Response Time (ms) | Average Content Size | Requests/s | Failures/s | 50% | 66% | 75% | 80% | 90% | 95% | 98% | 99% | 99.9% | 99.99% | 100% |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| grpc | query_metrics | 469484 | 0 | 44 | 50.78 | 20.85 | 326.79 | 2.97 | 1549.45 | 0.0 | 44 | 53 | 61 | 66 | 82 | 110 | 130 | 150 | 200 | 250 | 330 |
|  | Aggregated | 469484 | 0 | 44 | 50.78 | 20.85 | 326.79 | 2.97 | 1549.45 | 0.0 | 44 | 53 | 61 | 66 | 82 | 110 | 130 | 150 | 200 | 250 | 330 |
6 CPU allocated
| Type | Name | Request Count | Failure Count | Median Response Time (ms) | Average Response Time (ms) | Min Response Time (ms) | Max Response Time (ms) | Average Content Size | Requests/s | Failures/s | 50% | 66% | 75% | 80% | 90% | 95% | 98% | 99% | 99.9% | 99.99% | 100% |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| grpc | query_metrics | 472992 | 0 | 44 | 50.40 | 20.12 | 317.36 | 2.92 | 1576.63 | 0.0 | 44 | 53 | 60 | 65 | 81 | 110 | 130 | 150 | 190 | 240 | 320 |
|  | Aggregated | 472992 | 0 | 44 | 50.40 | 20.12 | 317.36 | 2.92 | 1576.63 | 0.0 | 44 | 53 | 60 | 65 | 81 | 110 | 130 | 150 | 190 | 240 | 320 |
4 CPU allocated
| Type | Name | Request Count | Failure Count | Median Response Time (ms) | Average Response Time (ms) | Min Response Time (ms) | Max Response Time (ms) | Average Content Size | Requests/s | Failures/s | 50% | 66% | 75% | 80% | 90% | 95% | 98% | 99% | 99.9% | 99.99% | 100% |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| grpc | query_metrics | 473686 | 0 | 44 | 50.33 | 20.30 | 382.62 | 2.92 | 1563.33 | 0.0 | 44 | 53 | 60 | 65 | 81 | 100 | 130 | 140 | 190 | 250 | 380 |
|  | Aggregated | 473686 | 0 | 44 | 50.33 | 20.30 | 382.62 | 2.92 | 1563.33 | 0.0 | 44 | 53 | 60 | 65 | 81 | 100 | 130 | 140 | 190 | 250 | 380 |
Please note, I have set GOMAXPROCS in each of these runs to the number of allocated cores.
While I am sort of amused by the performance characteristics here, I think it would be worthwhile to try to run this down. Would you agree, or does it not seem like a priority issue?
My concern is that we wouldn't want to scale our proxy only horizontally; that has negative implications for open connections and the concurrency limits. For example, with 6 proxy nodes we'd need to limit concurrent queries to 1/6 of the total on each node (assuming we're still using the default ClickHouse concurrency of 100). If we continue to scale out, we would find ourselves tightening that per-proxy concurrency limit further and might see a less even query distribution than we get with a smaller set of proxy nodes.
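For illustration only, a back-of-the-envelope sketch of that scaling arithmetic; the budget of 100 is the default concurrency mentioned above, and the node counts are hypothetical.

```go
// Back-of-the-envelope: with a fixed ClickHouse-side concurrency budget, each
// extra proxy node shrinks the per-node concurrent-query limit.
package main

import "fmt"

func main() {
	const totalConcurrency = 100 // assumed ClickHouse-side budget (the default mentioned above)
	for _, nodes := range []int{1, 2, 6, 12} {
		fmt.Printf("%2d proxy nodes -> at most %d concurrent queries per node\n",
			nodes, totalConcurrency/nodes)
	}
}
```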