chproxy
[BUG] CPU usage increased a lot since 1.17.0
Describe the bug
Cf. issue https://github.com/ContentSquare/chproxy/issues/222: CPU usage has increased since version 1.17.0.
With the new version we see noticeably higher CPU usage (around 60% per core on average, whereas before it was close to 0), but it's still acceptable.
This new behavior might be normal, since version 1.17.0 processes queries much faster.
The idea of this task is to investigate whether 1.17.0 introduced unnecessary CPU overhead.
Hey, I'm wondering if anything came of this. We've been running some benchmarks and I have some concerns, because we're seeing a consistent 60%-70% utilization on a 16-core machine, processing about 1,300 QPS at peak.
The queries themselves are relatively lightweight and don't require a ton of resources; memory usage stays around 35 MiB. Is this the expected behavior?
Here's the memory utilization:
The speed is good, but if we project this utilization out over our actual ClickHouse cluster, we'd be supplying more resources to chproxy than to the Python service connecting to it.
The debug logs don't offer a ton of insight here. I did try disabling any cache to see if perhaps cycles were being wasted trying to purge cache directories, which could account for the utilization, but I didn't see a change there. This seems worthy of a pprof profile; perhaps it's just a classic GC issue that might be easily resolved?
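For reference, here is a generic sketch of what capturing such a profile could look like in any Go service. Whether chproxy already exposes a pprof endpoint (and on which port) is an assumption here, not something confirmed in this thread; port 6060 is just the conventional example.

```go
// Generic sketch: expose Go's pprof handlers on a side port so a CPU profile
// can be captured while the benchmark is running.
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* handlers on http.DefaultServeMux
)

func main() {
	// Serve pprof on a port separate from query traffic.
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()

	// ... the proxy's normal work would run here ...
	select {}
}
```

With that in place, running `go tool pprof 'http://localhost:6060/debug/pprof/profile?seconds=30'` during the load test would show whether the time is going to the garbage collector (e.g. `runtime.gcBgMarkWorker` near the top) or somewhere else.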
I was able to improve performance slightly by setting GOMAXPROCS. This is deployed onto Kubernetes, and the node was reporting 32 CPUs available even though the pod is only allocated 30.
https://github.com/uber-go/automaxprocs might be a good option here.
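As a rough illustration of that suggestion (wiring it into chproxy is an assumption here, not something that exists today), the integration would be a single blank import:

```go
// Minimal sketch of the automaxprocs approach: the blank import adjusts
// GOMAXPROCS at startup to the container's CPU quota rather than the node's
// core count, removing the need to set it by hand in the pod spec.
package main

import (
	"fmt"
	"runtime"

	_ "go.uber.org/automaxprocs" // sets GOMAXPROCS from the cgroup CPU limit at init
)

func main() {
	// By the time main runs, GOMAXPROCS already reflects the pod's CPU allocation.
	fmt.Println("GOMAXPROCS =", runtime.GOMAXPROCS(0))
}
```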
Regardless, the same characteristic GC pauses (I think) still exist, and CPU utilization is very high.
At this point, I think I might actually need a bigger machine to test this further. It seems we're actually maxing out CPU utilization on the 30 available cores; I'm not sure how far it'll grow 👀
Can you try the same benchmark with a smaller machine? Since you're doing up to 1,300 QPS, it could be the garbage collector using as many resources as it can get.
FYI, in a previous issue regarding performance (which has since been fixed), the reporter was able to get a much better QPS/CPU ratio (2 CPU cores for an average QPS of 800-850): https://github.com/ContentSquare/chproxy/issues/264
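One rough way to test the GC hypothesis from inside any Go process is to log the fraction of CPU time the runtime attributes to garbage collection. This is a generic sketch (it would have to be compiled into the proxy or a patched build, which is an assumption here); roughly the same information is also available externally by running the process with `GODEBUG=gctrace=1`.

```go
// Generic sketch: periodically log how much of the process's CPU time the Go
// runtime attributes to garbage collection. The 10-second interval is arbitrary.
package main

import (
	"log"
	"runtime"
	"time"
)

func main() {
	var ms runtime.MemStats
	for range time.Tick(10 * time.Second) {
		runtime.ReadMemStats(&ms)
		// GCCPUFraction: fraction of this program's available CPU time used by the GC since start.
		log.Printf("GC CPU fraction: %.1f%%  GC cycles: %d", ms.GCCPUFraction*100, ms.NumGC)
	}
}
```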
So, an interesting discovery here. Please note we scaled the ClickHouse cluster since that last round of benchmarks, so this set of results isn't quite comparable to yesterday's. I went ahead and re-ran the benchmarks with 30, 16, 8, 6, and 4 cores allocated.
30 CPU allocated
| Type | Name | Request Count | Failure Count | Median Response Time (ms) | Average Response Time (ms) | Min Response Time (ms) | Max Response Time (ms) | Average Content Size | Requests/s | Failures/s | 50% | 66% | 75% | 80% | 90% | 95% | 98% | 99% | 99.9% | 99.99% | 100% |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| grpc | query_metrics | 465473 | 0 | 45 | 51.22 | 21.59 | 298.79 | 2.93 | 1536.19 | 0.0 | 45 | 54 | 61 | 66 | 82 | 110 | 130 | 150 | 200 | 240 | 300 |
|  | Aggregated | 465473 | 0 | 45 | 51.22 | 21.59 | 298.79 | 2.93 | 1536.19 | 0.0 | 45 | 54 | 61 | 66 | 82 | 110 | 130 | 150 | 200 | 240 | 300 |
16 CPU allocated
| Type | Name | Request Count | Failure Count | Median Response Time (ms) | Average Response Time (ms) | Min Response Time (ms) | Max Response Time (ms) | Average Content Size | Requests/s | Failures/s | 50% | 66% | 75% | 80% | 90% | 95% | 98% | 99% | 99.9% | 99.99% | 100% |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| grpc | query_metrics | 465740 | 0 | 45 | 51.19 | 21.68 | 296.40 | 2.94 | 1537.09 | 0.0 | 45 | 54 | 61 | 66 | 82 | 110 | 130 | 150 | 190 | 240 | 300 |
|  | Aggregated | 465740 | 0 | 45 | 51.19 | 21.68 | 296.40 | 2.94 | 1537.09 | 0.0 | 45 | 54 | 61 | 66 | 82 | 110 | 130 | 150 | 190 | 240 | 300 |
8 CPU allocated
| Type | Name | Request Count | Failure Count | Median Response Time (ms) | Average Response Time (ms) | Min Response Time (ms) | Max Response Time (ms) | Average Content Size | Requests/s | Failures/s | 50% | 66% | 75% | 80% | 90% | 95% | 98% | 99% | 99.9% | 99.99% | 100% |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| grpc | query_metrics | 469484 | 0 | 44 | 50.78 | 20.85 | 326.79 | 2.97 | 1549.45 | 0.0 | 44 | 53 | 61 | 66 | 82 | 110 | 130 | 150 | 200 | 250 | 330 |
|  | Aggregated | 469484 | 0 | 44 | 50.78 | 20.85 | 326.79 | 2.97 | 1549.45 | 0.0 | 44 | 53 | 61 | 66 | 82 | 110 | 130 | 150 | 200 | 250 | 330 |
6 CPU allocated
| Type | Name | Request Count | Failure Count | Median Response Time (ms) | Average Response Time (ms) | Min Response Time (ms) | Max Response Time (ms) | Average Content Size | Requests/s | Failures/s | 50% | 66% | 75% | 80% | 90% | 95% | 98% | 99% | 99.9% | 99.99% | 100% |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| grpc | query_metrics | 472992 | 0 | 44 | 50.40 | 20.12 | 317.36 | 2.92 | 1576.63 | 0.0 | 44 | 53 | 60 | 65 | 81 | 110 | 130 | 150 | 190 | 240 | 320 |
|  | Aggregated | 472992 | 0 | 44 | 50.40 | 20.12 | 317.36 | 2.92 | 1576.63 | 0.0 | 44 | 53 | 60 | 65 | 81 | 110 | 130 | 150 | 190 | 240 | 320 |
4 CPU allocated
| Type | Name | Request Count | Failure Count | Median Response Time (ms) | Average Response Time (ms) | Min Response Time (ms) | Max Response Time (ms) | Average Content Size | Requests/s | Failures/s | 50% | 66% | 75% | 80% | 90% | 95% | 98% | 99% | 99.9% | 99.99% | 100% |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| grpc | query_metrics | 473686 | 0 | 44 | 50.33 | 20.30 | 382.62 | 2.92 | 1563.33 | 0.0 | 44 | 53 | 60 | 65 | 81 | 100 | 130 | 140 | 190 | 250 | 380 |
|  | Aggregated | 473686 | 0 | 44 | 50.33 | 20.30 | 382.62 | 2.92 | 1563.33 | 0.0 | 44 | 53 | 60 | 65 | 81 | 100 | 130 | 140 | 190 | 250 | 380 |
Please note, I have set GOMAXPROCS in each of these runs to the number of allocated cores.
While I am sort of amused by the performance characteristics here, I think it would be worthwhile to try to run this down. Would you agree, or does it not seem like a priority issue?
My concern is that we wouldn't want to scale our proxy only horizontally; that has negative implications for open connections and the concurrency limits. For example, with 6 proxy nodes we'd need to limit concurrent queries to 1/6 of the total on each node (assuming we're still using the default ClickHouse concurrency of 100). If we continue to scale out, we would find ourselves tightening that per-proxy concurrency limit further and might see a less even query distribution than we get with a smaller set of proxy nodes.
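For illustration only, a back-of-the-envelope sketch of that scaling arithmetic; the budget of 100 is the default concurrency mentioned above, and the node counts are hypothetical.

```go
// Back-of-the-envelope: with a fixed ClickHouse-side concurrency budget, each
// extra proxy node shrinks the per-node concurrent-query limit.
package main

import "fmt"

func main() {
	const totalConcurrency = 100 // assumed ClickHouse-side budget (the default mentioned above)
	for _, nodes := range []int{1, 2, 6, 12} {
		fmt.Printf("%2d proxy nodes -> at most %d concurrent queries per node\n",
			nodes, totalConcurrency/nodes)
	}
}
```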