Performance issues under load
The cql-proxy sidecar is having trouble handling larger loads. We brought it up in our staging environment, which has a consistent load averaging about 0.25 million reads per hour, and we noticed that our requests started timing out.
```
DB Retry for Get object has finally succeeded with 1 retries. No error on the query.
DB Retry last error: gocql: no response received from cassandra within timeout period
DB Retry for Objects exceeded retry count of 10.
DB Retry last error: gocql: no response received from cassandra within timeout period
```
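For context, the "DB Retry" lines come from our own retry wrapper around gocql queries. A simplified sketch of that kind of wrapper is below; the package name, helper name, retry budget, and backoff are illustrative, not our exact code.

```go
// Simplified sketch of an application-level retry wrapper around gocql that
// would emit "DB Retry ..." log lines like the ones above. Illustrative only.
package dbretry

import (
	"log"
	"time"

	"github.com/gocql/gocql"
)

const maxRetries = 10

// RetryQuery executes q, retrying on error up to maxRetries times.
func RetryQuery(q *gocql.Query) error {
	var err error
	for attempt := 0; attempt <= maxRetries; attempt++ {
		if err = q.Exec(); err == nil {
			if attempt > 0 {
				log.Printf("DB Retry finally succeeded with %d retries", attempt)
			}
			return nil
		}
		log.Printf("DB Retry last error: %v", err)
		time.Sleep(100 * time.Millisecond) // fixed backoff between attempts
	}
	log.Printf("DB Retry exceeded retry count of %d", maxRetries)
	return err
}
```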
Thanks for trying it out. A couple of questions:
- What version of the proxy? (`0.0.4`?)
- Are you using it against Astra or Cassandra?
- Is the ".25 million reads per hour" going through a single proxy instance?
- Approx. what size are the request and response payloads?
I think this might have been resolved by the following:
- I've optimized parsing significantly (https://github.com/datastax/cql-proxy/pull/58). I'm seeing way less CPU usage under much higher load. Give me a couple days to quantify this. I'll try the request rate from your workload (versus 0.0.4).
- I'm adding a new `--num-conns` parameter here: https://github.com/datastax/cql-proxy/pull/76. You might try bumping this up to `--num-conns 2`. This should be merged in the next couple of days.
Give `main` a try and see whether you get better results.
> .25 million reads per hour

1e6 * 0.25 / 3600 == ~69.4 requests/s
Is that right? I've run much bigger workloads than this without issue. Maybe the request/response payloads are much bigger?
- Yes, we are using `0.0.4`
- We're using it against Cassandra
- The .25 million reads is on a single proxy instance
- The size can vary from extremely small to megabyte streams
- That math seems correct.
Thanks for all the info. I'll try to reproduce and see what I can find.
> The size can vary from extremely small to megabyte streams
This is something I have not tried yet.
I ran some tests using 1MB payloads (reads) @ ~70 request/s. Here are the results (%iles in microseconds):
direct (no proxy):

```
num_requests, duration, final rate, min, mean, median, 75th, 95th, 98th, 99th, 99.9th, max
10000, 142.864, 69.9966, 3528, 10942, 10751, 10855, 11751, 13399, 14695, 27023, 40863
```

cql-proxy (`--num-conns 1`, version https://github.com/datastax/cql-proxy/commit/7a02afdb6961242f0f6dd806c4cc407e93fb02c3):

```
num_requests, duration, final rate, min, mean, median, 75th, 95th, 98th, 99th, 99.9th, max
10000, 142.878, 69.9897, 203, 20232, 21535, 22367, 27279, 28079, 30095, 47935, 68799
```

cql-proxy (`--num-conns 2`, version https://github.com/datastax/cql-proxy/commit/7a02afdb6961242f0f6dd806c4cc407e93fb02c3):

```
num_requests, duration, final rate, min, mean, median, 75th, 95th, 98th, 99th, 99.9th, max
10000, 142.875, 69.9914, 159, 20364, 21647, 23023, 27519, 28575, 30783, 47327, 270591
```
Raw data: https://gist.github.com/mpenick/256c313c4f075315c9a07cdf2bddc7c5
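For reference, the shape of the test is roughly the sketch below: issue reads at a fixed ~70 requests/s and record per-request latency in microseconds, then compute percentiles. This is not the actual harness; the contact point, keyspace/table, and query are placeholders.

```go
// Rough sketch (not the actual harness) of a fixed-rate read benchmark:
// ~70 requests/s, per-request latency recorded in microseconds.
package main

import (
	"fmt"
	"sort"
	"time"

	"github.com/gocql/gocql"
)

func main() {
	cluster := gocql.NewCluster("127.0.0.1") // cql-proxy, or Cassandra for the "direct" run
	cluster.Timeout = 10 * time.Second
	session, err := cluster.CreateSession()
	if err != nil {
		panic(err)
	}
	defer session.Close()

	const numRequests = 10000
	const rate = 70 // target requests per second

	ticker := time.NewTicker(time.Second / rate)
	defer ticker.Stop()

	latencies := make([]float64, 0, numRequests)
	var blob []byte
	for i := 0; i < numRequests; i++ {
		<-ticker.C // pace requests at the target rate
		start := time.Now()
		// Placeholder read; the real test fetches ~1MB values.
		if err := session.Query(`SELECT value FROM ks.payloads WHERE id = ?`, i%100).Scan(&blob); err != nil {
			fmt.Println("request failed:", err)
			continue
		}
		latencies = append(latencies, float64(time.Since(start).Microseconds()))
	}

	if len(latencies) == 0 {
		return
	}
	sort.Float64s(latencies)
	pct := func(p float64) float64 { return latencies[int(p*float64(len(latencies)-1))] }
	fmt.Printf("min %.0f  median %.0f  95th %.0f  99th %.0f  max %.0f (µs)\n",
		latencies[0], pct(0.50), pct(0.95), pct(0.99), latencies[len(latencies)-1])
}
```

A real harness would issue requests from several goroutines so that slow responses don't throttle the send rate, but the measurement idea is the same.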
I'm currently running a 1hr test @ ~70 request/s using cql-proxy. I'll post the result when done.
Results:

```
num_requests, duration, final rate, min, mean, median, 75th, 95th, 98th, 99th, 99.9th, max
250000, 3571.72, 69.9943, 232, 20476, 21487, 22655, 27215, 28191, 30111, 50175, 236543
```
Raw data: https://gist.github.com/mpenick/75bdd66e2699e6564e87d0e6140154a1
I noticed this: https://github.com/datastax/cql-proxy/commit/ea8421bab88bb6b85c6c22a1e1f010f5f955b026. I'm seeing a huge difference in performance from the reduced system calls. Give that a try; it's now on `main`. I'll also be pushing a release tomorrow.
We just tried v0.1.1 and we're seeing the same issue.
It initially came in spurts, within a minute of every 20 minutes (10:21, 10:51, 11:11), but then it became fairly constant, with occasional peaks of over 1000 failures within two minutes.
@mpenick Do you have any update/suggestions?
I haven't been able to reproduce the issue yet, so it's hard for me to know how to proceed here. I tried to reproduce it in the tests above.
What's the request timeout setting you're using for gocql?
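To be specific, I mean the per-request timeout on the gocql cluster config; a minimal sketch of where that lives is below (the address and values are placeholders, not recommendations).

```go
// Minimal sketch of where the gocql per-request timeout is configured.
// The address and durations are placeholders.
package db

import (
	"time"

	"github.com/gocql/gocql"
)

func newSession() (*gocql.Session, error) {
	cluster := gocql.NewCluster("127.0.0.1") // cql-proxy address
	cluster.Timeout = 600 * time.Millisecond // per-request timeout; this is what produces the "no response received ... within timeout period" error when exceeded
	cluster.ConnectTimeout = 5 * time.Second
	cluster.RetryPolicy = &gocql.SimpleRetryPolicy{NumRetries: 3}
	return cluster.CreateSession()
}
```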
Thinking about this a bit more, maybe it makes sense to add some initial metrics to cql-proxy. We could then see where the extra latency is coming from by pointing a Grafana instance at it.
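For example, a first pass could just be a `/metrics` endpoint exposing a request-latency histogram for Prometheus to scrape and Grafana to chart. A rough sketch using prometheus/client_golang is below; the metric name, labels, and port are made up for illustration.

```go
// Rough sketch of a proxy metrics endpoint: a request-latency histogram
// exposed for Prometheus/Grafana. Names, labels, and port are illustrative.
package main

import (
	"log"
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var requestLatency = prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Name:    "cqlproxy_request_duration_seconds",
		Help:    "Time spent proxying a single CQL request.",
		Buckets: prometheus.ExponentialBuckets(0.0005, 2, 12), // ~0.5ms up to ~1s
	},
	[]string{"opcode"},
)

func init() {
	prometheus.MustRegister(requestLatency)
}

// observe would be called around each proxied request in the request path.
func observe(opcode string, start time.Time) {
	requestLatency.WithLabelValues(opcode).Observe(time.Since(start).Seconds())
}

func main() {
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":9103", nil))
}
```

That would let us compare proxy-side latency against what the client sees and narrow down where the extra time is spent.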