Performance issues under load
The cql-proxy sidecar is having trouble handling larger loads. We brought it up in our staging environment, which has a consistent load averaging about 0.25 million reads per hour, and we noticed that our requests started timing out.
```
DB Retry for Get object has finally succeeded with 1 retries. No error on the query.
DB Retry last error: gocql: no response received from cassandra within timeout period
DB Retry for Objects exceeded retry count of 10.
DB Retry last error: gocql: no response received from cassandra within timeout period
```
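For context, the "DB Retry" lines come from our own retry wrapper around gocql queries. A simplified sketch of that kind of wrapper is below; the package name, helper name, retry budget, and backoff are illustrative, not our exact code.

```go
// Simplified sketch of an application-level retry wrapper around gocql that
// would emit "DB Retry ..." log lines like the ones above. Illustrative only.
package dbretry

import (
	"log"
	"time"

	"github.com/gocql/gocql"
)

const maxRetries = 10

// RetryQuery executes q, retrying on error up to maxRetries times.
func RetryQuery(q *gocql.Query) error {
	var err error
	for attempt := 0; attempt <= maxRetries; attempt++ {
		if err = q.Exec(); err == nil {
			if attempt > 0 {
				log.Printf("DB Retry finally succeeded with %d retries", attempt)
			}
			return nil
		}
		log.Printf("DB Retry last error: %v", err)
		time.Sleep(100 * time.Millisecond) // fixed backoff between attempts
	}
	log.Printf("DB Retry exceeded retry count of %d", maxRetries)
	return err
}
```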
Thanks for trying it out. A couple of questions:
- What version of the proxy? (`0.0.4`?)
- Are you using it against Astra or Cassandra?
- Is the ".25 million reads per hour" going through a single proxy instance?
- Approx. what size are the request and response payloads?
I think this might have been resolved by the following:
- I've optimized parsing significantly (https://github.com/datastax/cql-proxy/pull/58). I'm seeing way less CPU usage under much higher load. Give me a couple days to quantify this. I'll try the request rate from your workload (versus 0.0.4).
- I'm adding a new `--num-conns` parameter here: https://github.com/datastax/cql-proxy/pull/76. You might try bumping this up to `--num-conns 2`. This should be merged in the next couple of days.
Give `main` a try and see whether you get better results.
> .25 million reads per hour

1e6 * 0.25 / 3600 == ~69.4 requests/s
Is that right? I've run much bigger workloads than this without issue. Maybe the request/response payloads are much bigger?
- Yes, we are using `0.0.4`
- We're using it against Cassandra
- The .25 million reads is on a single proxy instance
- The size can vary from extremely small to megabyte streams
- That math seems correct.
Thanks for all the info. I'll try to reproduce and see what I can find.
> The size can vary from extremely small to megabyte streams
This is something I have not tried yet.
I ran some tests using 1MB payloads (reads) @ ~70 request/s. Here are the results (%iles in microseconds):
direct (no proxy):

```
num_requests, duration, final rate, min, mean, median, 75th, 95th, 98th, 99th, 99.9th, max
10000, 142.864, 69.9966, 3528, 10942, 10751, 10855, 11751, 13399, 14695, 27023, 40863
```

cql-proxy (`--num-conns 1`, version https://github.com/datastax/cql-proxy/commit/7a02afdb6961242f0f6dd806c4cc407e93fb02c3):

```
num_requests, duration, final rate, min, mean, median, 75th, 95th, 98th, 99th, 99.9th, max
10000, 142.878, 69.9897, 203, 20232, 21535, 22367, 27279, 28079, 30095, 47935, 68799
```

cql-proxy (`--num-conns 2`, version https://github.com/datastax/cql-proxy/commit/7a02afdb6961242f0f6dd806c4cc407e93fb02c3):

```
num_requests, duration, final rate, min, mean, median, 75th, 95th, 98th, 99th, 99.9th, max
10000, 142.875, 69.9914, 159, 20364, 21647, 23023, 27519, 28575, 30783, 47327, 270591
```
Raw data: https://gist.github.com/mpenick/256c313c4f075315c9a07cdf2bddc7c5
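For reference, the shape of the test is roughly the sketch below: issue reads at a fixed ~70 requests/s and record per-request latency in microseconds, then compute percentiles. This is not the actual harness; the contact point, keyspace/table, and query are placeholders.

```go
// Rough sketch (not the actual harness) of a fixed-rate read benchmark:
// ~70 requests/s, per-request latency recorded in microseconds.
package main

import (
	"fmt"
	"sort"
	"time"

	"github.com/gocql/gocql"
)

func main() {
	cluster := gocql.NewCluster("127.0.0.1") // cql-proxy, or Cassandra for the "direct" run
	cluster.Timeout = 10 * time.Second
	session, err := cluster.CreateSession()
	if err != nil {
		panic(err)
	}
	defer session.Close()

	const numRequests = 10000
	const rate = 70 // target requests per second

	ticker := time.NewTicker(time.Second / rate)
	defer ticker.Stop()

	latencies := make([]float64, 0, numRequests)
	var blob []byte
	for i := 0; i < numRequests; i++ {
		<-ticker.C // pace requests at the target rate
		start := time.Now()
		// Placeholder read; the real test fetches ~1MB values.
		if err := session.Query(`SELECT value FROM ks.payloads WHERE id = ?`, i%100).Scan(&blob); err != nil {
			fmt.Println("request failed:", err)
			continue
		}
		latencies = append(latencies, float64(time.Since(start).Microseconds()))
	}

	if len(latencies) == 0 {
		return
	}
	sort.Float64s(latencies)
	pct := func(p float64) float64 { return latencies[int(p*float64(len(latencies)-1))] }
	fmt.Printf("min %.0f  median %.0f  95th %.0f  99th %.0f  max %.0f (µs)\n",
		latencies[0], pct(0.50), pct(0.95), pct(0.99), latencies[len(latencies)-1])
}
```

A real harness would issue requests from several goroutines so that slow responses don't throttle the send rate, but the measurement idea is the same.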
I'm currently running a 1hr test @ ~70 request/s using cql-proxy. I'll post the result when done.
Results:

```
num_requests, duration, final rate, min, mean, median, 75th, 95th, 98th, 99th, 99.9th, max
250000, 3571.72, 69.9943, 232, 20476, 21487, 22655, 27215, 28191, 30111, 50175, 236543
```
Raw data: https://gist.github.com/mpenick/75bdd66e2699e6564e87d0e6140154a1
I noticed this: https://github.com/datastax/cql-proxy/commit/ea8421bab88bb6b85c6c22a1e1f010f5f955b026. I'm seeing a huge difference in performance from the reduced system calls. Give that a try; it's now on `main`. I'll also be pushing a release tomorrow.
We just tried v0.1.1 and we're seeing the same issue.
It initially came in spurts, within a minute of every 20 minutes (10:21, 10:51, 11:11), but then it became fairly constant, with occasional peaks of over 1000 failures within two minutes.
@mpenick Do you have any update/suggestions?
I haven't been able to reproduce the issue yet, so it's hard for me to know how to proceed here. I tried to reproduce it in the tests above.
What's the request timeout setting you're using for gocql?
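To be specific, I mean the per-request timeout on the gocql cluster config; a minimal sketch of where that lives is below (the address and values are placeholders, not recommendations).

```go
// Minimal sketch of where the gocql per-request timeout is configured.
// The address and durations are placeholders.
package db

import (
	"time"

	"github.com/gocql/gocql"
)

func newSession() (*gocql.Session, error) {
	cluster := gocql.NewCluster("127.0.0.1") // cql-proxy address
	cluster.Timeout = 600 * time.Millisecond // per-request timeout; this is what produces the "no response received ... within timeout period" error when exceeded
	cluster.ConnectTimeout = 5 * time.Second
	cluster.RetryPolicy = &gocql.SimpleRetryPolicy{NumRetries: 3}
	return cluster.CreateSession()
}
```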
Thinking about this a bit more, maybe it makes sense to add some initial metrics to cql-proxy. We could then see where the extra latency is coming from by pointing a Grafana instance at it.
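For example, a first pass could just be a `/metrics` endpoint exposing a request-latency histogram for Prometheus to scrape and Grafana to chart. A rough sketch using prometheus/client_golang is below; the metric name, labels, and port are made up for illustration.

```go
// Rough sketch of a proxy metrics endpoint: a request-latency histogram
// exposed for Prometheus/Grafana. Names, labels, and port are illustrative.
package main

import (
	"log"
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var requestLatency = prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Name:    "cqlproxy_request_duration_seconds",
		Help:    "Time spent proxying a single CQL request.",
		Buckets: prometheus.ExponentialBuckets(0.0005, 2, 12), // ~0.5ms up to ~1s
	},
	[]string{"opcode"},
)

func init() {
	prometheus.MustRegister(requestLatency)
}

// observe would be called around each proxied request in the request path.
func observe(opcode string, start time.Time) {
	requestLatency.WithLabelValues(opcode).Observe(time.Since(start).Seconds())
}

func main() {
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":9103", nil))
}
```

That would let us compare proxy-side latency against what the client sees and narrow down where the extra time is spent.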