scylla-cdc-java
Read each partition individually rather than using WHERE {set} IN query
This PR changes the CDC stream read query so that each stream is read individually, with one Driver3Reader per stream.
Each Driver3Reader now uses a prepared statement with WHERE ... = ? instead of WHERE ... IN ?.
https://github.com/scylladb/scylla-cdc-source-connector/issues/15
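To illustrate the difference between the two query shapes described above, here is a minimal sketch that only builds the CQL text. It is not the code from this PR: the class and method names (CdcQueryText, inQuery, perStreamQuery) are hypothetical, and the exact SELECT the library issues is an assumption. The "cdc$stream_id" and "cdc$time" columns and the _scylla_cdc_log table suffix follow Scylla's CDC log table layout.

```java
// Hypothetical helper, for illustration only; not part of scylla-cdc-java.
final class CdcQueryText {
    // CDC log column names must be double-quoted in CQL because they contain '$'.
    private static final String STREAM_ID = "\"cdc$stream_id\"";
    private static final String TIME = "\"cdc$time\"";

    private CdcQueryText() {}

    // Scylla names the CDC log table after the base table with this suffix.
    static String logTable(String keyspace, String baseTable) {
        return keyspace + "." + baseTable + "_scylla_cdc_log";
    }

    // One statement for a whole set of streams: the stream IDs are bound as a list.
    static String inQuery(String keyspace, String baseTable) {
        return "SELECT * FROM " + logTable(keyspace, baseTable)
                + " WHERE " + STREAM_ID + " IN ?"
                + " AND " + TIME + " > ? AND " + TIME + " <= ?";
    }

    // One statement per stream: each execution binds a single stream ID,
    // so every query targets exactly one partition of the log table.
    static String perStreamQuery(String keyspace, String baseTable) {
        return "SELECT * FROM " + logTable(keyspace, baseTable)
                + " WHERE " + STREAM_ID + " = ?"
                + " AND " + TIME + " > ? AND " + TIME + " <= ?";
    }
}
```

With the per-stream form, the statement would be prepared once and executed once per stream ID in each time window, at the cost of issuing more requests than the IN form.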
@0xF0D0 did you check that this actually improves the performance?
Are you working with @pkgonan ?
Yep I am :) I'll upload the result of test in a couple of days
Sorry for the late reply. We've done some tests and proved that it reduces latency drastically.
The first image shows the results with the current Scylla driver; the second image shows the results with this PR's driver.
Please squash the commits so that in the history you don't fix errors in commits that you've introduced just before.
@haaawk done
Thanks @0xF0D0
@haaawk is there any further action needed?
We're having trouble replicating your performance results @0xF0D0. Would you be open to having a video call with us? We would like to understand why we don't see the same results.
@0xF0D0 of course :) please send invite to [email protected]!
Is there a problem with sending the invitation? I haven't received any mail. Could you check again?
FYI, my test environment is:
Cluster: i3en.3xlarge * 12, 2DC, 6 nodes per DC, on k8s
Stress nodes
- two write 100% nodes
- two mixed(ratio=read7,write1) nodes
CDC enabled on stress table.
Scylla Kafka Connector
- Kafka connect 3 nodes
As you can see, with the default driver the Scylla connector consumes with a latency of over 100 ms, but with this patch the latency drops to under 10 ms.
I've sent an invite over the email @0xF0D0.
track
@avelanarius ping
@avelanarius any chance to pick this up?
If it works better with IN queries on some clusters but better with individual queries on others, maybe the library should offer a config option to use one or the other?
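A config option like the one suggested here could look roughly as follows. This is only a sketch of the idea, not an existing setting in the library: StreamQueryMode, StreamQueryPlanner and plan are hypothetical names. Given the stream IDs to read in one time window, the planner returns one group per query to issue: a single group holding all IDs in IN mode, or one single-element group per stream in per-stream mode.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical config value: how stream IDs are mapped onto read queries.
enum StreamQueryMode { IN_QUERY, PER_STREAM }

// Hypothetical planner, for illustration only.
final class StreamQueryPlanner {
    private StreamQueryPlanner() {}

    // Returns one inner list per query to execute; each inner list holds
    // the stream IDs that would be bound to that query.
    static <T> List<List<T>> plan(StreamQueryMode mode, List<T> streamIds) {
        List<List<T>> groups = new ArrayList<>();
        if (streamIds.isEmpty()) {
            return groups; // nothing to read
        }
        if (mode == StreamQueryMode.IN_QUERY) {
            // A single query with all stream IDs bound to "IN ?".
            groups.add(new ArrayList<>(streamIds));
        } else {
            // One query per stream, each bound to "= ?".
            for (T id : streamIds) {
                groups.add(Collections.singletonList(id));
            }
        }
        return groups;
    }
}
```

Which mode performs better would still have to be measured per cluster and workload, as discussed in this thread.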
How will the user know which value to use? Testing of their own workload against their own cluster (and its version) ?
I guess so. Apparently that's what @0xF0D0 did, and they have been using this PR in production for 6 months now (discussion on Slack).
@kbr-scylla we have been using it in production for about 6 months, and it works well. I can show detailed benchmark results and discuss them.
@avelanarius - ping