scylla-cdc-java Read each partition individually rather than using WHERE {set} IN query

This PR changes the CDC stream read query so that each stream is read individually, with one Driver3Reader per stream.

Each Driver3Reader now uses a prepared statement that filters with "= ?" instead of "IN ?".
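
For illustration, the difference between the two query shapes might look like the sketch below. It only builds CQL strings and does not touch the driver. The class and method names are hypothetical, and the exact statement the library prepares is an assumption; the `_scylla_cdc_log` table suffix and the `cdc$stream_id` / `cdc$time` columns are those of Scylla CDC log tables.

```java
// A minimal sketch, not the library's actual code: all names here are hypothetical.
final class StreamQuerySketch {
    // CDC log table that accompanies a CDC-enabled base table.
    static String logTable(String keyspace, String table) {
        return keyspace + "." + table + "_scylla_cdc_log";
    }

    // Previous shape: one statement for a whole set of streams.
    // The first bind marker receives a list of stream IDs.
    static String inQuery(String keyspace, String table) {
        return "SELECT * FROM " + logTable(keyspace, table)
                + " WHERE \"cdc$stream_id\" IN ?"
                + " AND \"cdc$time\" > ? AND \"cdc$time\" <= ?";
    }

    // Shape described in this PR: one statement per stream.
    // The first bind marker receives a single stream ID.
    static String perStreamQuery(String keyspace, String table) {
        return "SELECT * FROM " + logTable(keyspace, table)
                + " WHERE \"cdc$stream_id\" = ?"
                + " AND \"cdc$time\" > ? AND \"cdc$time\" <= ?";
    }
}
```

With the per-stream shape, each reader prepares its statement once and binds its own stream ID, so every request targets exactly one partition of the log table.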

https://github.com/scylladb/scylla-cdc-source-connector/issues/15

Nov 22 '21 08:11 0xF0D0

@0xF0D0 did you check that this actually improves the performance?

Are you working with @pkgonan ?

Dec 01 '21 10:12 haaawk

Yep, I am :) I'll upload the test results in a couple of days.

Dec 07 '21 08:12 0xF0D0

Sorry for the late reply. We've run some tests and confirmed that it reduces latency drastically.

current_slow_log.txt

improved_slow_log.txt

Dec 22 '21 07:12 0xF0D0

The first image is from a run with the current Scylla driver. The second image is from a run with this PR's driver.

Jan 03 '22 10:01 0xF0D0

Please squash the commits so that in the history you don't fix errors in commits that you've introduced just before.

Jan 10 '22 13:01 haaawk

@haaawk done

Jan 11 '22 05:01 0xF0D0

Thanks @0xF0D0

Jan 11 '22 07:01 haaawk

@haaawk is there any further action needed?

Jan 18 '22 08:01 0xF0D0

We're having trouble replicating your performance results, @0xF0D0. Would you be open to a video call with us? We would like to understand why we don't see the same results.

Jan 18 '22 11:01 haaawk

@0xF0D0 of course :) Please send the invite to [email protected]!

Jan 19 '22 07:01 0xF0D0

Is there a problem sending the invitation? I haven't received any email... could you check again?

FYI, my test environment is:

Cluster: i3en.3xlarge * 12, 2DC, 6 nodes per DC, on k8s

Stress nodes
- two 100% write nodes
- two mixed nodes (read:write ratio 7:1)

CDC enabled on stress table.

Scylla Kafka Connector
- 3 Kafka Connect nodes

With the default driver, the Scylla connector consumes with a latency of over 100 ms; with this patch, the latency drops to under 10 ms.

Feb 20 '22 12:02 0xF0D0

I've sent an invite over the email @0xF0D0.

Feb 22 '22 11:02 haaawk

track

Jun 16 '22 03:06 hansh0801

@avelanarius ping

Jun 21 '22 17:06 fee-mendes

@avelanarius any chance to pick this up?

If it works better with IN queries on some clusters but better with individual queries on others, maybe the library should offer a config option to use one or the other?
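
Such an option could be sketched roughly as follows. This is a hypothetical illustration, not an existing setting in scylla-cdc-java: the enum, its values, and the `planQueries` helper are assumptions made up for the example. It only shows how a single switch could decide whether a group of streams is read with one IN query or with one query per stream.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// A minimal sketch of a hypothetical config switch; not part of the library.
final class ReadModeSketch {
    // Hypothetical option values.
    enum StreamReadMode { IN_QUERY, PER_STREAM }

    // Returns the groups of stream IDs to query; each inner list is one statement execution.
    static List<List<String>> planQueries(List<String> streamIds, StreamReadMode mode) {
        List<List<String>> plan = new ArrayList<>();
        if (streamIds.isEmpty()) {
            return plan; // nothing to read
        }
        if (mode == StreamReadMode.IN_QUERY) {
            // One statement: all stream IDs are bound to a single IN marker.
            plan.add(new ArrayList<>(streamIds));
        } else {
            // One statement per stream: each ID is bound to its own "= ?" query.
            for (String id : streamIds) {
                plan.add(Collections.singletonList(id));
            }
        }
        return plan;
    }
}
```

For three streams, `IN_QUERY` yields one group of three IDs, while `PER_STREAM` yields three groups of one ID each.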

Apr 27 '23 09:04 kbr-scylla

> @avelanarius any chance to pick this up?
>
> If it works better with IN queries on some clusters but better with individual queries on others, maybe the library should offer a config option to use one or the other?

How will the user know which value to use? Testing of their own workload against their own cluster (and its version) ?

Apr 27 '23 10:04 mykaul

> How will the user know which value to use? Testing of their own workload against their own cluster (and its version) ?

I guess so. Apparently that's what @0xF0D0 did, and they have been using this PR in production for 6 months now (discussion on Slack).

Apr 27 '23 10:04 kbr-scylla

@kbr-scylla we have been using it in production for about 6 months, and it works well. I can share detailed benchmark results and discuss them.

Apr 27 '23 10:04 hansh0801

@avelanarius - ping

Feb 25 '24 13:02 mykaul
