4.x: Stabilize AdvancedShardAwarenessIT
On recent Scylla versions this test started failing periodically. It looks like with newer Scylla the driver somehow hits a scenario where it successfully initializes a good portion of the connections, and then all connection attempts to one of the nodes get rejected. This is accompanied by multiple errors like this:
```
19:38:41.582 [s0-admin-1] WARN c.d.o.d.i.core.pool.ChannelPool - [s0|/127.0.2.2:19042] Error while opening new channel
com.datastax.oss.driver.api.core.DriverTimeoutException: [s0|id: 0xfc42b7c7, L:/127.0.0.1:11854 - R:/127.0.2.2:19042] Protocol initialization request, step 1 (OPTIONS): timed out after 5000 ms
	at com.datastax.oss.driver.internal.core.channel.ChannelHandlerRequest.onTimeout(ChannelHandlerRequest.java:110)
	at io.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:98)
	at io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:160)
	at io.netty.util.concurrent.AbstractEventExecutor.runTask(AbstractEventExecutor.java:173)
	at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:166)
	at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:569)
	at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:998)
	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	at java.base/java.lang.Thread.run(Thread.java:829)
```
Increasing the delays between reconnections, or even increasing the test timeout (the largest value tried was 40 seconds), does not help. The node logs do not show anything suspicious, not even a WARN.
This change lowers the number of nodes to 1 (previously 2) and the number of expected channels per session to 33 (previously 66) in resource-heavy test methods. The number of sessions remains at 4.
The reconnection delays in `should_not_struggle_to_fill_pools` will now start at around 300ms and should not rise above 3200ms.
This is the smallest tested set of changes that seems to resolve the issue.
The test remains meaningful since `should_struggle_to_fill_pools` still displays considerably worse performance without advanced shard awareness.
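For reference, the 300ms-to-3200ms delay range above corresponds to a standard doubling backoff. A minimal sketch of that schedule (this is an illustration of the doubling-with-cap behavior, not the driver's actual `ExponentialReconnectionPolicy` implementation):

```java
// Sketch (assumption): reconnection delay doubles from a base value on each
// attempt and is capped at a maximum, yielding 300 -> 600 -> 1200 -> 2400 -> 3200 ms.
public class BackoffSketch {

    // attempt is 1-based; the shift is clamped to avoid long overflow
    public static long delay(long baseMs, long maxMs, int attempt) {
        long d = baseMs * (1L << Math.min(attempt - 1, 30));
        return Math.min(d, maxMs);
    }

    public static void main(String[] args) {
        for (int attempt = 1; attempt <= 6; attempt++) {
            System.out.println("attempt " + attempt + ": "
                + delay(300, 3200, attempt) + " ms");
        }
    }
}
```

With these numbers a node gets five progressively slower retries before the delay plateaus at 3200ms, which keeps the test duration bounded.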
It seems that some logs leak between test methods; I need to fix that too. Also, the channel pool can sometimes be null when the await's condition is checked.
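The null-pool race can be handled by treating a missing pool as "not ready yet" inside the polled condition, so the await keeps retrying instead of throwing an NPE. A minimal self-contained sketch (the map and method names here are hypothetical stand-ins, not the actual test helpers):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class AwaitSketch {
    // Hypothetical stand-in for the per-node channel pools / channel counts.
    public static final Map<String, Integer> channelCounts = new ConcurrentHashMap<>();

    // Null-safe condition: a missing pool means the session is still
    // initializing, so the condition is simply "not yet satisfied".
    public static boolean poolReady(String node, int expectedChannels) {
        Integer count = channelCounts.get(node); // may be null mid-initialization
        return count != null && count >= expectedChannels;
    }

    public static void main(String[] args) {
        System.out.println(poolReady("/127.0.2.2:19042", 33)); // pool not created yet
        channelCounts.put("/127.0.2.2:19042", 33);
        System.out.println(poolReady("/127.0.2.2:19042", 33));
    }
}
```

A condition shaped like this can be dropped into an Awaitility `until(...)` lambda, so the null check lives in one place instead of being repeated at each call site.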
~~Some of the reconnections come from RandomTokenVnodesIT, which should not be running at all: from what I can see it has no Scylla-backend annotation and also has a ScyllaSkip annotation.~~
This was incorrect; the reconnections belong to the cluster created for the next test, DefaultMetadataTabletMapIT.
RandomTokenVnodesIT started taking a few seconds longer even though it is marked for skipping, so there was a slight change there too, but that does not make it the suspect here yet.
Looks green now. Created #678 to follow up on root cause.
@Bouncheck , I have a couple of questions:
- Why and when exactly did it start failing? It was not like that two weeks ago.
- Why is it not failing on `6.1.5` but failing on other Scylla versions? What is the difference?
> 1. Why and when exactly did it start failing? It was not like that two weeks ago.
Why is currently unclear. The first sighting seems to be the GitHub Actions run after pushing https://github.com/scylladb/java-driver/commit/ab2665fe1fc5badb47827153be63ba3d78e716fe However, the final run visible under the PR does not have the same failures: https://github.com/scylladb/java-driver/actions/runs/17986407382 Right now I don't see anything significant in this commit that would explain the failures.
> 2. Why is it not failing on `6.1.5` but failing on other Scylla versions? What is the difference?
Also currently unclear. It could be something on the server side. One common thread I see among the failing runs is that there are sessions trying to communicate with the cluster created for DefaultMetadataTabletMapIT, which is long gone. It could be a mix of something incorrect in the test code that surfaced due to some change on the server side.
Those extra sessions and reconnections cause additional matches to appear in the logs, but they are unrelated to the advanced shard awareness tests. They could also be making port collisions or timeouts slightly more likely.
Before merging this, let's evaluate #682 . I think switching from hardcoded sleeps to awaits and using more specific patterns are still worthwhile changes, but maybe I should not be increasing the reconnection tolerance here.
@Bouncheck , please rebase it
Rebased and reworked this PR, and updated the PR description to reflect the current changes. Issues outside of adjusting the test configuration were addressed in #737
@dkropachev Do you think we can merge this PR?
As I mentioned before, I don't think we should merge it, since it most likely hides an actual issue. We'd be better off investigating the failures and finding out why this is happening.