
After the entire cluster was replaced with new nodes (decommission -> add new node), c-s continues to use an old node that was provided in the "-host" parameter

Open aleksbykov opened this issue 1 year ago • 5 comments

The test runs two operations: add a new node and decommission a random node. The cluster started with 3 nodes. On each iteration, a new node was added and one random node was decommissioned. All operations went fine until the last of the initial cluster's nodes, node1, started decommissioning; at that point cassandra-stress terminated with errors:

com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: /10.3.0.133:9042 (com.datastax.driver.core.exceptions.ConnectionException: [/10.3.0.133:9042] Write attempt on defunct connection), ip-10-3-1-72.eu-west-2.compute.internal/10.3.1.72:9042 (com.datastax.driver.core.exceptions.ConnectionException: [ip-10-3-1-72.eu-west-2.compute.internal/10.3.1.72:9042] Write attempt on defunct connection))
        at org.apache.cassandra.stress.operations.predefined.CqlOperation.run(CqlOperation.java:264)
com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: /10.3.0.133:9042 (com.datastax.driver.core.exceptions.ConnectionException: [/10.3.0.133:9042] Write attempt on defunct connection), ip-10-3-1-72.eu-west-2.compute.internal/10.3.1.72:9042 (com.datastax.driver.core.exceptions.ConnectionException: [ip-10-3-1-72.eu-west-2.compute.internal/10.3.1.72:9042] Write attempt on defunct connection))
        at org.apache.cassandra.stress.StressAction$Consumer.run(StressAction.java:473)
com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: /10.3.0.133:9042 (com.datastax.driver.core.exceptions.ConnectionException: [/10.3.0.133:9042] Write attempt on defunct connection), ip-10-3-1-72.eu-west-2.compute.internal/10.3.1.72:9042 (com.datastax.driver.core.exceptions.ConnectionException: [ip-10-3-1-72.eu-west-2.compute.internal/10.3.1.72:9042] Write attempt on defunct connection))
java.io.IOException: Operation x10 on key(s) [395038363034324c4b30]: Error executing: (NoHostAvailableException): All host(s) tried for query failed (tried: ip-10-3-1-72.eu-west-2.compute.internal/10.3.1.72:9042 (com.datastax.driver.core.exceptions.ConnectionException: [ip-10-3-1-72.eu-west-2.compute.internal/10.3.1.72:9042] Write attempt on defunct connection))
com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: /10.3.0.133:9042 (com.datastax.driver.core.exceptions.ConnectionException: [/10.3.0.133:9042] Write attempt on defunct connection), ip-10-3-1-72.eu-west-2.compute.internal/10.3.1.72:9042 (com.datastax.driver.core.exceptions.ConnectionException: [ip-10-3-1-72.eu-west-2.compute.internal/10.3.1.72:9042] Write attempt on defunct connection))

com.datastax.driver.core.exceptions.TransportException: [ip-10-3-1-72.eu-west-2.compute.internal/10.3.1.72:9042] Error writing
        at org.apache.cassandra.stress.Operation.error(Operation.java:141)
com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: ip-10-3-1-72.eu-west-2.compute.internal/10.3.1.72:9042 (com.datastax.driver.core.exceptions.ConnectionException: [ip-10-3-1-72.eu-west-2.compute.internal/10.3.1.72:9042] Write attempt on defunct connection))
        at org.apache.cassandra.stress.Operation.timeWithRetry(Operation.java:119)
com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: ip-10-3-1-72.eu-west-2.compute.internal/10.3.1.72:9042 (com.datastax.driver.core.exceptions.ConnectionException: [ip-10-3-1-72.eu-west-2.compute.internal/10.3.1.72:9042] Write attempt on defunct connection))
        at org.apache.cassandra.stress.operations.predefined.CqlOperation.run(CqlOperation.java:101)
com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: ip-10-3-1-72.eu-west-2.compute.internal/10.3.1.72:9042 (com.datastax.driver.core.exceptions.ConnectionException: [ip-10-3-1-72.eu-west-2.compute.internal/10.3.1.72:9042] Write attempt on defunct connection))
        at org.apache.cassandra.stress.operations.predefined.CqlOperation.run(CqlOperation.java:109)
com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: ip-10-3-1-72.eu-west-2.compute.internal/10.3.1.72:9042 (com.datastax.driver.core.exceptions.ConnectionException: [ip-10-3-1-72.eu-west-2.compute.internal/10.3.1.72:9042] Write attempt on defunct connection))
        at org.apache.cassandra.stress.operations.predefined.CqlOperation.run(CqlOperation.java:264)
com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: ip-10-3-1-72.eu-west-2.compute.internal/10.3.1.72:9042 (com.datastax.driver.core.exceptions.ConnectionException: [ip-10-3-1-72.eu-west-2.compute.internal/10.3.1.72:9042] Write attempt on defunct connection))
        at org.apache.cassandra.stress.StressAction$Consumer.run(StressAction.java:473)
com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: ip-10-3-1-72.eu-west-2.compute.internal/10.3.1.72:9042 (com.datastax.driver.core.exceptions.ConnectionException: [ip-10-3-1-72.eu-west-2.compute.internal/10.3.1.72:9042] Write attempt on defunct connection))
java.io.IOException: Operation x10 on key(s) [34354f324c3450503130]: Error executing: (NoHostAvailableException): All host(s) tried for query failed (tried: /10.3.0.133:9042 (com.datastax.driver.core.exceptions.ConnectionException: [/10.3.0.133:9042] Write attempt on defunct connection))
com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: ip-10-3-1-72.eu-west-2.compute.internal/10.3.1.72:9042 (com.datastax.driver.core.exceptions.ConnectionException: [ip-10-3-1-72.eu-west-2.compute.internal/10.3.1.72:9042] Write attempt on defunct connection))

com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: ip-10-3-1-72.eu-west-2.compute.internal/10.3.1.72:9042 (com.datastax.driver.core.exceptions.ConnectionExce
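To see at a glance which endpoints the driver still tried, the exception messages above can be reduced to their distinct `ip:port` pairs. The following is a minimal sketch using only the Java standard library; the class name `TriedHosts` and its `extract` method are hypothetical helpers introduced for illustration, not part of cassandra-stress or the driver.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not part of the driver or cassandra-stress):
// collects the distinct IPv4 endpoints mentioned in driver error messages.
final class TriedHosts {
    // Matches an IPv4 address followed by a port, e.g. "10.3.0.133:9042".
    // Hostnames such as "ip-10-3-1-72.eu-west-2.compute.internal" do not match.
    private static final Pattern ENDPOINT =
            Pattern.compile("(\\d{1,3}(?:\\.\\d{1,3}){3}):(\\d+)");

    // Returns the distinct endpoints in order of first appearance.
    static Set<String> extract(String logText) {
        Set<String> endpoints = new LinkedHashSet<>();
        Matcher m = ENDPOINT.matcher(logText);
        while (m.find()) {
            endpoints.add(m.group(1) + ":" + m.group(2));
        }
        return endpoints;
    }

    public static void main(String[] args) {
        // Message copied from the log above.
        String message = "All host(s) tried for query failed (tried: "
                + "/10.3.0.133:9042 (com.datastax.driver.core.exceptions.ConnectionException: "
                + "[/10.3.0.133:9042] Write attempt on defunct connection), "
                + "ip-10-3-1-72.eu-west-2.compute.internal/10.3.1.72:9042 "
                + "(com.datastax.driver.core.exceptions.ConnectionException: "
                + "[ip-10-3-1-72.eu-west-2.compute.internal/10.3.1.72:9042] "
                + "Write attempt on defunct connection))";
        System.out.println(extract(message));
    }
}
```

Applied to the whole log above, this yields only two endpoints, 10.3.0.133 and 10.3.1.72, which is consistent with the client knowing about only a small subset of nodes at the time of the failure.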

The issue happens periodically (not very often), and attempts to reproduce it or to pin down the exact steps were unsuccessful.

There are a couple of thoughts on why it could have happened:

  1. https://github.com/scylladb/scylladb/issues/15803#issuecomment-1807240060
  2. https://github.com/scylladb/scylladb/issues/15803#issuecomment-1808115818
  3. https://github.com/scylladb/scylladb/issues/15803#issuecomment-1808121072

More details can be found in the issue: https://github.com/scylladb/scylladb/issues/15803

Scylla version (or git commit hash): 5.5.0~dev-20231108.a4aeef2eb0aa

aleksbykov avatar Nov 16 '23 05:11 aleksbykov

Ran into it also while testing 2024.1.

This is the driver version used: scylla-driver-core-3.11.5.0-shaded.jar

This run clearly shows that the driver is the problem, since loader-1 is working fine while loader-2 and loader-3 at some point start failing with:

2023-12-26 11:45:29.708: (CassandraStressLogEvent Severity.CRITICAL) period_type=one-time event_id=8c451dcf-055a-4ffa-b032-c4233b342930 during_nemesis=RunUniqueSequence: type=OperationOnKey regex=Operation x10 on key\(s\) \[ line_number=86310 node=Node longevity-tls-50gb-3d-2024-1-loader-node-a1f8085f-2 [34.220.19.246 | 10.15.2.132] (seed: False)
java.io.IOException: Operation x10 on key(s) [4f4b4b36364d50333330]: Error executing: (NoHostAvailableException): All host(s) tried for query failed (no host was tried)

cassandra-stress-mixed-l0-c0-k1-d63a4c38-f447-400b-968c-bb8edd3e4b40.log is the o.k. log from loader-1

cassandra-stress-mixed-l1-c0-k1-0ade02f4-10c6-4a06-a083-355e1030d932.log is the problematic log from loader-2

We can see that node longevity-tls-50gb-3d-2024-1-db-node-a1f8085f-18 [10.15.1.49], which shows up in the loader-1 log:

WARN  10:07:34,987 Not using advanced port-based shard awareness with /10.15.1.49:9042 because we're missing port-based shard awareness port on the server

does not show up in the loader-2 and loader-3 logs.
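The comparison between loader logs can be automated. Below is a rough sketch, assuming each loader log has been read into a string and that node addresses appear in the `/ip:port` form seen in the WARN line above. The class `LoaderLogDiff` and its methods are hypothetical names introduced for illustration, and the second log line in `main` is an invented sample, not taken from the real loader-2 log.

```java
import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: finds node IPs that one loader's driver log mentions
// and another loader's log does not.
final class LoaderLogDiff {
    // Matches "/10.15.1.49:9042" and captures the IP part.
    private static final Pattern NODE_IP =
            Pattern.compile("/(\\d{1,3}(?:\\.\\d{1,3}){3}):\\d+");

    // Returns every node IP that appears in the given log text.
    static Set<String> nodesSeen(String logText) {
        Set<String> ips = new TreeSet<>();
        Matcher m = NODE_IP.matcher(logText);
        while (m.find()) {
            ips.add(m.group(1));
        }
        return ips;
    }

    // Returns node IPs present in referenceLog but absent from otherLog.
    static Set<String> missingFrom(String referenceLog, String otherLog) {
        Set<String> missing = nodesSeen(referenceLog);
        missing.removeAll(nodesSeen(otherLog));
        return missing;
    }

    public static void main(String[] args) {
        // Line copied from the loader-1 log quoted above.
        String loader1 = "WARN  10:07:34,987 Not using advanced port-based shard awareness "
                + "with /10.15.1.49:9042 because we're missing port-based shard awareness "
                + "port on the server";
        // Invented sample line standing in for the loader-2 log (assumption).
        String loader2 = "WARN  10:00:00,000 Not using advanced port-based shard awareness "
                + "with /10.15.2.48:9042 because we're missing port-based shard awareness "
                + "port on the server";
        System.out.println(missingFrom(loader1, loader2));
    }
}
```

Running `missingFrom` with the loader-1 log as the reference against the loader-2 and loader-3 logs would list the nodes those drivers never connected to, under the assumption that every node the driver connects to logs a line containing its address.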

Installation details

Kernel Version: 5.15.0-1051-aws

Scylla version (or git commit hash): 2024.1.0~rc2-20231217.f57117d9cfe3 with build-id 3a4d2dfe8ef4eef5454badb34d1710a5f36a859c

Cluster size: 6 nodes (i4i.4xlarge)

Scylla Nodes used in this run:

  • longevity-tls-50gb-3d-2024-1-db-node-a1f8085f-9 (52.38.56.231 | 10.15.1.184) (shards: 14)
  • longevity-tls-50gb-3d-2024-1-db-node-a1f8085f-8 (54.184.245.231 | 10.15.0.94) (shards: 14)
  • longevity-tls-50gb-3d-2024-1-db-node-a1f8085f-7 (35.91.177.32 | 10.15.3.202) (shards: 14)
  • longevity-tls-50gb-3d-2024-1-db-node-a1f8085f-6 (54.188.193.5 | 10.15.0.62) (shards: 14)
  • longevity-tls-50gb-3d-2024-1-db-node-a1f8085f-5 (35.93.117.24 | 10.15.2.158) (shards: 14)
  • longevity-tls-50gb-3d-2024-1-db-node-a1f8085f-4 (34.221.12.61 | 10.15.1.91) (shards: 14)
  • longevity-tls-50gb-3d-2024-1-db-node-a1f8085f-3 (54.202.147.61 | 10.15.0.183) (shards: 14)
  • longevity-tls-50gb-3d-2024-1-db-node-a1f8085f-21 (34.221.99.255 | 10.15.2.179) (shards: 14)
  • longevity-tls-50gb-3d-2024-1-db-node-a1f8085f-20 (54.71.94.242 | 10.15.3.43) (shards: 14)
  • longevity-tls-50gb-3d-2024-1-db-node-a1f8085f-2 (34.219.206.169 | 10.15.2.21) (shards: 14)
  • longevity-tls-50gb-3d-2024-1-db-node-a1f8085f-19 (52.12.211.216 | 10.15.0.184) (shards: 14)
  • longevity-tls-50gb-3d-2024-1-db-node-a1f8085f-18 (54.213.52.236 | 10.15.1.49) (shards: 14)
  • longevity-tls-50gb-3d-2024-1-db-node-a1f8085f-17 (35.89.150.163 | 10.15.3.60) (shards: 14)
  • longevity-tls-50gb-3d-2024-1-db-node-a1f8085f-16 (34.213.245.39 | 10.15.0.94) (shards: 14)
  • longevity-tls-50gb-3d-2024-1-db-node-a1f8085f-15 (54.212.78.118 | 10.15.0.4) (shards: 14)
  • longevity-tls-50gb-3d-2024-1-db-node-a1f8085f-14 (52.13.53.69 | 10.15.0.159) (shards: -1)
  • longevity-tls-50gb-3d-2024-1-db-node-a1f8085f-13 (35.163.200.60 | 10.15.0.172) (shards: 14)
  • longevity-tls-50gb-3d-2024-1-db-node-a1f8085f-12 (35.87.202.22 | 10.15.3.10) (shards: 14)
  • longevity-tls-50gb-3d-2024-1-db-node-a1f8085f-11 (34.217.48.78 | 10.15.2.28) (shards: 14)
  • longevity-tls-50gb-3d-2024-1-db-node-a1f8085f-10 (35.163.61.87 | 10.15.3.200) (shards: 14)
  • longevity-tls-50gb-3d-2024-1-db-node-a1f8085f-1 (34.219.243.15 | 10.15.2.48) (shards: 14)

OS / Image: ami-0de15c927866f9ebe (aws: undefined_region)

Test: longevity-50gb-3days-test

Test id: a1f8085f-d941-4e43-95a0-ffd6f8beea66

Test name: enterprise-2024.1/longevity/longevity-50gb-3days-test

Test config file(s):

Logs and commands
  • Restore Monitor Stack command: $ hydra investigate show-monitor a1f8085f-d941-4e43-95a0-ffd6f8beea66
  • Restore monitor on AWS instance using Jenkins job
  • Show all stored logs command: $ hydra investigate show-logs a1f8085f-d941-4e43-95a0-ffd6f8beea66

Logs:

Jenkins job URL Argus

fruch avatar Dec 26 '23 21:12 fruch

@roydahan this was opened more than a month ago, but no one has looked at it.

fruch avatar Dec 26 '23 21:12 fruch

@Bouncheck can you please check this out?

roydahan avatar Dec 26 '23 23:12 roydahan

@avelanarius we need someone to look at it, this is failing in 2024.1 and we need to know why and how severe it is.

roydahan avatar Jan 01 '24 10:01 roydahan

@Bouncheck can you please check this out?

Sorry, I was out of office before. I'll check this out.

Bouncheck avatar Jan 03 '24 00:01 Bouncheck