Driver reported "[Errno 9] Bad file descriptor"

Open timtimb0t opened this issue 2 months ago • 9 comments

Argus

Scylla version: 2026.1.0~dev-20251205.866c96f536b0 with build-id 2c38506085b888e1baa43f81d05dab12df5132c1

During the latest master runs, the driver reported the following error:

< t:2025-12-06 04:21:39,017 f:cluster.py      l:3723 c:cassandra.cluster    p:WARNING > [control connection] Error connecting to 10.12.33.86:9042:
< t:2025-12-06 04:21:39,017 f:cluster.py      l:3723 c:cassandra.cluster    p:WARNING > Traceback (most recent call last):
< t:2025-12-06 04:21:39,017 f:cluster.py      l:3723 c:cassandra.cluster    p:WARNING >   File "cassandra/cluster.py", line 3546, in cassandra.cluster.ControlConnection._connect_host_in_lbp
< t:2025-12-06 04:21:39,017 f:cluster.py      l:3723 c:cassandra.cluster    p:WARNING >   File "cassandra/cluster.py", line 3662, in cassandra.cluster.ControlConnection._try_connect
< t:2025-12-06 04:21:39,017 f:cluster.py      l:3723 c:cassandra.cluster    p:WARNING >   File "cassandra/cluster.py", line 3646, in cassandra.cluster.ControlConnection._try_connect
< t:2025-12-06 04:21:39,017 f:cluster.py      l:3723 c:cassandra.cluster    p:WARNING > cassandra.connection.ConnectionShutdown: [Errno 9] Bad file descriptor
< t:2025-12-06 04:21:39,017 f:cluster.py      l:3723 c:cassandra.cluster    p:WARNING > Host 10.12.33.86:9042 has been marked down

It seems that these errors appeared each time one of the nodes was down.

Also spotted here: https://argus.scylladb.com/tests/scylla-cluster-tests/a8cd6873-19c1-49c1-ab5a-dca25655ed6c

Kernel Version: 6.14.0-1017-aws

Extra information

Installation details

Cluster size: 6 nodes (i7i.4xlarge)

Scylla Nodes used in this run:

- longevity-tls-50gb-3d-master-db-node-38f90182-1 (3.228.203.95 | 10.12.35.225) (shards: 14)


- longevity-tls-50gb-3d-master-db-node-38f90182-2 (44.213.201.240 | 10.12.32.118) (shards: 14)


- longevity-tls-50gb-3d-master-db-node-38f90182-3 (100.30.78.169 | 10.12.33.86) (shards: 14)


- longevity-tls-50gb-3d-master-db-node-38f90182-4 (44.207.141.103 | 10.12.35.59) (shards: -1)


- longevity-tls-50gb-3d-master-db-node-38f90182-5 (3.219.68.68 | 10.12.35.232) (shards: 14)


- longevity-tls-50gb-3d-master-db-node-38f90182-6 (34.199.164.159 | 10.12.33.159) (shards: 14)


- longevity-tls-50gb-3d-master-db-node-38f90182-7 (98.82.213.102 | 10.12.33.203) (shards: 14)


- longevity-tls-50gb-3d-master-db-node-38f90182-8 (98.83.182.28 | 10.12.34.200) (shards: 14)

OS / Image: ami-0810c73586fe68036 (aws: N/A)

Test: longevity-50gb-3days-test Test id: 38f90182-547d-4b60-973c-7e826b926708 Test name: scylla-master/tier1/longevity-50gb-3days-test

Test method: longevity_test.LongevityTest.test_custom_time

Test config file(s):

Logs:

Jenkins job URL

timtimb0t avatar Dec 08 '25 16:12 timtimb0t

Hi @dkropachev , could you please take a look at this issue?

timtimb0t avatar Dec 08 '25 16:12 timtimb0t

reproduced again:

Argus

Scylla version: 2026.1.0~dev-20251211.f7ffa395a8fd with build-id 6ed9dbb170d6894329ed88a93e118dd68cbd62a9

Kernel Version: 6.14.0-1018-aws

Extra information

Installation details

Cluster size: 6 nodes (i7i.2xlarge)

Scylla Nodes used in this run:

- longevity-50gb-12h-master-db-node-70283809-1 (13.218.127.161 | 10.12.9.87) (shards: 4)


- longevity-50gb-12h-master-db-node-70283809-2 (54.91.187.43 | 10.12.8.66) (shards: 4)


- longevity-50gb-12h-master-db-node-70283809-3 (18.212.86.250 | 10.12.8.228) (shards: 6)


- longevity-50gb-12h-master-db-node-70283809-4 (54.92.211.214 | 10.12.10.198) (shards: 6)


- longevity-50gb-12h-master-db-node-70283809-5 (98.84.134.241 | 10.12.8.104) (shards: 5)


- longevity-50gb-12h-master-db-node-70283809-6 (54.160.211.56 | 10.12.9.171) (shards: 4)

OS / Image: ami-02ad235f4c4336f6c (aws: N/A)

Test: longevity-150gb-asymmetric-cluster-12h-test Test id: 70283809-37aa-4be5-9ebc-d891e1a2d6aa Test name: scylla-master/tier1/longevity-150gb-asymmetric-cluster-12h-test

Test method: longevity_test.LongevityTest.test_custom_time

Test config file(s):

Logs:

Jenkins job URL

timtimb0t avatar Dec 15 '25 10:12 timtimb0t

Argus

Scylla version: 2026.1.0~dev-20251219.f65db4e8eba5 with build-id 683ff5b7a4a313ea6094e72fd639c906693ece37

Kernel Version: 6.14.0-1018-aws

Extra information

Installation details

Cluster size: 6 nodes (i7i.4xlarge)

Scylla Nodes used in this run:

- longevity-tls-50gb-3d-master-db-node-c6beb17a-1 (98.87.193.30 | 10.12.35.220) (shards: 14)


- longevity-tls-50gb-3d-master-db-node-c6beb17a-2 (52.6.69.201 | 10.12.34.173) (shards: 14)


- longevity-tls-50gb-3d-master-db-node-c6beb17a-3 (100.49.143.61 | 10.12.32.22) (shards: 14)


- longevity-tls-50gb-3d-master-db-node-c6beb17a-4 (52.203.20.179 | 10.12.34.56) (shards: 14)


- longevity-tls-50gb-3d-master-db-node-c6beb17a-5 (44.193.182.223 | 10.12.32.49) (shards: 14)


- longevity-tls-50gb-3d-master-db-node-c6beb17a-6 (50.17.245.62 | 10.12.34.86) (shards: 14)


- longevity-tls-50gb-3d-master-db-node-c6beb17a-7 (44.209.62.120 | 10.12.32.166) (shards: 14)


- longevity-tls-50gb-3d-master-db-node-c6beb17a-8 (54.152.201.38 | 10.12.33.224) (shards: 14)


- longevity-tls-50gb-3d-master-db-node-c6beb17a-9 (98.95.22.145 | 10.12.33.230) (shards: 14)


- longevity-tls-50gb-3d-master-db-node-c6beb17a-10 (3.231.75.179 | 10.12.32.60) (shards: 14)


- longevity-tls-50gb-3d-master-db-node-c6beb17a-11 (100.49.20.125 | 10.12.33.136) (shards: -1)


- longevity-tls-50gb-3d-master-db-node-c6beb17a-12 (3.215.138.198 | 10.12.34.73) (shards: 14)

OS / Image: ami-048249cf3c5bfc84f (aws: N/A)

Test: longevity-50gb-3days-test Test id: c6beb17a-d0b9-43b6-ad05-2fbd45c4201d Test name: scylla-master/tier1/longevity-50gb-3days-test

Test method: longevity_test.LongevityTest.test_custom_time

Test config file(s):

Logs:

Jenkins job URL

timtimb0t avatar Dec 22 '25 11:12 timtimb0t

Argus

Scylla version: 2026.1.0~dev-20251219.f65db4e8eba5 with build-id 683ff5b7a4a313ea6094e72fd639c906693ece37

Kernel Version: 6.14.0-1018-aws

Extra information

Installation details

Cluster size: 6 nodes (i7i.2xlarge)

Scylla Nodes used in this run:

- longevity-50gb-12h-master-db-node-e2d8a05c-1 (34.201.94.66 | 10.12.8.210) (shards: 6)


- longevity-50gb-12h-master-db-node-e2d8a05c-2 (18.234.51.11 | 10.12.8.248) (shards: 6)


- longevity-50gb-12h-master-db-node-e2d8a05c-3 (52.54.112.48 | 10.12.10.136) (shards: 5)


- longevity-50gb-12h-master-db-node-e2d8a05c-4 (13.222.190.127 | 10.12.11.124) (shards: 7)


- longevity-50gb-12h-master-db-node-e2d8a05c-5 (54.145.225.121 | 10.12.10.222) (shards: 7)


- longevity-50gb-12h-master-db-node-e2d8a05c-6 (18.208.221.26 | 10.12.8.44) (shards: 4)

OS / Image: ami-048249cf3c5bfc84f (aws: N/A)

Test: longevity-150gb-asymmetric-cluster-12h-test Test id: e2d8a05c-55b0-4025-b3bb-00712401b844 Test name: scylla-master/tier1/longevity-150gb-asymmetric-cluster-12h-test

Test method: longevity_test.LongevityTest.test_custom_time

Test config file(s):

Logs:

Jenkins job URL

timtimb0t avatar Dec 22 '25 11:12 timtimb0t

Reproduced in https://argus.scylladb.com/tests/scylla-cluster-tests/ccf51876-4d31-48dd-b266-0b83cca6c8fb and https://argus.scylladb.com/tests/scylla-cluster-tests/85780def-b7ac-406a-b801-6608dab8a5d3

timtimb0t avatar Dec 29 '25 10:12 timtimb0t

The following is happening:

  1. The connection is force-closed for some reason.
  2. In parallel, other parts of the driver read from or write to that connection. Since the socket has been closed, any operation on it ends in "Bad file descriptor".
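The two steps above can be reproduced outside the driver. This is a minimal sketch, assuming a POSIX system (on Windows the error code differs); the function name `errno_after_close` is made up for illustration and is not part of the driver. It closes one end of a socket pair and then tries to write to it, the way another part of the driver might.

```python
import errno
import socket


def errno_after_close():
    """Close a socket, then write to it, and return the resulting errno."""
    left, right = socket.socketpair()
    left.close()  # step 1: the connection is force-closed
    try:
        left.send(b"ping")  # step 2: another code path still uses the socket
    except OSError as exc:
        return exc.errno
    finally:
        right.close()
    return None


if __name__ == "__main__":
    print(errno.errorcode[errno_after_close()])
```

The error carries no hint of why the socket was closed in the first place, which matches what the log above shows.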

Unfortunately, the way the driver handles this case means the initial reason the connection was closed is lost over time, and we can recover it only from the logs. So either we pick it up from the logs, or we need to add code that persists the reason the connection was closed and raises a proper message when a socket operation fails.
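A rough sketch of the second option, not the driver's actual code: `TrackedConnection` and `ConnectionClosedError` are hypothetical names used only for illustration. The idea is to record the close reason before closing the socket, and to attach that reason when a later socket operation fails.

```python
import socket
import threading


class ConnectionClosedError(Exception):
    """Hypothetical error that carries the original close reason."""


class TrackedConnection:
    """Hypothetical wrapper that remembers why it was closed."""

    def __init__(self, sock):
        self._sock = sock
        self._lock = threading.Lock()
        self._close_reason = None

    def close(self, reason):
        with self._lock:
            if self._close_reason is None:
                # Record the reason first, so a concurrent failure can see it.
                self._close_reason = reason
                self._sock.close()

    def send(self, data):
        try:
            return self._sock.send(data)
        except OSError as exc:
            with self._lock:
                reason = self._close_reason
            if reason is None:
                raise  # not closed by us: keep the original error
            raise ConnectionClosedError(
                f"connection was closed: {reason}"
            ) from exc


left, right = socket.socketpair()
conn = TrackedConnection(left)
conn.close("heartbeat timed out")
try:
    conn.send(b"ping")
except ConnectionClosedError as exc:
    message = str(exc)
    cause = exc.__cause__
right.close()
```

With this shape, the caller sees the recorded reason, while the original OSError stays available as the exception's cause.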

dkropachev avatar Dec 29 '25 12:12 dkropachev

What is surfacing it now? It doesn't sound like a new flow in the driver. Python 3.14?

fruch avatar Dec 29 '25 15:12 fruch

Absolutely not. I don't think it is a Python 3.14 issue; we need to dig into it to come up with decent clues.

dkropachev avatar Dec 29 '25 16:12 dkropachev

Reproduced, during disrupt_serial_restart_elected_topology_coordinator and disrupt_kill_mv_building_coordinator

Argus

Scylla version: 2026.1.0~dev-20260101.6c8ddfc018df with build-id a6c13b1f1c32f12209df2d88746d46ad87d6a234

Kernel Version: 6.14.0-1018-aws

Extra information

Installation details

Cluster size: 6 nodes (i7i.4xlarge)

Scylla Nodes used in this run:

- longevity-tls-50gb-3d-master-db-node-ebcdbea0-1 (3.94.50.66 | 10.12.34.81) (shards: 14)


- longevity-tls-50gb-3d-master-db-node-ebcdbea0-2 (98.80.16.221 | 10.12.32.59) (shards: 14)


- longevity-tls-50gb-3d-master-db-node-ebcdbea0-3 (44.209.1.35 | 10.12.35.218) (shards: 14)


- longevity-tls-50gb-3d-master-db-node-ebcdbea0-4 (100.28.64.36 | 10.12.34.248) (shards: 14)


- longevity-tls-50gb-3d-master-db-node-ebcdbea0-5 (100.50.211.251 | 10.12.34.30) (shards: 14)


- longevity-tls-50gb-3d-master-db-node-ebcdbea0-6 (52.200.42.48 | 10.12.35.28) (shards: 14)


- longevity-tls-50gb-3d-master-db-node-ebcdbea0-7 (34.225.99.38 | 10.12.32.159) (shards: 14)


- longevity-tls-50gb-3d-master-db-node-ebcdbea0-8 (3.222.145.143 | 10.12.32.146) (shards: 14)


- longevity-tls-50gb-3d-master-db-node-ebcdbea0-9 (98.89.139.56 | 10.12.35.173) (shards: 14)


- longevity-tls-50gb-3d-master-db-node-ebcdbea0-10 (52.86.61.173 | 10.12.34.55) (shards: 14)


- longevity-tls-50gb-3d-master-db-node-ebcdbea0-11 (3.219.7.100 | 10.12.35.29) (shards: 14)


- longevity-tls-50gb-3d-master-db-node-ebcdbea0-12 (54.235.70.200 | 10.12.32.235) (shards: 14)


- longevity-tls-50gb-3d-master-db-node-ebcdbea0-13 (18.211.116.191 | 10.12.35.96) (shards: -1)


- longevity-tls-50gb-3d-master-db-node-ebcdbea0-14 (100.30.149.227 | 10.12.34.14) (shards: 14)


- longevity-tls-50gb-3d-master-db-node-ebcdbea0-15 (184.73.237.224 | 10.12.35.88) (shards: 14)


- longevity-tls-50gb-3d-master-db-node-ebcdbea0-16 (54.164.126.255 | 10.12.35.31) (shards: 14)

OS / Image: ami-06471fb71c6e86b19 (aws: N/A)

Test: longevity-50gb-3days-test Test id: ebcdbea0-bc81-4521-b136-57391821385d Test name: scylla-master/tier1/longevity-50gb-3days-test

Test method: longevity_test.LongevityTest.test_custom_time

Test config file(s):

Logs:

Jenkins job URL

cezarmoise avatar Jan 05 '26 15:01 cezarmoise

Argus

Scylla version: 2026.1.0~rc0-20260125.f94296e0ae43 with build-id 9680213fda6f301234c43da8ca27e47953987cd8

Kernel Version: 6.14.0-1018-aws

Extra information

Installation details

Cluster size: 6 nodes (i4i.4xlarge)

Scylla Nodes used in this run:

- longevity-100gb-4h-2026-1-db-node-ae55afa4-1 (18.214.100.191 | 10.12.8.254) (shards: 14)


- longevity-100gb-4h-2026-1-db-node-ae55afa4-2 (98.93.132.101 | 10.12.11.251) (shards: 14)


- longevity-100gb-4h-2026-1-db-node-ae55afa4-3 (100.31.91.66 | 10.12.9.119) (shards: 14)


- longevity-100gb-4h-2026-1-db-node-ae55afa4-4 (54.196.137.234 | 10.12.8.236) (shards: 14)


- longevity-100gb-4h-2026-1-db-node-ae55afa4-5 (54.90.78.63 | 10.12.10.173) (shards: 14)


- longevity-100gb-4h-2026-1-db-node-ae55afa4-6 (13.220.180.103 | 10.12.10.163) (shards: 14)


- longevity-100gb-4h-2026-1-db-node-ae55afa4-7 (34.224.86.8 | 10.12.8.121) (shards: 14)


- longevity-100gb-4h-2026-1-db-node-ae55afa4-8 (54.242.91.143 | 10.12.11.54) (shards: -1)

OS / Image: ami-041ecb6271ecc1499 (aws: N/A)

Test: longevity-100gb-4h-test Test id: ae55afa4-98cc-434a-8cfb-5d7738aba978 Test name: scylla-2026.1/longevity/longevity-100gb-4h-test

Test method: longevity_test.LongevityTest.test_custom_time

Test config file(s):

Logs:

Jenkins job URL

timtimb0t avatar Jan 27 '26 11:01 timtimb0t

@roydahan / @dkropachev please assign

bhalevy avatar Jan 29 '26 08:01 bhalevy

done

dkropachev avatar Jan 29 '26 18:01 dkropachev