python-driver

test_request_too_large[SSL] fails with NoHostAvailable

Open sylwiaszunejko opened this issue 2 months ago • 5 comments

It started happening after the move to the Python 3.14 Docker image, and it affects only the SSL variant of this test.

Refs: https://github.com/scylladb/scylladb/issues/27300

sylwiaszunejko avatar Dec 15 '25 14:12 sylwiaszunejko

Looking at the detailed logs, I see that the system.peers results were empty:

09:38:23,694 cassandra.cluster DEBUG dtest_setup.py :497  | [control connection] Removing host not found in peers metadata: <Host: 127.0.17.1:9042 datacenter1>
09:38:23,694 cassandra.cluster DEBUG dtest_setup.py :497  | [control connection] Finished fetching ring info
09:38:23,694 cassandra.cluster DEBUG dtest_setup.py :497  | [control connection] Rebuilding token map due to topology changes
09:38:23,698 cassandra.cluster DEBUG dtest_setup.py :497  | Control connection created
09:38:23,699 cassandra.pool  DEBUG thread.py      :73   | Initializing connection for host 127.0.17.1:9042
09:38:23,704 cassandra.connection DEBUG libevreactor.py:272  | Connection 140136824022800: '('127.0.0.1', 40786)' -> '('127.0.17.1', 9042)'
09:38:23,704 cassandra.connection DEBUG libevreactor.py:279  | Sending initial options message for new connection (140136824022800) to 127.0.17.1:9042
09:38:23,704 cassandra.protocol_features DEBUG protocol_features.py:62   | Parsing sharding info from message options {'COMPRESSION': ['lz4', 'snappy'], 'SCYLLA_LWT_ADD_METADATA_MARK': ['LWT_OPTIMIZATION_META_BIT_MASK=2147483648'], 'SCYLLA_NR_SHARDS': ['2'], 'SCYLLA_PARTITIONER': ['org.apache.cassandra.dht.Murmur3Partitioner'], 'SCYLLA_RATE_LIMIT_ERROR': ['ERROR_CODE=61440'], 'SCYLLA_SHARD': ['0'], 'SCYLLA_SHARDING_ALGORITHM': ['biased-token-round-robin'], 'SCYLLA_SHARDING_IGNORE_MSB': ['12'], 'SCYLLA_SHARD_AWARE_PORT': ['19042'], 'SCYLLA_USE_METADATA_ID': [''], 'TABLETS_ROUTING_V1': ['']}
09:38:23,704 cassandra.connection DEBUG libevreactor.py:376  | Received options response on new connection (140136824022800) from 127.0.17.1:9042
09:38:23,704 cassandra.connection ERROR libevreactor.py:376  | No available compression types supported on both ends. locally supported: odict_keys([]). remotely supported: ['lz4', 'snappy']
09:38:23,704 cassandra.connection DEBUG libevreactor.py:376  | Sending StartupMessage on <LibevConnection(140136824022800) 127.0.17.1:9042>
09:38:23,704 cassandra.connection DEBUG libevreactor.py:376  | Sent StartupMessage on <LibevConnection(140136824022800) 127.0.17.1:9042>
09:38:23,705 cassandra.connection DEBUG libevreactor.py:376  | Got ReadyMessage on new connection (140136824022800) from 127.0.17.1:9042
09:38:23,705 cassandra.pool  DEBUG thread.py      :73   | First connection created to 127.0.17.1:9042 for shard_id=0
09:38:23,705 cassandra.pool  DEBUG thread.py      :73   | Finished initializing connection for host 127.0.17.1:9042
09:38:23,705 cassandra.cluster DEBUG thread.py      :73   | Added pool for host 127.0.17.1:9042 to session
09:38:23,705 cassandra.pool  DEBUG thread.py      :73   | shard_aware_endpoint=<DefaultEndPoint: 127.0.17.1:19042>
09:38:23,705 cassandra.cluster DEBUG dtest_setup.py :497  | Not starting MonitorReporter thread for Insights; not supported by server version 3.0.8 on ControlConnection host 127.0.17.1:9042
09:38:23,705 cassandra.cluster DEBUG dtest_setup.py :497  | Started Session with client_id df5bc982-cca4-48e9-8939-e24ddc034b46 and session_id c7e27a42-894c-4b1c-8529-fcd4ad07861a
09:38:23,706 cassandra.connection DEBUG libevreactor.py:272  | connection (140136824023120) port=62827 should be shard_id=1
09:38:23,706 limits_test     INFO  limits_test.py :199  | Trying to send a regular request to database (insert a string of regular size into table)...
09:38:23,712 cassandra.connection DEBUG libevreactor.py:272  | Connection 140136824023120: '('127.0.0.1', 62827)' -> '('127.0.17.1', 19042)'
09:38:23,713 cassandra.connection DEBUG libevreactor.py:279  | Sending initial options message for new connection (140136824023120) to 127.0.17.1:19042
09:38:23,716 cassandra.connection DEBUG libevreactor.py:333  | Defuncting connection (140136824022800) to 127.0.17.1:9042:
Traceback (most recent call last):
  File "/usr/local/lib/python3.14/site-packages/cassandra/io/libevreactor.py", line 324, in handle_write
    sent = self._socket.send(next_msg)
  File "/usr/local/lib/python3.14/ssl.py", line 1232, in send
    return self._sslobj.write(data)
           ~~~~~~~~~~~~~~~~~~^^^^^^
BrokenPipeError: [Errno 32] Broken pipe
09:38:23,716 cassandra.io.libevreactor DEBUG libevreactor.py:292  | Closing connection (140136824022800) to 127.0.17.1:9042
09:38:23,717 cassandra.io.libevreactor DEBUG libevreactor.py:296  | Closed socket to 127.0.17.1:9042
09:38:23,717 cassandra.pool  DEBUG libevreactor.py:333  | Defunct or closed connection (140136824022800) returned to pool, potentially marking host 127.0.17.1:9042 as down
09:38:23,717 cassandra.pool  DEBUG libevreactor.py:333  | Shutting down connections to 127.0.17.1:9042
09:38:23,717 cassandra.pool  DEBUG libevreactor.py:333  | Closing connection (140136824022800) to 127.0.17.1:9042
09:38:23,717 cassandra.pool  DEBUG libevreactor.py:333  | Closing pending connection (140136824023120) to 127.0.17.1:9042
09:38:23,717 cassandra.io.libevreactor DEBUG libevreactor.py:292  | Closing connection (140136824023120) to 127.0.17.1:19042
09:38:23,717 cassandra.io.libevreactor DEBUG libevreactor.py:296  | Closed socket to 127.0.17.1:19042
09:38:23,717 cassandra.cluster WARNING libevreactor.py:333  | Host 127.0.17.1:9042 has been marked down
09:38:23,717 cassandra.pool  DEBUG thread.py      :73   | Received a connection 140136824023120 for shard_id=1 on host 127.0.17.1:9042
09:38:23,717 cassandra.cluster DEBUG thread.py      :73   | [control connection] Control connection host (127.0.17.1:9042) is considered down, starting reconnection
09:38:23,717 cassandra.pool  DEBUG thread.py      :73   | Pool for host 127.0.17.1:9042 is in shutdown, closing the new connection (140136824023120)
---------------------------- Captured log teardown -----------------------------
09:38:23,761 cassandra.cluster DEBUG conftest.py    :322  | Shutting down Cluster Scheduler
09:38:23,785 cassandra.cluster DEBUG conftest.py    :322  | Shutting down control connection
09:38:23,785 cassandra.io.libevreactor DEBUG libevreactor.py:292  | Closing connection (140136824018640) to 127.0.17.1:9042
09:38:23,785 cassandra.io.libevreactor DEBUG libevreactor.py:296  | Closed socket to 127.0.17.1:9042
09:38:23,786 cassandra.cluster DEBUG conftest.py    :322  | Shutting down Cluster Scheduler
09:38:23,819 cassandra.cluster DEBUG conftest.py    :322  | Shutting down control connection
09:38:23,819 cassandra.io.libevreactor DEBUG libevreactor.py:292  | Closing connection (140136824026320) to 127.0.17.1:9042
09:38:23,820 cassandra.io.libevreactor DEBUG libevreactor.py:296  | Closed socket to 127.0.17.1:9042
09:38:26,825 dtest_setup     DEBUG dtest_setup.py :630  | exclude_errors: []
09:38:26,835 dtest_setup     DEBUG dtest_setup.py :729  | removing ccm cluster test at: /home/sylwiaszunejko/.dtest/dtest-v0lgeo3u
09:38:26,844 dtest_setup     DEBUG dtest_setup.py :732  | clearing ssl stores from [/home/sylwiaszunejko/.dtest/dtest-v0lgeo3u] directory
09:38:26,845 dtest_setup     DEBUG dtest_setup.py :85   | Freeing cluster ID 17: link /home/sylwiaszunejko/.dtest/17

sylwiaszunejko avatar Dec 16 '25 11:12 sylwiaszunejko

@sylwiaszunejko , please update master issue - https://github.com/scylladb/python-driver/issues/619, not here

dkropachev avatar Dec 16 '25 11:12 dkropachev

It is also in line with what I observed in https://github.com/scylladb/python-driver/issues/616#issuecomment-3642916203

dkropachev avatar Dec 16 '25 11:12 dkropachev

Actually, I was too quick to jump to a conclusion about this line. It is expected in this case because of the logic in python-driver: when resolving contact points, the driver first assigns each host a random host_id, and once the proper id arrives from the server, the duplicate entry with the wrong id has to be removed.
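To illustrate (this is a hedged sketch of the pattern described above, not the actual driver code — the class and method names here are made up): a contact point is first registered under a random placeholder host_id, and once `system.local`/`system.peers` metadata reports the real id for that address, the placeholder entry is dropped so the host is not duplicated.

```python
import uuid

class HostRegistry:
    """Toy model of the host-tracking pattern described above."""

    def __init__(self):
        self.hosts = {}  # host_id -> address

    def add_contact_point(self, address):
        # Before the control connection is up, the real host_id is
        # unknown, so a random placeholder UUID is used.
        placeholder = uuid.uuid4()
        self.hosts[placeholder] = address
        return placeholder

    def apply_peers_metadata(self, real_id, address):
        # Once the server reports the real host_id, remove any
        # placeholder entry for the same address to avoid a duplicate.
        stale = [hid for hid, addr in self.hosts.items()
                 if addr == address and hid != real_id]
        for hid in stale:
            del self.hosts[hid]
        self.hosts[real_id] = address

registry = HostRegistry()
registry.add_contact_point("127.0.17.1:9042")
registry.apply_peers_metadata(uuid.UUID(int=1), "127.0.17.1:9042")
# Only the server-provided id remains; the random placeholder is gone.
assert list(registry.hosts) == [uuid.UUID(int=1)]
```

This matches the "Removing host not found in peers metadata" line in the log above being benign: it is the placeholder being cleaned up, not a real host disappearing.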

sylwiaszunejko avatar Dec 16 '25 12:12 sylwiaszunejko

I ran into this a lot, e.g. in https://argus.scylladb.com/tests/scylla-cluster-tests/4db162e9-ee42-4fca-999e-74c6a6a31166 (last 5 runs) and https://argus.scylladb.com/tests/scylla-cluster-tests/021d804d-6f55-4cbe-87c0-cc5a64e04973 (last 2 runs):

Command: '/usr/bin/cqlsh --no-color   --request-timeout=120 --connect-timeout=60 --ssl -e "describe keyspaces" 10.12.11.25'
Exit code: 1
Stdout:
Stderr:
Connection error: ('Unable to connect to any servers', {'10.12.11.25:9042': OperationTimedOut('errors=Timed out creating connection (60 seconds), last_host=None')})

But in a quick test (https://argus.scylladb.com/tests/scylla-cluster-tests/e81a636c-1d78-4771-88fa-40c870f00dce) with configurations/disable_client_encrypt.yaml, which makes the commands run without the --ssl argument, the issue no longer happens:

< t:2025-12-18 15:49:52,312 f:remote_base.py  l:650  c:RemoteLibSSH2CmdRunner p:DEBUG > <10.12.3.94>: Running command "/usr/bin/cqlsh --no-color   --request-timeout=120 --connect-timeout=60  -e "describe keyspaces" 10.12.3.94"...
< t:2025-12-18 15:49:54,815 f:base.py         l:255  c:RemoteLibSSH2CmdRunner p:DEBUG > <10.12.3.94>: 
< t:2025-12-18 15:49:54,815 f:base.py         l:255  c:RemoteLibSSH2CmdRunner p:DEBUG > <10.12.3.94>: keyspace1  system_auth         system_distributed_everywhere  system_schema
< t:2025-12-18 15:49:54,815 f:base.py         l:255  c:RemoteLibSSH2CmdRunner p:DEBUG > <10.12.3.94>: system     system_distributed  system_replicated_keys         system_traces
< t:2025-12-18 15:49:54,815 f:base.py         l:255  c:RemoteLibSSH2CmdRunner p:DEBUG > <10.12.3.94>: 
< t:2025-12-18 15:49:54,816 f:base.py         l:153  c:RemoteLibSSH2CmdRunner p:DEBUG > <10.12.3.94>: Command "/usr/bin/cqlsh --no-color   --request-timeout=120 --connect-timeout=60  -e "describe keyspaces" 10.12.3.94" finished with status 0

cezarmoise avatar Dec 18 '25 16:12 cezarmoise