test_request_too_large[SSL] gets stuck in an endless loop of trying to connect to scylla
The test goes into a loop like the following:
01:34:30,919 790 cassandra.cluster WARNING thread.py :58 | test_request_too_large[SSL]: Node 127.0.46.1:9042 is reporting a schema disagreement: None
01:34:59,357 790 cassandra.connection WARNING threading.py :1038 | test_request_too_large[SSL]: Heartbeat failed for connection (140630522150416) to 127.0.46.1:9042
01:35:04,358 790 cassandra.cluster WARNING thread.py :58 | test_request_too_large[SSL]: [control connection] Error connecting to 127.0.46.1:9042:
Traceback (most recent call last):
File "cassandra/cluster.py", line 3607, in cassandra.cluster.ControlConnection._connect_host_in_lbp
File "cassandra/cluster.py", line 3653, in cassandra.cluster.ControlConnection._try_connect
File "cassandra/cluster.py", line 1689, in cassandra.cluster.Cluster.connection_factory
File "cassandra/connection.py", line 885, in cassandra.connection.Connection.factory
cassandra.OperationTimedOut: errors=Timed out creating connection (5 seconds), last_host=None
01:35:09,360 790 cassandra.cluster WARNING thread.py :58 | test_request_too_large[SSL]: [control connection] Error connecting to 127.0.46.1:9042:
Traceback (most recent call last):
File "cassandra/cluster.py", line 3607, in cassandra.cluster.ControlConnection._connect_host_in_lbp
File "cassandra/cluster.py", line 3653, in cassandra.cluster.ControlConnection._try_connect
File "cassandra/cluster.py", line 1689, in cassandra.cluster.Cluster.connection_factory
File "cassandra/connection.py", line 885, in cassandra.connection.Connection.factory
cassandra.OperationTimedOut: errors=Timed out creating connection (5 seconds), last_host=None
01:35:14,361 790 cassandra.cluster WARNING thread.py :58 | test_request_too_large[SSL]: [control connection] Error connecting to 127.0.46.1:9042:
Traceback (most recent call last):
File "cassandra/cluster.py", line 3607, in cassandra.cluster.ControlConnection._connect_host_in_lbp
File "cassandra/cluster.py", line 3653, in cassandra.cluster.ControlConnection._try_connect
File "cassandra/cluster.py", line 1689, in cassandra.cluster.Cluster.connection_factory
File "cassandra/connection.py", line 885, in cassandra.connection.Connection.factory
cassandra.OperationTimedOut: errors=Timed out creating connection (5 seconds), last_host=None
I can see it locally (but not 100% of the time); I'm trying to bisect whether a specific scylla commit introduced it.
reproducer:
pytest limits_test.py::TestLimits::test_request_too_large -k "[SSL" --scylla-version=unstable/master:2023-09-26T21:50:54Z --log-cli-level=debug -n 2 --count=10
Reproduces on:
- 5.4.0~dev-0.20230924.5e8b7a9cef2d with build-id cd4b337695d964e34e73439c81966ef6edd14ee9
- 5.4.0~dev-0.20230925.6bd489e9507b with build-id 77c377eeb37bd43190206d4d83d74b9a399d86b3
- 2023.1.1
Seems like it's an issue that can happen regardless of the scylla version.
I can't repro this, neither locally (enterprise) nor in docker (2023.1). Do we have CI runs where this happened?
We are seeing it on master, and as I wrote above, with enough runs and parallelism of the same test it reproduces locally for me.
It happened on: https://jenkins.scylladb.com/job/scylla-master/job/dtest-release/386/ through https://jenkins.scylladb.com/job/scylla-master/job/dtest-release/389/
It's a bit hard to see on CI, since it causes the whole test run to time out, but here you can see the node log for it: https://jenkins.scylladb.com/job/scylla-master/job/dtest-release/389/artifact/logs-full.release.000/orphaned/dtest-2i5lu3jo/
Well, the log indicates that the CQL connection gets an ENOTCONN
on the scylla side, right after the new table is created. This also happens when the test succeeds, and should maybe be considered part of the CONNRESET logic...
Why the next connection attempt fails is a mystery. I am leaning towards the client side being the issue, but as I still cannot repro this in either relocated or local env, I really don't know.
I am also perplexed by the:
Node 127.0.46.1:9042 is reporting a schema disagreement: None
in your/the failing log. This is not expected in the log.
I've run it locally a few times, and my conclusion is that it's a driver bug.
failed one:
21:43:31,906 69 cassandra.pool DEBUG thread.py :58 | test_request_too_large[SSL-13-15]: Received a connection 140560399457232 for shard_id=0 on host 127.0.88.1:9042
21:43:31,910 69 cassandra.pool DEBUG thread.py :58 | test_request_too_large[SSL-13-15]: New connection (140560399457232) created to shard_id=0 on host 127.0.88.1:9042
21:43:32,119 69 cassandra.io.libevreactor DEBUG libevreactor.py :372 | test_request_too_large[SSL-13-15]: Connection <LibevConnection(140560398787984) 127.0.88.1:9042> closed by server
21:43:32,119 69 cassandra.io.libevreactor DEBUG libevreactor.py :287 | test_request_too_large[SSL-13-15]: Closing connection (140560398787984) to 127.0.88.1:9042
21:43:32,119 69 cassandra.io.libevreactor DEBUG libevreactor.py :291 | test_request_too_large[SSL-13-15]: Closed socket to 127.0.88.1:9042
21:43:32,120 69 cassandra.pool DEBUG libevreactor.py :295 | test_request_too_large[SSL-13-15]: Defunct or closed connection (140560398787984) returned to pool, potentially marking host 127.0.88.1:9042 as down
21:43:32,120 69 cassandra.pool DEBUG libevreactor.py :295 | test_request_too_large[SSL-13-15]: Shutting down connections to 127.0.88.1:9042
21:43:32,720 69 cassandra.cluster DEBUG thread.py :58 | test_request_too_large[SSL-13-15]: [control connection] Waiting for schema agreement
21:43:38,721 69 cassandra.cluster DEBUG thread.py :58 | test_request_too_large[SSL-13-15]: [control connection] Timed out waiting for response during schema agreement check: errors=None, last_host=None
21:43:44,726 69 cassandra.cluster DEBUG thread.py :58 | test_request_too_large[SSL-13-15]: [control connection] Timed out waiting for response during schema agreement check: errors=None, last_host=None
21:43:50,729 69 cassandra.cluster DEBUG thread.py :58 | test_request_too_large[SSL-13-15]: [control connection] Timed out waiting for response during schema agreement check: errors=None, last_host=None
21:43:56,730 69 cassandra.cluster DEBUG thread.py :58 | test_request_too_large[SSL-13-15]: [control connection] Timed out waiting for response during schema agreement check: errors=None, last_host=None
21:44:01,711 69 cassandra.connection DEBUG threading.py :1038 | test_request_too_large[SSL-13-15]: Cannot send heartbeat message on connection (140560398787984) to 127.0.88.1:9042
21:44:02,731 69 cassandra.cluster DEBUG thread.py :58 | test_request_too_large[SSL-13-15]: [control connection] Timed out waiting for response during schema agreement check: errors=None, last_host=None
21:44:08,733 69 cassandra.cluster DEBUG thread.py :58 | test_request_too_large[SSL-13-15]: [control connection] Timed out waiting for response during schema agreement check: errors=None, last_host=None
21:44:14,735 69 cassandra.cluster DEBUG thread.py :58 | test_request_too_large[SSL-13-15]: [control connection] Timed out waiting for response during schema agreement check: errors=None, last_host=None
21:44:20,736 69 cassandra.cluster DEBUG thread.py :58 | test_request_too_large[SSL-13-15]: [control connection] Timed out waiting for response during schema agreement check: errors=None, last_host=None
21:44:26,737 69 cassandra.cluster DEBUG thread.py :58 | test_request_too_large[SSL-13-15]: [control connection] Timed out waiting for response during schema agreement check: errors=None, last_host=None
21:44:31,713 69 cassandra.connection DEBUG threading.py :1038 | test_request_too_large[SSL-13-15]: Cannot send heartbeat message on connection (140560398787984) to 127.0.88.1:9042
21:44:31,714 69 cassandra.connection DEBUG threading.py :1038 | test_request_too_large[SSL-13-15]: Sending options message heartbeat on idle connection (140560390349456) 127.0.88.1:9042
21:44:32,720 69 cassandra.cluster DEBUG thread.py :58 | test_request_too_large[SSL-13-15]: [control connection] Timed out waiting for response during schema agreement check: errors=None, last_host=None
21:44:32,721 69 cassandra.cluster WARNING thread.py :58 | test_request_too_large[SSL-13-15]: Node 127.0.88.1:9042 is reporting a schema disagreement: None
21:44:32,722 69 cassandra.cluster DEBUG thread.py :58 | test_request_too_large[SSL-13-15]: Skipping schema refresh due to lack of schema agreement
21:44:32,733 69 cassandra.cluster DEBUG thread.py :58 | test_request_too_large[SSL-13-15]: [control connection] Waiting for schema agreement
21:44:38,737 69 cassandra.cluster DEBUG thread.py :58 | test_request_too_large[SSL-13-15]: [control connection] Timed out waiting for response during schema agreement check: errors=None, last_host=None
21:44:44,739 69 cassandra.cluster DEBUG thread.py :58 | test_request_too_large[SSL-13-15]: [control connection] Timed out waiting for response during schema agreement check: errors=None, last_host=None
21:44:50,740 69 cassandra.cluster DEBUG thread.py :58 | test_request_too_large[SSL-13-15]: [control connection] Timed out waiting for response during schema agreement check: errors=None, last_host=None
21:44:56,741 69 cassandra.cluster DEBUG thread.py :58 | test_request_too_large[SSL-13-15]: [control connection] Timed out waiting for response during schema agreement check: errors=None, last_host=None
21:45:01,715 69 cassandra.connection WARNING threading.py :1038 | test_request_too_large[SSL-13-15]: Heartbeat failed for connection (140560390349456) to 127.0.88.1:9042
21:45:01,715 69 cassandra.connection DEBUG threading.py :1038 | test_request_too_large[SSL-13-15]: Defuncting connection (140560390349456) to 127.0.88.1:9042: errors=Connection heartbeat timeout after 30 seconds, last_host=127.0.88.1:9042
21:45:01,716 69 cassandra.io.libevreactor DEBUG libevreactor.py :287 | test_request_too_large[SSL-13-15]: Closing connection (140560390349456) to 127.0.88.1:9042
21:45:01,716 69 cassandra.io.libevreactor DEBUG libevreactor.py :291 | test_request_too_large[SSL-13-15]: Closed socket to 127.0.88.1:9042
21:45:01,717 69 cassandra.cluster DEBUG thread.py :58 | test_request_too_large[SSL-13-15]: [control connection] Error refreshing schema
working one:
21:43:45,413 66 limits_test INFO limits_test.py :377 | test_request_too_large[SSL-7-15]: Trying to send a large request to database (insert a large string into table)...
21:43:45,507 66 cassandra.protocol_features DEBUG protocol_features.py:56 | test_request_too_large[SSL-7-15]: Parsing sharding info from message options {'COMPRESSION': ['lz4', 'snappy'], 'SCYLLA_LWT_ADD_METADATA_MARK': ['LWT_OPTIMIZATION_META_BIT_MASK=2147483648'], 'SCYLLA_NR_SHARDS': ['2'], 'SCYLLA_PARTITIONER': ['org.apache.cassandra.dht.Murmur3Partitioner'], 'SCYLLA_RATE_LIMIT_ERROR': ['ERROR_CODE=61440'], 'SCYLLA_SHARD': ['0'], 'SCYLLA_SHARDING_ALGORITHM': ['biased-token-round-robin'], 'SCYLLA_SHARDING_IGNORE_MSB': ['12'], 'SCYLLA_SHARD_AWARE_PORT': ['19042']}
21:43:45,507 66 cassandra.connection DEBUG libevreactor.py :370 | test_request_too_large[SSL-7-15]: Received options response on new connection (140588010907728) from 127.0.29.1:9042
21:43:45,507 66 cassandra.connection DEBUG libevreactor.py :370 | test_request_too_large[SSL-7-15]: No available compression types supported on both ends. locally supported: odict_keys([]). remotely supported: ['lz4', 'snappy']
21:43:45,508 66 cassandra.connection DEBUG libevreactor.py :370 | test_request_too_large[SSL-7-15]: Sending StartupMessage on <LibevConnection(140588010907728) 127.0.29.1:9042>
21:43:45,508 66 cassandra.connection DEBUG libevreactor.py :370 | test_request_too_large[SSL-7-15]: Sent StartupMessage on <LibevConnection(140588010907728) 127.0.29.1:9042>
21:43:45,508 66 cassandra.io.libevreactor DEBUG libevreactor.py :372 | test_request_too_large[SSL-7-15]: Connection <LibevConnection(140588240189008) 127.0.29.1:9042> closed by server
21:43:45,508 66 cassandra.io.libevreactor DEBUG libevreactor.py :287 | test_request_too_large[SSL-7-15]: Closing connection (140588240189008) to 127.0.29.1:9042
21:43:45,508 66 cassandra.io.libevreactor DEBUG libevreactor.py :291 | test_request_too_large[SSL-7-15]: Closed socket to 127.0.29.1:9042
21:43:45,508 66 cassandra.pool DEBUG libevreactor.py :295 | test_request_too_large[SSL-7-15]: Defunct or closed connection (140588240189008) returned to pool, potentially marking host 127.0.29.1:9042 as down
21:43:45,508 66 cassandra.pool DEBUG libevreactor.py :295 | test_request_too_large[SSL-7-15]: Shutting down connections to 127.0.29.1:9042
21:43:45,508 66 cassandra.pool DEBUG libevreactor.py :295 | test_request_too_large[SSL-7-15]: Closing connection (140588240189008) to 127.0.29.1:9042
21:43:45,508 66 cassandra.cluster WARNING libevreactor.py :295 | test_request_too_large[SSL-7-15]: Host 127.0.29.1:9042 has been marked down
21:43:45,508 66 cassandra.cluster DEBUG thread.py :58 | test_request_too_large[SSL-7-15]: [control connection] Control connection host (127.0.29.1:9042) is considered down, starting reconnection
21:43:45,508 66 cassandra.cluster DEBUG thread.py :58 | test_request_too_large[SSL-7-15]: Removed connection pool for <Host: 127.0.29.1:9042 datacenter1>
21:43:45,509 66 cassandra.cluster DEBUG thread.py :58 | test_request_too_large[SSL-7-15]: Starting reconnector for host 127.0.29.1:9042
21:43:45,509 66 cassandra.cluster DEBUG thread.py :58 | test_request_too_large[SSL-7-15]: [control connection] Attempting to reconnect
21:43:45,509 66 cassandra.connection DEBUG libevreactor.py :370 | test_request_too_large[SSL-7-15]: Got ReadyMessage on new connection (140588010907728) from 127.0.29.1:9042
21:43:45,509 66 cassandra.pool DEBUG thread.py :58 | test_request_too_large[SSL-7-15]: Host 127.0.29.1:9042 is now marked up
21:43:45,509 66 cassandra.pool DEBUG thread.py :58 | test_request_too_large[SSL-7-15]: Received a connection 140588010907728 for shard_id=0 on host 127.0.29.1:9042
21:43:45,510 66 cassandra.cluster DEBUG thread.py :58 | test_request_too_large[SSL-7-15]: [control connection] Opening new connection to 127.0.29.1:9042
21:43:45,510 66 cassandra.pool DEBUG thread.py :58 | test_request_too_large[SSL-7-15]: Pool for host 127.0.29.1:9042 is in shutdown, closing the new connection (140588010907728)
21:43:45,510 66 cassandra.io.libevreactor DEBUG libevreactor.py :287 | test_request_too_large[SSL-7-15]: Closing connection (140588010907728) to 127.0.29.1:9042
21:43:45,510 66 cassandra.io.libevreactor DEBUG libevreactor.py :291 | test_request_too_large[SSL-7-15]: Closed socket to 127.0.29.1:9042
21:43:45,511 66 cassandra.pool DEBUG thread.py :58 | test_request_too_large[SSL-7-15]: Shutting down connections to 127.0.29.1:9042
21:43:45,511 66 cassandra.pool DEBUG thread.py :58 | test_request_too_large[SSL-7-15]: Initializing connection for host 127.0.29.1:9042
21:43:45,512 66 cassandra.cluster DEBUG dtest_setup.py :436 | test_request_too_large[SSL-7-15]: Connecting to cluster, contact points: ['127.0.29.1']; protocol version: 4
21:43:45,512 66 cassandra.pool DEBUG dtest_setup.py :436 | test_request_too_large[SSL-7-15]: Host 127.0.29.1:9042 is now marked up
21:43:45,513 66 cassandra.cluster DEBUG dtest_setup.py :436 | test_request_too_large[SSL-7-15]: [control connection] Opening new connection to 127.0.29.1:9042
21:43:45,514 66 cassandra.connection DEBUG libevreactor.py :267 | test_request_too_large[SSL-7-15]: Connection 140588010976016 127.0.0.1:40320 -> 127.0.29.1:9042
21:43:45,515 66 cassandra.connection DEBUG libevreactor.py :274 | test_request_too_large[SSL-7-15]: Sending initial options message for new connection (140588010976016) to 127.0.29.1:9042
21:43:45,515 66 cassandra.connection DEBUG libevreactor.py :267 | test_request_too_large[SSL-7-15]: Connection 140588011015248 127.0.0.1:40332 -> 127.0.29.1:9042
21:43:45,516 66 cassandra.connection DEBUG libevreactor.py :274 | test_request_too_large[SSL-7-15]: Sending initial options message for new connection (140588011015248) to 127.0.29.1:9042
21:43:45,519 66 cassandra.connection DEBUG libevreactor.py :267 | test_request_too_large[SSL-7-15]: Connection 140588011024912 127.0.0.1:40348 -> 127.0.29.1:9042
21:43:45,519 66 cassandra.connection DEBUG libevreactor.py :274 | test_request_too_large[SSL-7-15]: Sending initial options message for new connection (140588011024912) to 127.0.29.1:9042
21:43:45,560 66 cassandra.protocol_features DEBUG protocol_features.py:56 | test_request_too_large[SSL-7-15]: Parsing sharding info from message options {'COMPRESSION': ['lz4', 'snappy'], 'SCYLLA_LWT_ADD_METADATA_MARK': ['LWT_OPTIMIZATION_META_BIT_MASK=2147483648'], 'SCYLLA_NR_SHARDS': ['2'], 'SCYLLA_PARTITIONER': ['org.apache.cassandra.dht.Murmur3Partitioner'], 'SCYLLA_RATE_LIMIT_ERROR': ['ERROR_CODE=61440'], 'SCYLLA_SHARD': ['0'], 'SCYLLA_SHARDING_ALGORITHM': ['biased-token-round-robin'], 'SCYLLA_SHARDING_IGNORE_MSB': ['12'], 'SCYLLA_SHARD_AWARE_PORT': ['19042']}
21:43:45,560 66 cassandra.connection DEBUG libevreactor.py :370 | test_request_too_large[SSL-7-15]: Received options response on new connection (140588011015248) from 127.0.29.1:9042
21:43:45,560 66 cassandra.connection DEBUG libevreactor.py :370 | test_request_too_large[SSL-7-15]: No available compression types supported on both ends. locally supported: odict_keys([]). remotely supported: ['lz4', 'snappy']
21:43:45,560 66 cassandra.connection DEBUG libevreactor.py :370 | test_request_too_large[SSL-7-15]: Sending StartupMessage on <LibevConnection(140588011015248) 127.0.29.1:9042>
21:43:45,560 66 cassandra.connection DEBUG libevreactor.py :370 | test_request_too_large[SSL-7-15]: Sent StartupMessage on <LibevConnection(140588011015248) 127.0.29.1:9042>
21:43:45,561 66 cassandra.protocol_features DEBUG protocol_features.py:56 | test_request_too_large[SSL-7-15]: Parsing sharding info from message options {'COMPRESSION': ['lz4', 'snappy'], 'SCYLLA_LWT_ADD_METADATA_MARK': ['LWT_OPTIMIZATION_META_BIT_MASK=2147483648'], 'SCYLLA_NR_SHARDS': ['2'], 'SCYLLA_PARTITIONER': ['org.apache.cassandra.dht.Murmur3Partitioner'], 'SCYLLA_RATE_LIMIT_ERROR': ['ERROR_CODE=61440'], 'SCYLLA_SHARD': ['1'], 'SCYLLA_SHARDING_ALGORITHM': ['biased-token-round-robin'], 'SCYLLA_SHARDING_IGNORE_MSB': ['12'], 'SCYLLA_SHARD_AWARE_PORT': ['19042']}
21:43:45,561 66 cassandra.connection DEBUG libevreactor.py :370 | test_request_too_large[SSL-7-15]: Received options response on new connection (140588010976016) from 127.0.29.1:9042
21:43:45,561 66 cassandra.connection DEBUG libevreactor.py :370 | test_request_too_large[SSL-7-15]: No available compression types supported on both ends. locally supported: odict_keys([]). remotely supported: ['lz4', 'snappy']
21:43:45,561 66 cassandra.connection DEBUG libevreactor.py :370 | test_request_too_large[SSL-7-15]: Sending StartupMessage on <LibevConnection(140588010976016) 127.0.29.1:9042>
21:43:45,561 66 cassandra.connection DEBUG libevreactor.py :370 | test_request_too_large[SSL-7-15]: Sent StartupMessage on <LibevConnection(140588010976016) 127.0.29.1:9042>
21:43:45,561 66 cassandra.connection DEBUG libevreactor.py :370 | test_request_too_large[SSL-7-15]: Got ReadyMessage on new connection (140588011015248) from 127.0.29.1:9042
21:43:45,561 66 cassandra.pool DEBUG thread.py :58 | test_request_too_large[SSL-7-15]: First connection created to 127.0.29.1:9042 for shard_id=0
21:43:45,561 66 cassandra.connection DEBUG libevreactor.py :370 | test_request_too_large[SSL-7-15]: Got ReadyMessage on new connection (140588010976016) from 127.0.29.1:9042
21:43:45,561 66 cassandra.cluster DEBUG thread.py :58 | test_request_too_large[SSL-7-15]: [control connection] Established new connection <LibevConnection(140588010976016) 127.0.29.1:9042>, registering watchers and refreshing schema and topology
21:43:45,562 66 cassandra.pool DEBUG thread.py :58 | test_request_too_large[SSL-7-15]: Finished initializing connection for host 127.0.29.1:9042
21:43:45,562 66 cassandra.cluster DEBUG thread.py :58 | test_request_too_large[SSL-7-15]: Added pool for host 127.0.29.1:9042 to session
21:43:45,562 66 cassandra.pool DEBUG thread.py :58 | test_request_too_large[SSL-7-15]: shard_aware_endpoint=None
Looks like there are cases where the driver doesn't reconnect after the connection is closed by scylla (because of the too-big request) and keeps failing to connect; it's not clear why.
@roydahan @mykaul, can you move this one to scylladb/python-driver ?
After debugging it more on the driver end, it seems to happen as follows:
- we connect to one shard
- the 2nd shard connection is being opened in the background
- we send out the big request, which goes to the 2nd connection, and scylla closes it while we are inside `set_keyspace_blocking` and holding the pool lock

Also, it seems we are doing a query inside `set_keyspace_blocking` without a timeout.
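To make that concrete, here is a minimal toy sketch of the scenario (illustrative names only, not the driver's actual internals): a request blocks on a response that never arrives while holding the pool lock, so the reconnect path, which needs the same lock, can never run.

```python
import threading
import time

pool_lock = threading.Lock()
response_ready = threading.Event()  # never set: the server already closed the socket


def send_request_while_holding_lock(timeout=None):
    with pool_lock:
        # With timeout=None this wait never returns and the lock is held forever,
        # which is the hang seen in the failing log. With a timeout the request
        # fails, the lock is released, and reconnection can proceed.
        if not response_ready.wait(timeout):
            print("request timed out, releasing the pool lock")


def reconnect():
    with pool_lock:  # the reconnect logic needs the same lock
        print("reconnected")


worker = threading.Thread(target=send_request_while_holding_lock,
                          kwargs={"timeout": 2}, daemon=True)
worker.start()
time.sleep(0.1)
worker.join()  # finishes only because a timeout was passed
reconnect()    # would block forever if the worker never released the lock
```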
Aha, very tricky.
Any idea why the connection is closed by the server? Of course, the driver should be prepared for it.
Any idea why the connection is closed by the server? Of course, the driver should be prepared for it.
Well, the test is sending a huge request on purpose, but the connection getting stuck here isn't the connection that should be closed because of the too-big request.
So it's not yet clear to me why we are seeing it in the SSL case only, BTW.
The hunch is that we recently started using one ssl_context, so it might be affecting it too.
Anyhow, just putting a timeout on the stuck request fixes the situation. And each request should have some timeout, especially when issued while holding a lock.
I just need to make sure it doesn't break all the other cases.
And the plot thickens a bit: we recently updated the SSL protocol used in this test from ssl.PROTOCOL_TLSv1_2 to ssl.PROTOCOL_TLS_CLIENT (which means the latest negotiated version, i.e. TLS 1.3), and with ssl.PROTOCOL_TLSv1_2 I can't reproduce the issue at all.
So there is something related to the TLS protocol version that causes this change of behavior...
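For reference, a sketch of the practical difference between the two constants (standard library only; not necessarily the test's exact TLS setup):

```python
import ssl

# Old test setup: pin the protocol to TLS 1.2 (deprecated constant, but still works).
old_ctx = ssl.SSLContext(ssl.PROTOCOL_TLSv1_2)

# New test setup: negotiate the highest version both sides support (typically
# TLS 1.3). This constant also enables hostname checking and certificate
# verification by default, so a permissive test setup has to turn them off.
new_ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
new_ctx.check_hostname = False
new_ctx.verify_mode = ssl.CERT_NONE

# One way to check whether TLS 1.3 itself is the trigger, without going back
# to the deprecated constant, is to cap the maximum version:
capped_ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
capped_ctx.check_hostname = False
capped_ctx.verify_mode = ssl.CERT_NONE
capped_ctx.maximum_version = ssl.TLSVersion.TLSv1_2
```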
@elcallio any idea what's going on?
Is there a driver version where things work, and one where it does not? Again, I cannot repro this at all. PROTOCOL_TLS_CLIENT should mean we negotiate TLS 1.3, but I don't see how that would impact detection of disconnect or error propagation.
IIUC, python SSL is just an openssl wrapper.
Looking at recent seastar changes, this change might be affecting the connection-closing logic? scylladb/seastar@7164ae143e0d0493a1c8bd91b45e9c0c87ab5deb
Cc @xemul
It happens with the latest driver version 3.26.3 and latest scylla master, about 1 in 10 runs, like this:
pytest limits_test.py::TestLimits::test_request_too_large -k "[SSL" --scylla-version=unstable/master:latest --log-cli-level=debug -n 2 --count=10
You can play with the -n or --count values to make it happen a bit more often. I can easily reproduce it locally, multiple times, this way.
I think that seastar change (scylladb/seastar@7164ae1) is safe, unless some code in scylla decided to actually turn off wait-for-eof...
Running the same, with iterations of 10, 20, more... No result whatsoever. Same driver, same scylla. Weird. Most likely a network setup difference...
Maybe different openssl versions? Try running with docker.
I did. :-) Inside and outside the dtest container.
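A quick way to compare the two environments is to print what each python/openssl combination supports and what actually gets negotiated; a sketch (the node address is a placeholder for the test node):

```python
import socket
import ssl

print(ssl.OPENSSL_VERSION)  # which openssl build the interpreter wraps
print(ssl.HAS_TLSv1_3)      # whether TLS 1.3 is available at all

ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE
with socket.create_connection(("127.0.29.1", 9042), timeout=5) as sock:
    with ctx.wrap_socket(sock) as tls:
        print(tls.version())  # the TLS version actually negotiated with the node
```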
https://github.com/scylladb/scylla-dtest/pull/3670 is hiding this issue for now, as far as I can see.
TODO: run the test with tcpdump, so we have a working network dump and a failing network dump to compare.
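A possible capture helper, assuming tcpdump is installed and the script has the needed privileges (the address, ports, and file names are placeholders that follow the layout of the archive below):

```python
import subprocess

NODE_ADDR = "127.0.29.1"


def start_capture(port: int) -> subprocess.Popen:
    """Capture both directions of traffic to/from the node on the given port."""
    return subprocess.Popen([
        "tcpdump", "-i", "lo", "-U",
        "-w", f"tcpdump__{NODE_ADDR}_{port}.pcap",
        f"host {NODE_ADDR} and port {port}",
    ])


captures = [start_capture(9042), start_capture(19042)]  # CQL port + shard-aware port
# ... run the failing test here ...
for proc in captures:
    proc.terminate()
    proc.wait()
```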
and now with tcpdumps:
aws s3 cp s3://scylla-qa-public/fruch/python-driver-issue-261/log_and_tcpdump.tar.gz .
the failed dump:
logs/failure/ccm_node.crl
logs/failure/ccm_node.cer
logs/failure/ccm_node.pem
logs/failure/ccm_node.key
logs/failure/tcpdump__127.0.29.1_19042.pcap
logs/failure/tcpdump__127.0.29.1_9042.pcap
Two examples of the test passing:
logs/1697665964613_limits_test.py::TestLimits::test_request_too_large[SSL-8-10]/
logs/1697665964613_limits_test.py::TestLimits::test_request_too_large[SSL-8-10]/node1.log
logs/1697665964613_limits_test.py::TestLimits::test_request_too_large[SSL-8-10]/ccm_node.key
logs/1697665964613_limits_test.py::TestLimits::test_request_too_large[SSL-8-10]/ccm_node.crl
logs/1697665964613_limits_test.py::TestLimits::test_request_too_large[SSL-8-10]/ccm_node.cer
logs/1697665964613_limits_test.py::TestLimits::test_request_too_large[SSL-8-10]/tcpdump__127.0.62.1_19042.pcap
logs/1697665964613_limits_test.py::TestLimits::test_request_too_large[SSL-8-10]/node1_system.log.jmx
logs/1697665964613_limits_test.py::TestLimits::test_request_too_large[SSL-8-10]/ccm_node.pem
logs/1697665964613_limits_test.py::TestLimits::test_request_too_large[SSL-8-10]/tcpdump__127.0.62.1_9042.pcap
logs/1697665950976_limits_test.py::TestLimits::test_request_too_large[SSL-10-10]/tcpdump__127.0.62.1_19042.pcap
logs/1697665950976_limits_test.py::TestLimits::test_request_too_large[SSL-10-10]/node1_system.log.jmx
logs/1697665950976_limits_test.py::TestLimits::test_request_too_large[SSL-10-10]/ccm_node.key
logs/1697665950976_limits_test.py::TestLimits::test_request_too_large[SSL-10-10]/ccm_node.pem
logs/1697665950976_limits_test.py::TestLimits::test_request_too_large[SSL-10-10]/ccm_node.cer
logs/1697665950976_limits_test.py::TestLimits::test_request_too_large[SSL-10-10]/ccm_node.crl
logs/1697665950976_limits_test.py::TestLimits::test_request_too_large[SSL-10-10]/tcpdump__127.0.62.1_9042.pcap
logs/1697665950976_limits_test.py::TestLimits::test_request_too_large[SSL-10-10]/node1.log
I figured I had captured only one direction of the communication, so here's one that captures both directions (similar directory structure):
aws s3 cp s3://scylla-qa-public/fruch/python-driver-issue-261/log_and_tcpdump_2_direction.tar.gz
It can also be downloaded over HTTPS:
wget https://scylla-qa-public.s3.amazonaws.com/fruch/python-driver-issue-261/log_and_tcpdump_2_direction.tar.gz
@elcallio, did you have a chance to look at those tcpdumps?
@fruch - did you manage to decode those?
INFO 2023-10-18 21:52:33,782 [shard 1:stat] cql_server - exception while processing connection: std::system_error (error system:107, Transport endpoint is not connected)
WARN 2023-10-18 21:52:33,983 [shard 0:stat] seastar_memory - oversized allocation: 8388608 bytes. This is non-fatal, but could lead to latency and/or fragmentation issues. Please report: at 0x5fa1cbe 0x5fa2280 0x5fa2558 0x5a5ffa6 0x5a602f1 0x5a62c05 0x15668fc 0x27d961c 0x27d6ec8 0x27f5268 0x27cb602 0x2828ae4 0x2828a35 0x283df5f 0x283d8ab 0x283cc31 0x283be2d 0x5a4fd82 0x5aa4edf 0x5aa61b7 0x5aa5529 0x5a47e07 0x5a46fbc 0x1315b0f 0x1317580 0x131405c /home/fruch/.ccm/scylla-repository/unstable/master/latest/libreloc/libc.so.6+0x27b89 /home/fruch/.ccm/scylla-repository/unstable/master/latest/libreloc/libc.so.6+0x27c4a 0x1311ca4
--------
seastar::lambda_task<seastar::execution_stage::flush()::$_0>
WARN 2023-10-18 21:52:33,993 [shard 0:stat] seastar_memory - oversized allocation: 1970176 bytes. This is non-fatal, but could lead to latency and/or fragmentation issues. Please report: at 0x5fa1cbe 0x5fa2280 0x5fa2558 0x5a5ffa6 0x5a602f1 0x5a62ff1 0x5284faf 0x53536b8 0x5351dd2 0x52a3494 0x52aef67 0x33e49a5 0x2da2cac 0x339913f 0x27d8bad 0x27d6ec8 0x27f5268 0x27cb602 0x2828ae4 0x2828a35 0x283df5f 0x283d8ab 0x283cc31 0x283be2d 0x5a4fd82 0x5aa4edf 0x5aa61b7 0x5aa5529 0x5a47e07 0x5a46fbc 0x1315b0f 0x1317580 0x131405c /home/fruch/.ccm/scylla-repository/unstable/master/latest/libreloc/libc.so.6+0x27b89 /home/fruch/.ccm/scylla-repository/unstable/master/latest/libreloc/libc.so.6+0x27c4a 0x1311ca4
--------
seastar::lambda_task<seastar::execution_stage::flush()::$_0>
WARN 2023-10-18 21:52:33,997 [shard 0:stat] seastar_memory - oversized allocation: 3936256 bytes. This is non-fatal, but could lead to latency and/or fragmentation issues. Please report: at 0x5fa1cbe 0x5fa2280 0x5fa2558 0x5a5ffa6 0x5a602f1 0x5a62ff1 0x5284faf 0x53536b8 0x5351dd2 0x52a3494 0x52aef67 0x33e49a5 0x2da2cac 0x339913f 0x27d8bad 0x27d6ec8 0x27f5268 0x27cb602 0x2828ae4 0x2828a35 0x283df5f 0x283d8ab 0x283cc31 0x283be2d 0x5a4fd82 0x5aa4edf 0x5aa61b7 0x5aa5529 0x5a47e07 0x5a46fbc 0x1315b0f 0x1317580 0x131405c /home/fruch/.ccm/scylla-repository/unstable/master/latest/libreloc/libc.so.6+0x27b89 /home/fruch/.ccm/scylla-repository/unstable/master/latest/libreloc/libc.so.6+0x27c4a 0x1311ca4
--------
seastar::lambda_task<seastar::execution_stage::flush()::$_0>
WARN 2023-10-18 21:52:34,003 [shard 0:stat] seastar_memory - oversized allocation: 7868416 bytes. This is non-fatal, but could lead to latency and/or fragmentation issues. Please report: at 0x5fa1cbe 0x5fa2280 0x5fa2558 0x5a5ffa6 0x5a602f1 0x5a62ff1 0x5284faf 0x53536b8 0x5351dd2 0x52a3494 0x52aef67 0x33e49a5 0x2da2cac 0x339913f 0x27d8bad 0x27d6ec8 0x27f5268 0x27cb602 0x2828ae4 0x2828a35 0x283df5f 0x283d8ab 0x283cc31 0x283be2d 0x5a4fd82 0x5aa4edf 0x5aa61b7 0x5aa5529 0x5a47e07 0x5a46fbc 0x1315b0f 0x1317580 0x131405c /home/fruch/.ccm/scylla-repository/unstable/master/latest/libreloc/libc.so.6+0x27b89 /home/fruch/.ccm/scylla-repository/unstable/master/latest/libreloc/libc.so.6+0x27c4a 0x1311ca4
--------
They are irrelevant; this test intentionally creates huge requests, so those are expected. What's not expected is for the client to get stuck because of them.
This test is invalidated in newer scylla.