Israel Fruchter

Results 570 comments of Israel Fruchter

> I think we need to confirm the root cause before applying a fix. The root cause is most likely the fact that the driver is using the keyspace at...

happened again this week, but the logs weren't fully collected ## Installation details Kernel Version: 5.15.0-1035-aws Scylla version (or git commit hash): `5.3.0~dev-20230508.5fa459bd1a4a` with build-id `5d6a48f1aeb4a57ecbbb890013b0c3b2881ad268` Cluster size: 6 nodes (i4i.4xlarge)...

@soyacz can you take a closer look at what we can do to gather more information on this `ssh2.exceptions.SocketRecvError` failure?

> @fruch isn't it a scylla issue? Not without proof, it's not 100% reproducible. We need a more focused reproducer to be able to show whether it's scylla or not.

seems very similar to: https://github.com/scylladb/scylladb/issues/11528#issuecomment-1365206311 We get a reactor stall and ssh gets disconnected at the same time: ``` 2023-05-07T09:23:31+00:00 longevity-tls-50gb-3d-master-db-node-e54b587c-24 !INFO | scylla[5484]: Reactor stalled for 32 ms on...

> @fruch to work around this problem, I propose to add `WaitForRebuildCompletes(node)` context manager which could be used around `run_nodetool('rebuild')` that would wait for rebuild to be completed regardless of exception...
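For illustration only, a minimal sketch of what such a context manager might look like. This is not the SCT implementation: the function name `wait_for_rebuild_completes`, the `is_rebuild_running` probe, and the `timeout`, `poll_interval`, and `sleep` parameters are all hypothetical, and how the probe would actually detect a running rebuild on a node is left to the caller.

```python
import time
from contextlib import contextmanager


@contextmanager
def wait_for_rebuild_completes(is_rebuild_running, timeout=3600.0,
                               poll_interval=5.0, sleep=time.sleep):
    """Hypothetical helper: block on exit until the rebuild is no longer running.

    `is_rebuild_running` is an assumed zero-argument callable that returns
    True while the rebuild is still in progress on the node.
    """
    try:
        yield
    finally:
        # Runs whether the wrapped block succeeded or raised
        # (for example, a dropped SSH connection).
        deadline = time.monotonic() + timeout
        while is_rebuild_running():
            if time.monotonic() >= deadline:
                raise TimeoutError("rebuild did not complete within timeout")
            sleep(poll_interval)
```

In this sketch the original exception still propagates after the wait finishes, so the caller decides whether to ignore it. Usage would be along the lines of `with wait_for_rebuild_completes(probe): node.run_nodetool('rebuild')`, where `probe` is whatever check the test framework can provide.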

> I don't understand why we try to solve it in SCT and not in scylla. > It's not fixing the issue, it's more about recovering from it. The actual...

> > I don't understand why we try to solve it in SCT and not in scylla. > > > > It's not fixing the issue, it's more about recovering...

> Reproduced 2 times on longevity-10gb-3h-gce-test. > > ## How frequently does it reproduce? > Happened on longevity-10gb-3h-gce-test. Reproduced in builds 139 and 141 > > ## Installation details >...

@mykaul seems like we have enough occurrences of this, exactly during rebuild of a new node in a new region, and seen on both AWS and GCP. Can you transfer this...