gemini fails with: mutation failed after 11 attempts (took 33.65245129s) with query: INSERT INTO ks1.table1 .. mutation error: gocql: no response received from cassandra within timeout period (potentially executed: true)
gemini-gocql-driver v1.15.3 2025-09-06T16:49:42Z e35803084ebafd200e3f7fd74a5be5dfdb409b2d gemini 2.1.5 2025-10-14T18:23:43Z 1bad12f14f6832dbdc2211627079d91d8c610bf3
https://argus.scylladb.com/tests/scylla-cluster-tests/1d477c36-a689-45cd-a21a-3f18b7a50bc5
got an error of:
[2025-12-25 11:19:18.913] {"level":"warn","ts":"2025-12-25T11:19:18.879051186Z","logger":"store.delegating_store","msg":"mutation failed, retrying with exponential backoff","attempt":10,"max_attempts":11,"error":"mutation error: gocql: no response received from cassandra within timeout period (potentially executed: true), partition keys: {\"pk0\":[93607315507203],\"pk1\":[\"5cae5f2f97\"],\"pk2\":[\"b310ae70-46f7-1d0d-8343-0a4da8ced0cd\"],\"pk3\":[7096665136716317756]}","failed_stores":["test"],"successful_stores":["oracle"],"retrying_stores":["test"]}
[2025-12-25 11:19:18.913] {"level":"info","ts":"2025-12-25T11:19:18.881111198Z","logger":"jobs","msg":"stop jobs"}
[2025-12-25 11:19:18.913] {"level":"error","ts":"2025-12-25T11:19:18.881123066Z","msg":"failed to run gemini workload","error":"JobError(err=mutation failed after 11 attempts (took 33.65245129s) with query: INSERT INTO ks1.table1 (pk0,pk1,pk2,pk3,ck0,ck1,ck2,ck3,col0,col1,col2,col3,col4,col5,col6,col7,col8,col9,col10) VALUES (?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?) (has 4 partition keys)\n\nAttempt details:\n attempt 0 [test (took 3.021206101s)]: mutation error: gocql: no response received from cassandra within timeout period (potentially executed: true), partition keys: {\"pk0\":[158099310902935],\"pk1\":[\"9bc2e97cb65b2d9ec769dce6733198\"],\"pk2\":[\"c2ee0960-f504-1260-9026-0a4da8ced0cd\"],\"pk3\":[8373491747353809152]}\n attempt 1 [test (took 3.031168206s)]
@yarongilor what's the point of running with 2.1.5?
Even running with 2.1.5, what's the issue here? I don't see it. If gemini sent a query to be executed and it could not be executed, that doesn't look like a gemini bug to me; it is either an SCT, driver, or Scylla issue. There is no gemini error here: the driver reported a query timeout, 10 retries were done, and the mutation still could not complete.
What I see here is:
- Gemini retry system working fine
- Query and data generated and sent
- Timeout query retried until success or limit reached
The only thing gemini could possibly handle here is the "potentially executed" part, but there is no guarantee that the write actually exists in Scylla.
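For readers unfamiliar with the retry behavior described above, here is a minimal Go sketch of "retry with exponential backoff until success or limit reached". It is not gemini's actual implementation: `retryWithBackoff`, `errTimeout`, and the delay values are hypothetical names chosen for illustration. Retrying after a "potentially executed: true" timeout is only safe when the mutation is idempotent, which holds for an INSERT that rewrites the same values.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errTimeout is a hypothetical stand-in for the driver's timeout error;
// it is not a value exported by gocql.
var errTimeout = errors.New("no response received within timeout period")

// retryWithBackoff calls op until it succeeds or maxAttempts is reached,
// doubling the delay after every failed attempt. It returns the number of
// attempts made and, on failure, the last error wrapped with context.
func retryWithBackoff(maxAttempts int, baseDelay time.Duration, op func(attempt int) error) (int, error) {
	if maxAttempts < 1 {
		maxAttempts = 1 // always try at least once
	}
	var lastErr error
	delay := baseDelay
	for attempt := 0; attempt < maxAttempts; attempt++ {
		lastErr = op(attempt)
		if lastErr == nil {
			return attempt + 1, nil
		}
		// Sleep only if another attempt will follow.
		if attempt < maxAttempts-1 {
			time.Sleep(delay)
			delay *= 2
		}
	}
	return maxAttempts, fmt.Errorf("mutation failed after %d attempts: %w", maxAttempts, lastErr)
}

func main() {
	// Simulate a node that times out twice and then answers.
	attempts, err := retryWithBackoff(11, time.Millisecond, func(attempt int) error {
		if attempt < 2 {
			return errTimeout
		}
		return nil
	})
	fmt.Println("attempts:", attempts, "err:", err)
}
```

In the run reported in this issue, every one of the 11 attempts timed out, so a loop like this would exhaust its limit and return the wrapped error, which is the behavior the log shows.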
@fruch, it's for the purpose of testing PR https://github.com/scylladb/scylla-cluster-tests/pull/13016 before it is merged to master, so it just tests 2.1.5, which is what master uses. Once this PR is merged, there's indeed no point in using 2.1.5.
Let's test that PR with the new version (there is a PR open for that in SCT) and ignore this issue unless it comes up in the new version.
Looks like a driver or Scylla issue. This error started to appear right after blocking one Scylla node:
< t:2025-12-25 11:18:26,837 f:remote_base.py l:650 c:RemoteLibSSH2CmdRunner p:DEBUG > <10.4.2.248>: Running command "sudo iptables -A INPUT -s 10.4.3.117 -p tcp --dport 19142 -j DROP"...
< t:2025-12-25 11:18:27,334 f:base.py l:276 c:RemoteLibSSH2CmdRunner p:DEBUG > <10.4.3.117>: {"level":"error","ts":"2025-12-25T11:18:27.196552828Z","logger":"store.test_store","msg":"mutation failed","system":"test","query_type":"InsertJSONStatement","error":"gocql: no response received from cassandra within timeout period (potentially executed: true)"}
In order to test the new gemini version 2.2.x, PR https://github.com/scylladb/scylla-cluster-tests/pull/12605 is required. But IIRC it recently ran into other issues that are blocking it from being merged.
@yarongilor Please update the PR with info on what is blocking it.
@yarongilor this test is worth investigating further: during the test, Gemini was generating only ~50 req/s. We need to know whether this is just a bad schema for Scylla (it has many columns and is complicated) and identify the bottleneck. It doesn't look like CPU, as the 'load' graph is close to 0. Maybe Gemini was the bottleneck here (although this issue suggests Scylla didn't respond in time), so we need to check CPU usage on the Gemini instance.
Since https://github.com/scylladb/scylla-cluster-tests/pull/12605 is merged, we can retest with it now.
Rerunning failed with: https://github.com/scylladb/gemini/issues/607
Closing this issue as it was moved to Jira. Please continue the thread in https://scylladb.atlassian.net/browse/QATOOLS-116