
gemini fails with: mutation failed after 11 attempts (took 33.65245129s) with query: INSERT INTO ks1.table1 .. mutation error: gocql: no response received from cassandra within timeout period (potentially executed: true)

Open yarongilor opened this issue 2 months ago • 10 comments

gemini-gocql-driver v1.15.3 2025-09-06T16:49:42Z e35803084ebafd200e3f7fd74a5be5dfdb409b2d gemini 2.1.5 2025-10-14T18:23:43Z 1bad12f14f6832dbdc2211627079d91d8c610bf3

https://argus.scylladb.com/tests/scylla-cluster-tests/1d477c36-a689-45cd-a21a-3f18b7a50bc5

got an error of:

[2025-12-25 11:19:18.913] {"level":"warn","ts":"2025-12-25T11:19:18.879051186Z","logger":"store.delegating_store","msg":"mutation failed, retrying with exponential backoff","attempt":10,"max_attempts":11,"error":"mutation error: gocql: no response received from cassandra within timeout period (potentially executed: true), partition keys: {\"pk0\":[93607315507203],\"pk1\":[\"5cae5f2f97\"],\"pk2\":[\"b310ae70-46f7-1d0d-8343-0a4da8ced0cd\"],\"pk3\":[7096665136716317756]}","failed_stores":["test"],"successful_stores":["oracle"],"retrying_stores":["test"]}
[2025-12-25 11:19:18.913] {"level":"info","ts":"2025-12-25T11:19:18.881111198Z","logger":"jobs","msg":"stop jobs"}
[2025-12-25 11:19:18.913] {"level":"error","ts":"2025-12-25T11:19:18.881123066Z","msg":"failed to run gemini workload","error":"JobError(err=mutation failed after 11 attempts (took 33.65245129s) with query: INSERT INTO ks1.table1 (pk0,pk1,pk2,pk3,ck0,ck1,ck2,ck3,col0,col1,col2,col3,col4,col5,col6,col7,col8,col9,col10) VALUES (?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?)  (has 4 partition keys)\n\nAttempt details:\n  attempt 0 [test (took 3.021206101s)]: mutation error: gocql: no response received from cassandra within timeout period (potentially executed: true), partition keys: {\"pk0\":[158099310902935],\"pk1\":[\"9bc2e97cb65b2d9ec769dce6733198\"],\"pk2\":[\"c2ee0960-f504-1260-9026-0a4da8ced0cd\"],\"pk3\":[8373491747353809152]}\n  attempt 1 [test (took 3.031168206s)]

yarongilor avatar Dec 25 '25 17:12 yarongilor

@yarongilor what's the point of running with 2.1.5?

fruch avatar Dec 27 '25 17:12 fruch

Even running with 2.1.5, what's the issue here? I don't see it. If Gemini sent a query to be executed and it could not be executed, that does not look like a Gemini bug to me; it is either an SCT, driver, or Scylla issue. There is no Gemini error here. The driver reported a query timeout, 10 retries were done, and the query still could not complete.

What I see here is:

  • Gemini retry system working fine
  • Query and data generated and sent
  • Timeout query retried until success or limit reached

The only thing Gemini could possibly handle here is the "potentially executed" part, but there is no guarantee that the write actually exists in Scylla.

CodeLieutenant avatar Dec 28 '25 08:12 CodeLieutenant

> @yarongilor what's the point of running with 2.1.5?

@fruch, it's for testing PR https://github.com/scylladb/scylla-cluster-tests/pull/13016 before it is merged to master, so it simply uses 2.1.5, which is the version on master. Once this PR is merged, there's indeed no point in using 2.1.5.

yarongilor avatar Dec 28 '25 11:12 yarongilor

Let's test that PR with the new version (there is an open PR for that in SCT) and ignore this issue unless it comes up in the new version.

pehala avatar Dec 29 '25 08:12 pehala

Looks like a driver or Scylla issue. This error started to appear right after blocking one Scylla node:

< t:2025-12-25 11:18:26,837 f:remote_base.py  l:650  c:RemoteLibSSH2CmdRunner p:DEBUG > <10.4.2.248>: Running command "sudo iptables -A INPUT -s 10.4.3.117 -p tcp --dport 19142 -j DROP"...
< t:2025-12-25 11:18:27,334 f:base.py         l:276  c:RemoteLibSSH2CmdRunner p:DEBUG > <10.4.3.117>: {"level":"error","ts":"2025-12-25T11:18:27.196552828Z","logger":"store.test_store","msg":"mutation failed","system":"test","query_type":"InsertJSONStatement","error":"gocql: no response received from cassandra within timeout period (potentially executed: true)"}
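To line up Gemini errors like the one above with the time of the `iptables` command, one option is to pull the JSON payload out of each SCT log line and read its `ts` and `error` fields. The sketch below assumes the payload starts at the first `{` on the line; `logEntry`, `parseLine`, and `isDriverTimeout` are hypothetical helpers written for this example, not part of Gemini or SCT.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// logEntry holds the fields of interest from a Gemini JSON log record.
type logEntry struct {
	Level  string `json:"level"`
	TS     string `json:"ts"`
	Logger string `json:"logger"`
	Msg    string `json:"msg"`
	Error  string `json:"error"`
}

// parseLine decodes the JSON object that starts at the first '{' in line.
// Assumption: any prefix added by the log collector contains no '{'.
func parseLine(line string) (logEntry, error) {
	var e logEntry
	i := strings.Index(line, "{")
	if i < 0 {
		return e, fmt.Errorf("no JSON object in line")
	}
	err := json.Unmarshal([]byte(line[i:]), &e)
	return e, err
}

// isDriverTimeout reports whether the entry carries the gocql timeout text.
func isDriverTimeout(e logEntry) bool {
	return strings.Contains(e.Error, "no response received from cassandra within timeout period")
}

// sample is the Gemini record from the log excerpt above, with a shortened prefix.
const sample = `<10.4.3.117>: {"level":"error","ts":"2025-12-25T11:18:27.196552828Z","logger":"store.test_store","msg":"mutation failed","system":"test","query_type":"InsertJSONStatement","error":"gocql: no response received from cassandra within timeout period (potentially executed: true)"}`

func main() {
	e, err := parseLine(sample)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Println(e.TS, e.Logger, isDriverTimeout(e))
}
```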

soyacz avatar Dec 29 '25 12:12 soyacz

> Let's test that PR with the new version (there is an open PR for that in SCT) and ignore this issue unless it comes up in the new version.

To test the new Gemini version 2.2.x, PR https://github.com/scylladb/scylla-cluster-tests/pull/12605 is required. But IIRC it recently ran into other issues that block it from being merged.

yarongilor avatar Dec 29 '25 12:12 yarongilor

> > Let's test that PR with the new version (there is an open PR for that in SCT) and ignore this issue unless it comes up in the new version.

> To test the new Gemini version 2.2.x, PR scylladb/scylla-cluster-tests#12605 is required. But IIRC it recently ran into other issues that block it from being merged.

@yarongilor Please update the PR with info on what blocks it.

soyacz avatar Dec 29 '25 14:12 soyacz

@yarongilor this test is worth investigating further: during the test, Gemini was generating only ~50 req/s. We need to know whether this is just a bad schema for Scylla (it has many columns and is complicated) and identify the bottleneck. It doesn't look like CPU, as the 'load' graph is close to 0. Maybe Gemini was the bottleneck here (though from this issue it seems Scylla didn't respond in time), so we need to check CPU usage on the Gemini instance.

soyacz avatar Dec 29 '25 15:12 soyacz

Since https://github.com/scylladb/scylla-cluster-tests/pull/12605 is merged, we can retest with it now.

yarongilor avatar Dec 31 '25 09:12 yarongilor

Rerunning failed with: https://github.com/scylladb/gemini/issues/607

yarongilor avatar Jan 04 '26 17:01 yarongilor

Closing this issue as it was moved to Jira. Please continue the thread in https://scylladb.atlassian.net/browse/QATOOLS-116

dani-tweig avatar Jan 18 '26 05:01 dani-tweig