[3.4] ./test: line 229: kill: (7520) - No such process
I have seen this error multiple times in the 3.4 pipeline:
{"level":"info","ts":1658452719.9775898,"caller":"tester/cluster.go:585","msg":"closed tester HTTP server","tester-address":"127.0.0.1:9028"}
./test: line 229: kill: (7520) - No such process
./test: line 229: kill: (7521) - No such process
./test: line 229: kill: (7522) - No such process
ETCD_TESTER_EXIT_CODE: 2
FAILED! 'tail -1000 /tmp/etcd-functional-1/etcd.log'
tail: cannot open '/tmp/etcd-functional-1/etcd.log' for reading: No such file or directory
Error: Process completed with exit code 1.
Refer to https://github.com/etcd-io/etcd/runs/7460637442?check_suite_focus=true
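For what it's worth, the "No such process" lines themselves look harmless: the cleanup code in ./test presumably signals PIDs that have already exited. A guarded kill along the lines of the sketch below would silence that noise (this is a hypothetical illustration, not the actual script; the variable names are made up):

```bash
# Hypothetical cleanup loop: only signal processes that still exist,
# so already-exited PIDs do not produce "No such process" errors.
for pid in "${AGENT_PIDS[@]}"; do      # AGENT_PIDS: assumed array of background PIDs
  if kill -0 "$pid" 2>/dev/null; then  # kill -0 checks existence without signalling
    kill "$pid"
  fi
done
```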
Running the functional test case on the release-3.4 branch, it outputs the error below:
{"level":"info","ts":1658578145.555051,"caller":"tester/stresser.go:56","msg":"creating stresser","type":"KV_TXN_WRITE_DELETE","weight":0.35,"endpoint":"127.0.0.1:3379"}
{"level":"info","ts":1658578145.555056,"caller":"tester/stresser.go:56","msg":"creating stresser","type":"LEASE","weight":0,"endpoint":"127.0.0.1:3379"}
{"level":"info","ts":1658578145.555061,"caller":"tester/cluster.go:298","msg":"added lease stresser","endpoint":"127.0.0.1:3379"}
{"level":"info","ts":1658578145.5550668,"caller":"tester/cluster.go:327","msg":"updated stressers"}
{"level":"info","ts":1658578145.555187,"caller":"tester/cluster.go:458","msg":"sent request","operation":"INITIAL_START_ETCD","to":"127.0.0.1:3379"}
{"level":"info","ts":1658578145.55519,"caller":"tester/cluster.go:458","msg":"sent request","operation":"INITIAL_START_ETCD","to":"127.0.0.1:1379"}
{"level":"info","ts":1658578145.5551932,"caller":"tester/cluster.go:458","msg":"sent request","operation":"INITIAL_START_ETCD","to":"127.0.0.1:2379"}
2022-07-23T20:09:05.555+0800 INFO agent agent/handler.go:40 handler success {"address": "127.0.0.1:39027", "operation": "INITIAL_START_ETCD"}
2022-07-23T20:09:05.555+0800 INFO agent agent/handler.go:40 handler success {"address": "127.0.0.1:29027", "operation": "INITIAL_START_ETCD"}
2022-07-23T20:09:05.555+0800 INFO agent agent/handler.go:40 handler success {"address": "127.0.0.1:19027", "operation": "INITIAL_START_ETCD"}
{"level":"info","ts":1658578145.555926,"caller":"tester/cluster.go:470","msg":"received response","operation":"INITIAL_START_ETCD","from":"127.0.0.1:3379","success":false,"status":"\"INITIAL_START_ETCD\" is not valid; last server operation was \"INITIAL_START_ETCD\""}
{"level":"info","ts":1658578145.5559459,"caller":"tester/cluster.go:470","msg":"received response","operation":"INITIAL_START_ETCD","from":"127.0.0.1:2379","success":false,"status":"\"INITIAL_START_ETCD\" is not valid; last server operation was \"INITIAL_START_ETCD\""}
{"level":"info","ts":1658578145.5559502,"caller":"tester/cluster.go:470","msg":"received response","operation":"INITIAL_START_ETCD","from":"127.0.0.1:1379","success":false,"status":"\"INITIAL_START_ETCD\" is not valid; last server operation was \"INITIAL_START_ETCD\""}
{"level":"fatal","ts":1658578145.55598,"caller":"etcd-tester/main.go:49","msg":"Bootstrap failed","error":"\"INITIAL_START_ETCD\" is not valid; last server operation was \"INITIAL_START_ETCD\", \"INITIAL_START_ETCD\" is not valid; last server operation was \"INITIAL_START_ETCD\", \"INITIAL_START_ETCD\" is not valid; last server operation was \"INITIAL_START_ETCD\"","stacktrace":"main.main\n\t/Users/xiaods/Documents/code/etcd/functional/cmd/etcd-tester/main.go:49\nruntime.main\n\t/usr/local/Cellar/go/1.17.6/libexec/src/runtime/proc.go:255"}
ETCD_TESTER_EXIT_CODE: 1
FAILED! 'tail -1000 /tmp/etcd-functional-1/etcd.log'
tail: cannot open '/tmp/etcd-functional-1/etcd.log' for reading: No such file or directory
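The "INITIAL_START_ETCD is not valid; last server operation was INITIAL_START_ETCD" error most likely means the agents still carry state from an earlier run, so they reject a second INITIAL_START_ETCD. Clearing that leftover state before re-running the tester should avoid it; a hedged sketch, assuming the usual binary names and that the data directories follow the /tmp/etcd-functional-1 pattern seen in the output above:

```bash
# Hedged sketch: reset leftover agent state before re-running the local
# functional test. Process names and the directory glob are assumptions.
pkill -f etcd-agent  || true
pkill -f etcd-tester || true
rm -rf /tmp/etcd-functional-*
```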
Here are some more examples of failure due to ETCD_TESTER_EXIT_CODE: 2 https://github.com/etcd-io/etcd/runs/7528996274?check_suite_focus=true https://github.com/etcd-io/etcd/runs/7443519017?check_suite_focus=true https://github.com/etcd-io/etcd/runs/7436236229?check_suite_focus=true
If nobody is looking at this, please assign it to me.
Still looking at this issue. Quick update: there are a couple of separate investigations:
- Two runs failed due to {"level":"warn","ts":1658383863.827886,"caller":"tester/cluster_run.go:50","msg":"round FAIL","round":0,"case":2,"case-total":22,"error":"injection error: cluster too slow: only 1961 commits in 6 retries"} in test case BLACKHOLE_PEER_PORT_TX_RX_LEADER_UNTIL_TRIGGER_SNAPSHOT: https://github.com/etcd-io/etcd/runs/7443519017?check_suite_focus=true and https://github.com/etcd-io/etcd/runs/7436236229?check_suite_focus=true. I was able to reproduce this locally, consistently on Mac and once in a Docker image. Looking into this more.
- This run failed due to {"level":"warn","ts":1658869799.140927,"caller":"tester/cluster_run.go:50","msg":"round FAIL","round":0,"case":20,"case-total":22,"error":"wait full health error: etcdserver: too many requests (\"127.0.0.1:1379\")"} in test case BLACKHOLE_PEER_PORT_TX_RX_LEADER. Haven't tried to reproduce this yet.
- There is a setup problem affecting all test cases/branches: SIGQUIT_ETCD_AND_REMOVE_DATA_AND_STOP_AGENT removes the tmp directories that record the cluster logs, so they are never appended to the test output here. I'll submit a fix for this independently of the prior two cases (a sketch follows this list).
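A minimal sketch of the kind of fix I have in mind, assuming the agent data directories follow the /tmp/etcd-functional-N pattern from the failure output above and that this runs right after a failed round, before the agents remove their directories (the exact hook point in ./test is not shown here):

```bash
# Hypothetical: dump whatever cluster logs still exist into the CI output
# before SIGQUIT_ETCD_AND_REMOVE_DATA_AND_STOP_AGENT deletes the data dirs.
for dir in /tmp/etcd-functional-*; do
  if [ -f "$dir/etcd.log" ]; then
    echo "===== $dir/etcd.log ====="
    tail -1000 "$dir/etcd.log"
  fi
done
```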
Let's close this issue.
Quick summary
- no solution for BLACKHOLE_PEER_PORT_TX_RX_LEADER_UNTIL_TRIGGER_SNAPSHOT (see note below).
- seems to be a one-off
- Filed https://github.com/etcd-io/etcd/issues/14384
BLACKHOLE_PEER_PORT_TX_RX_LEADER_UNTIL_TRIGGER_SNAPSHOT
On my local Mac, etcd_disk_wal_fsync_duration_seconds_bucket is quite high, which probably explains the slow cluster and why this test consistently fails for me locally. It completes only ~700 commits out of 2000.
Even if I remove the injected failure 'BLACKHOLE_PEER_PORT_TX_RX', this test never passes; the cluster is too slow for me locally.
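For anyone who wants to check this locally: etcd exposes these histogram buckets on its metrics endpoint, so something like the following shows the distribution (the 127.0.0.1:2379 client endpoint is taken from the logs above and may differ in your setup):

```bash
# Inspect WAL fsync latency buckets on a running member.
curl -s http://127.0.0.1:2379/metrics | grep etcd_disk_wal_fsync_duration_seconds
```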
From the GitHub Actions logs we can see that it got close to the 2000 commits needed to trigger a snapshot:
cluster too slow: only 1961 commits in 6 retries
cluster too slow: only 1967 commits in 6 retries
So my hypothesis is that it's a noisy-neighbor problem on the GitHub Actions VM, but I can't prove it; I would probably need to look at the metrics.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.