[3.4] ./test: line 229: kill: (7520) - No such process
I have seen this error multiple times in the 3.4 pipeline:
{"level":"info","ts":1658452719.9775898,"caller":"tester/cluster.go:585","msg":"closed tester HTTP server","tester-address":"127.0.0.1:9028"}
./test: line 229: kill: (7520) - No such process
./test: line 229: kill: (7521) - No such process
./test: line 229: kill: (7522) - No such process
ETCD_TESTER_EXIT_CODE: 2
FAILED! 'tail -1000 /tmp/etcd-functional-1/etcd.log'
tail: cannot open '/tmp/etcd-functional-1/etcd.log' for reading: No such file or directory
Error: Process completed with exit code 1.
Refer to https://github.com/etcd-io/etcd/runs/7460637442?check_suite_focus=true
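For what it's worth, the "No such process" lines themselves look harmless: the cleanup code in ./test presumably signals PIDs that have already exited. A guarded kill along the lines of the sketch below would silence that noise (this is a hypothetical illustration, not the actual script; the variable names are made up):

```bash
# Hypothetical cleanup loop: only signal processes that still exist,
# so already-exited PIDs do not produce "No such process" errors.
for pid in "${AGENT_PIDS[@]}"; do      # AGENT_PIDS: assumed array of background PIDs
  if kill -0 "$pid" 2>/dev/null; then  # kill -0 checks existence without signalling
    kill "$pid"
  fi
done
```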
Running the functional test case on the release-3.4 branch, it outputs the error below:
{"level":"info","ts":1658578145.555051,"caller":"tester/stresser.go:56","msg":"creating stresser","type":"KV_TXN_WRITE_DELETE","weight":0.35,"endpoint":"127.0.0.1:3379"}
{"level":"info","ts":1658578145.555056,"caller":"tester/stresser.go:56","msg":"creating stresser","type":"LEASE","weight":0,"endpoint":"127.0.0.1:3379"}
{"level":"info","ts":1658578145.555061,"caller":"tester/cluster.go:298","msg":"added lease stresser","endpoint":"127.0.0.1:3379"}
{"level":"info","ts":1658578145.5550668,"caller":"tester/cluster.go:327","msg":"updated stressers"}
{"level":"info","ts":1658578145.555187,"caller":"tester/cluster.go:458","msg":"sent request","operation":"INITIAL_START_ETCD","to":"127.0.0.1:3379"}
{"level":"info","ts":1658578145.55519,"caller":"tester/cluster.go:458","msg":"sent request","operation":"INITIAL_START_ETCD","to":"127.0.0.1:1379"}
{"level":"info","ts":1658578145.5551932,"caller":"tester/cluster.go:458","msg":"sent request","operation":"INITIAL_START_ETCD","to":"127.0.0.1:2379"}
2022-07-23T20:09:05.555+0800 INFO agent agent/handler.go:40 handler success {"address": "127.0.0.1:39027", "operation": "INITIAL_START_ETCD"}
2022-07-23T20:09:05.555+0800 INFO agent agent/handler.go:40 handler success {"address": "127.0.0.1:29027", "operation": "INITIAL_START_ETCD"}
2022-07-23T20:09:05.555+0800 INFO agent agent/handler.go:40 handler success {"address": "127.0.0.1:19027", "operation": "INITIAL_START_ETCD"}
{"level":"info","ts":1658578145.555926,"caller":"tester/cluster.go:470","msg":"received response","operation":"INITIAL_START_ETCD","from":"127.0.0.1:3379","success":false,"status":"\"INITIAL_START_ETCD\" is not valid; last server operation was \"INITIAL_START_ETCD\""}
{"level":"info","ts":1658578145.5559459,"caller":"tester/cluster.go:470","msg":"received response","operation":"INITIAL_START_ETCD","from":"127.0.0.1:2379","success":false,"status":"\"INITIAL_START_ETCD\" is not valid; last server operation was \"INITIAL_START_ETCD\""}
{"level":"info","ts":1658578145.5559502,"caller":"tester/cluster.go:470","msg":"received response","operation":"INITIAL_START_ETCD","from":"127.0.0.1:1379","success":false,"status":"\"INITIAL_START_ETCD\" is not valid; last server operation was \"INITIAL_START_ETCD\""}
{"level":"fatal","ts":1658578145.55598,"caller":"etcd-tester/main.go:49","msg":"Bootstrap failed","error":"\"INITIAL_START_ETCD\" is not valid; last server operation was \"INITIAL_START_ETCD\", \"INITIAL_START_ETCD\" is not valid; last server operation was \"INITIAL_START_ETCD\", \"INITIAL_START_ETCD\" is not valid; last server operation was \"INITIAL_START_ETCD\"","stacktrace":"main.main\n\t/Users/xiaods/Documents/code/etcd/functional/cmd/etcd-tester/main.go:49\nruntime.main\n\t/usr/local/Cellar/go/1.17.6/libexec/src/runtime/proc.go:255"}
ETCD_TESTER_EXIT_CODE: 1
FAILED! 'tail -1000 /tmp/etcd-functional-1/etcd.log'
tail: cannot open '/tmp/etcd-functional-1/etcd.log' for reading: No such file or directory
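The "INITIAL_START_ETCD is not valid; last server operation was INITIAL_START_ETCD" error most likely means the agents still carry state from an earlier run, so they reject a second INITIAL_START_ETCD. Clearing that leftover state before re-running the tester should avoid it; a hedged sketch, assuming the usual binary names and that the data directories follow the /tmp/etcd-functional-1 pattern seen in the output above:

```bash
# Hedged sketch: reset leftover agent state before re-running the local
# functional test. Process names and the directory glob are assumptions.
pkill -f etcd-agent  || true
pkill -f etcd-tester || true
rm -rf /tmp/etcd-functional-*
```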
Here are some more examples of failure due to ETCD_TESTER_EXIT_CODE: 2 https://github.com/etcd-io/etcd/runs/7528996274?check_suite_focus=true https://github.com/etcd-io/etcd/runs/7443519017?check_suite_focus=true https://github.com/etcd-io/etcd/runs/7436236229?check_suite_focus=true
If nobody is looking at this, please assign it to me.
Still looking at this issue. Quick update: there are a couple of separate investigations:
- Two runs failed due to {"level":"warn","ts":1658383863.827886,"caller":"tester/cluster_run.go:50","msg":"round FAIL","round":0,"case":2,"case-total":22,"error":"injection error: cluster too slow: only 1961 commits in 6 retries"} in test case BLACKHOLE_PEER_PORT_TX_RX_LEADER_UNTIL_TRIGGER_SNAPSHOT: https://github.com/etcd-io/etcd/runs/7443519017?check_suite_focus=true and https://github.com/etcd-io/etcd/runs/7436236229?check_suite_focus=true. I was able to reproduce this locally, consistently on Mac and once in a Docker image. Looking into this more.
- This run failed due to {"level":"warn","ts":1658869799.140927,"caller":"tester/cluster_run.go:50","msg":"round FAIL","round":0,"case":20,"case-total":22,"error":"wait full health error: etcdserver: too many requests (\"127.0.0.1:1379\")"} in test case BLACKHOLE_PEER_PORT_TX_RX_LEADER. Haven't tried to reproduce this yet.
- There is a setup problem affecting all test cases/branches: SIGQUIT_ETCD_AND_REMOVE_DATA_AND_STOP_AGENT removes the tmp directories that record the cluster logs, so they are never appended to the test output here. I'll submit a fix for this independently of the prior two cases (a sketch follows this list).
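A minimal sketch of the kind of fix I have in mind, assuming the agent data directories follow the /tmp/etcd-functional-N pattern from the failure output above and that this runs right after a failed round, before the agents remove their directories (the exact hook point in ./test is not shown here):

```bash
# Hypothetical: dump whatever cluster logs still exist into the CI output
# before SIGQUIT_ETCD_AND_REMOVE_DATA_AND_STOP_AGENT deletes the data dirs.
for dir in /tmp/etcd-functional-*; do
  if [ -f "$dir/etcd.log" ]; then
    echo "===== $dir/etcd.log ====="
    tail -1000 "$dir/etcd.log"
  fi
done
```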
Let's close this issue.
Quick summary
- no solution for BLACKHOLE_PEER_PORT_TX_RX_LEADER_UNTIL_TRIGGER_SNAPSHOT (see note below).
- seems to be a one-off
- Filed https://github.com/etcd-io/etcd/issues/14384
BLACKHOLE_PEER_PORT_TX_RX_LEADER_UNTIL_TRIGGER_SNAPSHOT
On my local Mac, etcd_disk_wal_fsync_duration_seconds_bucket is quite high, which probably explains the slow cluster and why this test consistently fails for me locally. It completes only ~700 commits out of 2000.
Even if I remove the injected failure 'BLACKHOLE_PEER_PORT_TX_RX', this test never passes; the cluster is too slow for me locally.
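For anyone who wants to check this locally: etcd exposes these histogram buckets on its metrics endpoint, so something like the following shows the distribution (the 127.0.0.1:2379 client endpoint is taken from the logs above and may differ in your setup):

```bash
# Inspect WAL fsync latency buckets on a running member.
curl -s http://127.0.0.1:2379/metrics | grep etcd_disk_wal_fsync_duration_seconds
```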
From the GitHub Actions logs we can see that it got close to the 2000 commits needed to trigger a snapshot:
cluster too slow: only 1961 commits in 6 retries
cluster too slow: only 1967 commits in 6 retries
So my hypothesis is that it's a noisy-neighbor problem on the GitHub Actions VM, but I can't prove it; I would probably need to look at the metrics.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.