Bor cannot be gracefully shut down when disconnected from Heimdall
System information
Tested with bor 1.1.0 and 1.2.3.
Overview of the problem
When bor is disconnected from Heimdall, it seems impossible to gracefully shut down bor. This issue can easily lead to a corrupted DB (killing bor with kill -9 generally results in "Head state missing, repairing" on the next start).
Reproduction Steps
Spin up a bor instance with an unreachable Heimdall URL:
docker run -it --rm 0xpolygon/bor:1.2.3 server --bor.heimdall=http://1.2.3.4:1234
Wait until block synchronization has started, indicated by this log line:
Block synchronization started
Then hit Ctrl-C or send signals:
docker exec -it <container_name> kill -INT 1
docker exec -it <container_name> kill -TERM 1
Logs / Traces / Output / Error Messages
Logs from bor after receiving an interrupt signal:
Caught signal: interrupt
Gracefully shutting down agent...
{"endpoint":"[::]:8545","lvl":"info","msg":"HTTP server stopped","t":"2024-01-25T11:51:04.015831701Z"}
{"endpoint":"[::]:8546","lvl":"info","msg":"HTTP server stopped","t":"2024-01-25T11:51:04.015918833Z"}
{"lvl":"info","msg":"IPC endpoint closed","t":"2024-01-25T11:51:04.016001352Z","url":"/data/bor.ipc"}
{"attempt":3,"lvl":"info","msg":"Retrying again in 5 seconds to fetch data from Heimdall","path":"/milestone/latest","t":"2024-01-25T11:51:08.100408984Z"}
{"attempt":3,"lvl":"info","msg":"Retrying again in 5 seconds to fetch data from Heimdall","path":"/checkpoints/latest","t":"2024-01-25T11:51:08.100450328Z"}
{"attempt":1,"error":"Get \"http://xxx/milestone/lastNoAck\": context deadline exceeded","lvl":"warn","msg":"an error while trying fetching from Heimdall","path":"/milestone/lastNoAck","t":"2024-01-25T11:51:08.100520499Z"}
{"attempt":1,"lvl":"info","msg":"Retrying again in 5 seconds to fetch data from Heimdall","path":"/milestone/lastNoAck","t":"2024-01-25T11:51:08.100554963Z"}
{"err":"context deadline exceeded","lvl":"eror","msg":"Failed to fetch latest no-ack milestone","t":"2024-01-25T11:51:08.100574545Z"}
...
I have run into the same issue several times. This is very problematic.
It shouldn't be the case, but we will check it.
This issue is stale because it has been open 14 days with no activity. Remove stale label or comment or this will be closed in 14 days.
It totally is the case. Please do not close this issue.
Experiencing the same issue
Hi, I am able to reproduce this issue locally. Will create a PR to fix it soon (will update here). Thanks for reporting.
The fix will ship in the next release.