go-spacemesh
go-spacemesh copied to clipboard
SIGKILL doesn't work while initializing tortoise
Description
The application freezes and won't exit cleanly after receiving SIGKILL. The last line in the log is "initializing tortoise".
Steps to reproduce
Start go-spacemesh. Wait a moment until this log line appears then try sending SIGKILL on Linux. (On my system this is managed by systemctl, see log below.)
Actual Behavior
Eventually the process gets killed using SIGTERM and thus doesn't exit cleanly (see log below)
Expected Behavior
The application should handle SIGKILL and exit gracefully
Environment
Please complete the following information:
- OS: Linux
- Node Version: 1.3.8
Additional Resources
Feb 08 22:01:08 lane-eth-mainnet-2 systemd[1]: Started Spacemesh.
Feb 08 22:01:08 lane-eth-mainnet-2 go-spacemesh[1697528]: 2024-02-08T22:01:08.992Z INFO App version: v1.3.8. Git: b03e89b - b03e89bf99632b71f497ae60c53169038a66fdbd . Go Version: go1.21.5. O
S: linux-amd64 . Genesis 0x9eebff023abb17ccb775c602daade8ed708f0a50
Feb 08 22:01:08 lane-eth-mainnet-2 go-spacemesh[1697528]: 2024-02-08T22:01:08.992Z INFO Welcome to Spacemesh. Spacemesh full node is starting...
Feb 08 22:01:08 lane-eth-mainnet-2 go-spacemesh[1697528]: 2024-02-08T22:01:08.992Z INFO Looking for identity file at `/home/lane/post/data/key.bin`
Feb 08 22:01:08 lane-eth-mainnet-2 go-spacemesh[1697528]: 2024-02-08T22:01:08.992Z INFO Loaded existing identity; public key: e2e8c45aabdd326b8a667aec7f959722dd13b2278e3d0c7a622cc2b3f5f4b9e7
Feb 08 22:01:08 lane-eth-mainnet-2 go-spacemesh[1697528]: 2024-02-08T22:01:08.992Z INFO starting spacemesh {"data-dir": "/home/lane/.spacemesh/data/7c8cef2b", "post-dir": "/home/lane/
post/data", "hostname": "lane-eth-mainnet-2", "name": ""}
Feb 08 22:01:08 lane-eth-mainnet-2 go-spacemesh[1697528]: 2024-02-08T22:01:08.992Z INFO e2e8c.clock converting genesis time to local time {"node_id": "e2e8c45aabdd326b8a667aec7
f959722dd13b2278e3d0c7a622cc2b3f5f4b9e7", "module": "clock", "genesis": "2023-07-14T08:00:00.000Z", "local": "2023-07-14T08:00:00.000Z", "name": "clock"}
Feb 08 22:01:08 lane-eth-mainnet-2 go-spacemesh[1697528]: 2024-02-08T22:01:08.992Z INFO e2e8c initializing p2p services {"node_id": "e2e8c45aabdd326b8a667aec7f959722dd13b2278e3
d0c7a622cc2b3f5f4b9e7"}
Feb 08 22:01:08 lane-eth-mainnet-2 go-spacemesh[1697528]: 2024-02-08T22:01:08.992Z INFO e2e8c.clock starting global clock {"node_id": "e2e8c45aabdd326b8a667aec7f959722dd13b2278
e3d0c7a622cc2b3f5f4b9e7", "module": "clock", "now": "2024-02-08T22:01:08.992Z", "genesis": "2023-07-14T08:00:00.000Z", "layer_duration": "5m0s", "tick_interval": "1s", "name": "clock"}
Feb 08 22:01:08 lane-eth-mainnet-2 go-spacemesh[1697528]: 2024-02-08T22:01:08.994Z WARN p2p-config config/config.go:304 rcmgr limit conflicts with connmgr limit: conn manager h
igh watermark limit: 100, exceeds the system connection limit of: 1 {"node_id": "e2e8c45aabdd326b8a667aec7f959722dd13b2278e3d0c7a622cc2b3f5f4b9e7", "module": "p2p"}
Feb 08 22:01:09 lane-eth-mainnet-2 go-spacemesh[1697528]: 2024-02-08T22:01:09.026Z INFO e2e8c.stateDbStore running migrations {"node_id": "e2e8c45aabdd326b8a667aec7f959722dd13b
2278e3d0c7a622cc2b3f5f4b9e7", "module": "stateDbStore", "uri": "file:/home/lane/.spacemesh/data/7c8cef2b/state.sql", "current version": 10, "target version": 10}
Feb 08 22:01:09 lane-eth-mainnet-2 go-spacemesh[1697528]: 2024-02-08T22:01:09.026Z INFO starting cache warmup
Feb 08 22:03:39 lane-eth-mainnet-2 go-spacemesh[1697528]: 2024-02-08T22:03:39.332Z INFO cache warmup {"duration": "2m30.305595024s", "name": ""}
Feb 08 22:03:39 lane-eth-mainnet-2 go-spacemesh[1697528]: 2024-02-08T22:03:39.332Z INFO e2e8c.cachedDB initialized datastore {"node_id": "e2e8c45aabdd326b8a667aec7f959722dd13b2
278e3d0c7a622cc2b3f5f4b9e7", "module": "cachedDB", "config": {"ATXSize":1000000,"MalfeasanceSize":1000}, "name": "cachedDB"}
Feb 08 22:03:39 lane-eth-mainnet-2 go-spacemesh[1697528]: 2024-02-08T22:03:39.339Z INFO e2e8c.stateDbStore running migrations {"node_id": "e2e8c45aabdd326b8a667aec7f959722dd13b
2278e3d0c7a622cc2b3f5f4b9e7", "module": "stateDbStore", "uri": "file:/home/lane/.spacemesh/data/7c8cef2b/node_state.sql", "current version": 2, "target version": 2} Feb 08 22:03:40 lane-eth-mainnet-2 go-spacemesh[1697528]: 2024-02-08T22:03:40.102Z INFO e2e8c.nipostValidator starting post verifier {"node_id": "e2e8c45aabdd326b8a667aec7f9597
22dd13b2278e3d0c7a622cc2b3f5f4b9e7", "module": "nipostValidator"} Feb 08 22:03:40 lane-eth-mainnet-2 go-spacemesh[1697528]: 2024-02-08T22:03:40.102Z INFO e2e8c.nipostValidator scaling post verifier {"node_id": "e2e8c45aabdd326b8a667aec7f95972
2dd13b2278e3d0c7a622cc2b3f5f4b9e7", "module": "nipostValidator", "current": 0, "new": 4}
Feb 08 22:03:40 lane-eth-mainnet-2 go-spacemesh[1697528]: 2024-02-08T22:03:40.102Z INFO e2e8c.nipostValidator started post verifier {"node_id": "e2e8c45aabdd326b8a667aec7f95972
2dd13b2278e3d0c7a622cc2b3f5f4b9e7", "module": "nipostValidator"} Feb 08 22:03:40 lane-eth-mainnet-2 go-spacemesh[1697528]: 2024-02-08T22:03:40.102Z INFO e2e8c.nipostValidator.worker-1 starting {"node_id": "e2e8c45aabdd326b8a667aec7f959722dd1
3b2278e3d0c7a622cc2b3f5f4b9e7", "module": "nipostValidator"} Feb 08 22:03:40 lane-eth-mainnet-2 go-spacemesh[1697528]: 2024-02-08T22:03:40.102Z INFO e2e8c.nipostValidator.worker-3 starting {"node_id": "e2e8c45aabdd326b8a667aec7f959722dd1
3b2278e3d0c7a622cc2b3f5f4b9e7", "module": "nipostValidator"} Feb 08 22:03:40 lane-eth-mainnet-2 go-spacemesh[1697528]: 2024-02-08T22:03:40.102Z INFO e2e8c.nipostValidator.worker-0 starting {"node_id": "e2e8c45aabdd326b8a667aec7f959722dd1
3b2278e3d0c7a622cc2b3f5f4b9e7", "module": "nipostValidator"} Feb 08 22:03:40 lane-eth-mainnet-2 go-spacemesh[1697528]: 2024-02-08T22:03:40.103Z INFO e2e8c.nipostValidator.worker-2 starting {"node_id": "e2e8c45aabdd326b8a667aec7f959722dd1
3b2278e3d0c7a622cc2b3f5f4b9e7", "module": "nipostValidator"} Feb 08 22:03:40 lane-eth-mainnet-2 go-spacemesh[1697528]: 2024-02-08T22:03:40.104Z INFO e2e8c.beacon registered signing key {"node_id": "e2e8c45aabdd326b8a667aec7f959722dd13b22
78e3d0c7a622cc2b3f5f4b9e7", "module": "beacon", "id": "e2e8c", "name": "beacon"} Feb 08 22:03:40 lane-eth-mainnet-2 go-spacemesh[1697528]: 2024-02-08T22:03:40.104Z INFO initializing tortoise
Feb 08 22:10:05 lane-eth-mainnet-2 systemd[1]: Stopping Spacemesh... Feb 08 22:11:35 lane-eth-mainnet-2 systemd[1]: spacemesh.service: State 'stop-sigterm' timed out. Killing.
Feb 08 22:11:35 lane-eth-mainnet-2 systemd[1]: spacemesh.service: Killing process 1697528 (go-spacemesh) with signal SIGKILL.
Feb 08 22:11:35 lane-eth-mainnet-2 systemd[1]: spacemesh.service: Killing process 1697529 (go-spacemesh) with signal SIGKILL.
Feb 08 22:11:35 lane-eth-mainnet-2 systemd[1]: spacemesh.service: Killing process 1697531 (n/a) with signal SIGKILL.
Feb 08 22:11:35 lane-eth-mainnet-2 systemd[1]: spacemesh.service: Killing process 1697534 (n/a) with signal SIGKILL.
Feb 08 22:11:35 lane-eth-mainnet-2 systemd[1]: spacemesh.service: Killing process 1697535 (n/a) with signal SIGKILL.
Feb 08 22:11:35 lane-eth-mainnet-2 systemd[1]: spacemesh.service: Killing process 1697537 (n/a) with signal SIGKILL.
Feb 08 22:11:35 lane-eth-mainnet-2 systemd[1]: spacemesh.service: Killing process 1697538 (n/a) with signal SIGKILL.
Feb 08 22:11:35 lane-eth-mainnet-2 systemd[1]: spacemesh.service: Killing process 1697539 (n/a) with signal SIGKILL.
Feb 08 22:11:35 lane-eth-mainnet-2 systemd[1]: spacemesh.service: Killing process 1697540 (go-spacemesh) with signal SIGKILL.
Feb 08 22:11:35 lane-eth-mainnet-2 systemd[1]: spacemesh.service: Killing process 1697541 (n/a) with signal SIGKILL.
Feb 08 22:11:35 lane-eth-mainnet-2 systemd[1]: spacemesh.service: Main process exited, code=killed, status=9/KILL
Feb 08 22:11:35 lane-eth-mainnet-2 systemd[1]: spacemesh.service: Failed with result 'timeout'.
Feb 08 22:11:35 lane-eth-mainnet-2 systemd[1]: Stopped Spacemesh.
Feb 08 22:11:35 lane-eth-mainnet-2 systemd[1]: spacemesh.service: Consumed 9min 45.965s CPU time.
i think the actual problem here is that it takes too long to recount votes. and the reason for that is the bug in how state was stored, so that verifying tortoise couldn't actually count votes correctly when tortoise was recovered from disk. that was fixed but not yet released, and it will take some time to update the state correctly.
we can check for context cancellation in the vote counting hot loop to make code interruptible regardless of such bugs.