go-spacemesh icon indicating copy to clipboard operation
go-spacemesh copied to clipboard

SIGKILL doesn't work while initializing tortoise

Open lrettig opened this issue 1 year ago • 1 comments

Description

The application freezes and won't exit cleanly after receiving SIGKILL. The last line in the log is "initializing tortoise".

Steps to reproduce

Start go-spacemesh. Wait a moment until this log line appears then try sending SIGKILL on Linux. (On my system this is managed by systemctl, see log below.)

Actual Behavior

Eventually the process gets killed using SIGTERM and thus doesn't exit cleanly (see log below)

Expected Behavior

The application should handle SIGKILL and exit gracefully

Environment

Please complete the following information:

  • OS: Linux
  • Node Version: 1.3.8

Additional Resources

Feb 08 22:01:08 lane-eth-mainnet-2 systemd[1]: Started Spacemesh.
Feb 08 22:01:08 lane-eth-mainnet-2 go-spacemesh[1697528]: 2024-02-08T22:01:08.992Z        INFO        App version: v1.3.8. Git: b03e89b - b03e89bf99632b71f497ae60c53169038a66fdbd . Go Version: go1.21.5. O
S: linux-amd64 . Genesis 0x9eebff023abb17ccb775c602daade8ed708f0a50
Feb 08 22:01:08 lane-eth-mainnet-2 go-spacemesh[1697528]: 2024-02-08T22:01:08.992Z        INFO        Welcome to Spacemesh. Spacemesh full node is starting...
Feb 08 22:01:08 lane-eth-mainnet-2 go-spacemesh[1697528]: 2024-02-08T22:01:08.992Z        INFO        Looking for identity file at `/home/lane/post/data/key.bin`
Feb 08 22:01:08 lane-eth-mainnet-2 go-spacemesh[1697528]: 2024-02-08T22:01:08.992Z        INFO        Loaded existing identity; public key: e2e8c45aabdd326b8a667aec7f959722dd13b2278e3d0c7a622cc2b3f5f4b9e7
Feb 08 22:01:08 lane-eth-mainnet-2 go-spacemesh[1697528]: 2024-02-08T22:01:08.992Z        INFO        starting spacemesh        {"data-dir": "/home/lane/.spacemesh/data/7c8cef2b", "post-dir": "/home/lane/
post/data", "hostname": "lane-eth-mainnet-2", "name": ""}
Feb 08 22:01:08 lane-eth-mainnet-2 go-spacemesh[1697528]: 2024-02-08T22:01:08.992Z        INFO        e2e8c.clock        converting genesis time to local time        {"node_id": "e2e8c45aabdd326b8a667aec7
f959722dd13b2278e3d0c7a622cc2b3f5f4b9e7", "module": "clock", "genesis": "2023-07-14T08:00:00.000Z", "local": "2023-07-14T08:00:00.000Z", "name": "clock"}
Feb 08 22:01:08 lane-eth-mainnet-2 go-spacemesh[1697528]: 2024-02-08T22:01:08.992Z        INFO        e2e8c        initializing p2p services        {"node_id": "e2e8c45aabdd326b8a667aec7f959722dd13b2278e3
d0c7a622cc2b3f5f4b9e7"}
Feb 08 22:01:08 lane-eth-mainnet-2 go-spacemesh[1697528]: 2024-02-08T22:01:08.992Z        INFO        e2e8c.clock        starting global clock        {"node_id": "e2e8c45aabdd326b8a667aec7f959722dd13b2278
e3d0c7a622cc2b3f5f4b9e7", "module": "clock", "now": "2024-02-08T22:01:08.992Z", "genesis": "2023-07-14T08:00:00.000Z", "layer_duration": "5m0s", "tick_interval": "1s", "name": "clock"}
Feb 08 22:01:08 lane-eth-mainnet-2 go-spacemesh[1697528]: 2024-02-08T22:01:08.994Z        WARN        p2p-config        config/config.go:304        rcmgr limit conflicts with connmgr limit: conn manager h
igh watermark limit: 100, exceeds the system connection limit of: 1        {"node_id": "e2e8c45aabdd326b8a667aec7f959722dd13b2278e3d0c7a622cc2b3f5f4b9e7", "module": "p2p"}
Feb 08 22:01:09 lane-eth-mainnet-2 go-spacemesh[1697528]: 2024-02-08T22:01:09.026Z        INFO        e2e8c.stateDbStore        running migrations        {"node_id": "e2e8c45aabdd326b8a667aec7f959722dd13b
2278e3d0c7a622cc2b3f5f4b9e7", "module": "stateDbStore", "uri": "file:/home/lane/.spacemesh/data/7c8cef2b/state.sql", "current version": 10, "target version": 10}
Feb 08 22:01:09 lane-eth-mainnet-2 go-spacemesh[1697528]: 2024-02-08T22:01:09.026Z        INFO        starting cache warmup
Feb 08 22:03:39 lane-eth-mainnet-2 go-spacemesh[1697528]: 2024-02-08T22:03:39.332Z        INFO        cache warmup        {"duration": "2m30.305595024s", "name": ""}
Feb 08 22:03:39 lane-eth-mainnet-2 go-spacemesh[1697528]: 2024-02-08T22:03:39.332Z        INFO        e2e8c.cachedDB        initialized datastore        {"node_id": "e2e8c45aabdd326b8a667aec7f959722dd13b2
278e3d0c7a622cc2b3f5f4b9e7", "module": "cachedDB", "config": {"ATXSize":1000000,"MalfeasanceSize":1000}, "name": "cachedDB"}
Feb 08 22:03:39 lane-eth-mainnet-2 go-spacemesh[1697528]: 2024-02-08T22:03:39.339Z        INFO        e2e8c.stateDbStore        running migrations        {"node_id": "e2e8c45aabdd326b8a667aec7f959722dd13b
2278e3d0c7a622cc2b3f5f4b9e7", "module": "stateDbStore", "uri": "file:/home/lane/.spacemesh/data/7c8cef2b/node_state.sql", "current version": 2, "target version": 2}                                        Feb 08 22:03:40 lane-eth-mainnet-2 go-spacemesh[1697528]: 2024-02-08T22:03:40.102Z        INFO        e2e8c.nipostValidator        starting post verifier        {"node_id": "e2e8c45aabdd326b8a667aec7f9597
22dd13b2278e3d0c7a622cc2b3f5f4b9e7", "module": "nipostValidator"}                                                                                                                                           Feb 08 22:03:40 lane-eth-mainnet-2 go-spacemesh[1697528]: 2024-02-08T22:03:40.102Z        INFO        e2e8c.nipostValidator        scaling post verifier        {"node_id": "e2e8c45aabdd326b8a667aec7f95972
2dd13b2278e3d0c7a622cc2b3f5f4b9e7", "module": "nipostValidator", "current": 0, "new": 4}
Feb 08 22:03:40 lane-eth-mainnet-2 go-spacemesh[1697528]: 2024-02-08T22:03:40.102Z        INFO        e2e8c.nipostValidator        started post verifier        {"node_id": "e2e8c45aabdd326b8a667aec7f95972
2dd13b2278e3d0c7a622cc2b3f5f4b9e7", "module": "nipostValidator"}                                                                                                                                            Feb 08 22:03:40 lane-eth-mainnet-2 go-spacemesh[1697528]: 2024-02-08T22:03:40.102Z        INFO        e2e8c.nipostValidator.worker-1        starting        {"node_id": "e2e8c45aabdd326b8a667aec7f959722dd1
3b2278e3d0c7a622cc2b3f5f4b9e7", "module": "nipostValidator"}                                                                                                                                                Feb 08 22:03:40 lane-eth-mainnet-2 go-spacemesh[1697528]: 2024-02-08T22:03:40.102Z        INFO        e2e8c.nipostValidator.worker-3        starting        {"node_id": "e2e8c45aabdd326b8a667aec7f959722dd1
3b2278e3d0c7a622cc2b3f5f4b9e7", "module": "nipostValidator"}                                                                                                                                                Feb 08 22:03:40 lane-eth-mainnet-2 go-spacemesh[1697528]: 2024-02-08T22:03:40.102Z        INFO        e2e8c.nipostValidator.worker-0        starting        {"node_id": "e2e8c45aabdd326b8a667aec7f959722dd1
3b2278e3d0c7a622cc2b3f5f4b9e7", "module": "nipostValidator"}                                                                                                                                                Feb 08 22:03:40 lane-eth-mainnet-2 go-spacemesh[1697528]: 2024-02-08T22:03:40.103Z        INFO        e2e8c.nipostValidator.worker-2        starting        {"node_id": "e2e8c45aabdd326b8a667aec7f959722dd1
3b2278e3d0c7a622cc2b3f5f4b9e7", "module": "nipostValidator"}                                                                                                                                                Feb 08 22:03:40 lane-eth-mainnet-2 go-spacemesh[1697528]: 2024-02-08T22:03:40.104Z        INFO        e2e8c.beacon        registered signing key        {"node_id": "e2e8c45aabdd326b8a667aec7f959722dd13b22
78e3d0c7a622cc2b3f5f4b9e7", "module": "beacon", "id": "e2e8c", "name": "beacon"}                                                                                                                            Feb 08 22:03:40 lane-eth-mainnet-2 go-spacemesh[1697528]: 2024-02-08T22:03:40.104Z        INFO        initializing tortoise
Feb 08 22:10:05 lane-eth-mainnet-2 systemd[1]: Stopping Spacemesh...                                                                                                                                        Feb 08 22:11:35 lane-eth-mainnet-2 systemd[1]: spacemesh.service: State 'stop-sigterm' timed out. Killing.
Feb 08 22:11:35 lane-eth-mainnet-2 systemd[1]: spacemesh.service: Killing process 1697528 (go-spacemesh) with signal SIGKILL.
Feb 08 22:11:35 lane-eth-mainnet-2 systemd[1]: spacemesh.service: Killing process 1697529 (go-spacemesh) with signal SIGKILL.
Feb 08 22:11:35 lane-eth-mainnet-2 systemd[1]: spacemesh.service: Killing process 1697531 (n/a) with signal SIGKILL.
Feb 08 22:11:35 lane-eth-mainnet-2 systemd[1]: spacemesh.service: Killing process 1697534 (n/a) with signal SIGKILL.
Feb 08 22:11:35 lane-eth-mainnet-2 systemd[1]: spacemesh.service: Killing process 1697535 (n/a) with signal SIGKILL.
Feb 08 22:11:35 lane-eth-mainnet-2 systemd[1]: spacemesh.service: Killing process 1697537 (n/a) with signal SIGKILL.
Feb 08 22:11:35 lane-eth-mainnet-2 systemd[1]: spacemesh.service: Killing process 1697538 (n/a) with signal SIGKILL.
Feb 08 22:11:35 lane-eth-mainnet-2 systemd[1]: spacemesh.service: Killing process 1697539 (n/a) with signal SIGKILL.
Feb 08 22:11:35 lane-eth-mainnet-2 systemd[1]: spacemesh.service: Killing process 1697540 (go-spacemesh) with signal SIGKILL.
Feb 08 22:11:35 lane-eth-mainnet-2 systemd[1]: spacemesh.service: Killing process 1697541 (n/a) with signal SIGKILL.
Feb 08 22:11:35 lane-eth-mainnet-2 systemd[1]: spacemesh.service: Main process exited, code=killed, status=9/KILL
Feb 08 22:11:35 lane-eth-mainnet-2 systemd[1]: spacemesh.service: Failed with result 'timeout'.
Feb 08 22:11:35 lane-eth-mainnet-2 systemd[1]: Stopped Spacemesh.
Feb 08 22:11:35 lane-eth-mainnet-2 systemd[1]: spacemesh.service: Consumed 9min 45.965s CPU time.

lrettig avatar Feb 08 '24 22:02 lrettig

i think the actual problem here is that it takes too long to recount votes. and the reason for that is the bug in how state was stored, so that verifying tortoise couldn't actually count votes correctly when tortoise was recovered from disk. that was fixed but not yet released, and it will take some time to update the state correctly.

we can check for context cancellation in the vote counting hot loop to make code interruptible regardless of such bugs.

dshulyak avatar Feb 19 '24 11:02 dshulyak