subnet-evm
subnet-evm copied to clipboard
DFK subnet evm cannot be shutdown gracefully
Describe the bug The DFK subnet-EVM process unclean shutdown will cause historical state regeneration. Regeneration will take a long time that downgrades the RPC service time.
To Reproduce systemctl stop avalanchego or docker stop avlalachego
Expected behavior The DFK process starts quickly and provides RPC services immediately.
Screenshots

Logs
[09-08|02:30:54.480] INFO <q2aTwKuyzgs8pynF7UXBZCU7DejbZbZ6EUyHr3JQzYgwNPUPi Chain> snowman/transitive.go:327 shutting down consensus engine
2022-09-08T02:30:54.579Z [ERROR] plugin process exited: path=/opt/apps/dfk/bin/plugins/mDV3QWRXfwgKUWb9sggkv4vQxAQR4y2CyKrt5pLZ5SzQ7EHBv pid=435501 error="signal: terminated"
[09-08|02:30:54.581] ERROR <q2aTwKuyzgs8pynF7UXBZCU7DejbZbZ6EUyHr3JQzYgwNPUPi Chain> handler/handler.go:775 failed while shutting down the chain {"error": "rpc error: code = Unavailable desc = error reading from server: read unix @->/tmp/plugin3795322057: read: connection reset by peer"}
INFO [09-08|03:13:33.071] <q2aTwKuyzgs8pynF7UXBZCU7DejbZbZ6EUyHr3JQzYgwNPUPi Chain> github.com/ava-labs/subnet-evm/eth/backend.go:191: Initialising Ethereum protocol network=53935 dbversion=8
INFO [09-08|03:13:33.076] <q2aTwKuyzgs8pynF7UXBZCU7DejbZbZ6EUyHr3JQzYgwNPUPi Chain> github.com/ava-labs/subnet-evm/core/blockchain.go:521: Loaded most recent local header number=6,934,933 hash=24c903..5bba0f age=3m34s
INFO [09-08|03:13:33.076] <q2aTwKuyzgs8pynF7UXBZCU7DejbZbZ6EUyHr3JQzYgwNPUPi Chain> github.com/ava-labs/subnet-evm/core/blockchain.go:522: Loaded most recent local full block number=6,934,933 hash=24c903..5bba0f age=3m34s
INFO [09-08|03:13:33.076] <q2aTwKuyzgs8pynF7UXBZCU7DejbZbZ6EUyHr3JQzYgwNPUPi Chain> github.com/ava-labs/subnet-evm/core/blockchain.go:1442: Loaded Acceptor tip hash=24c903..5bba0f
INFO [09-08|03:13:33.077] <q2aTwKuyzgs8pynF7UXBZCU7DejbZbZ6EUyHr3JQzYgwNPUPi Chain> github.com/ava-labs/subnet-evm/core/blockchain.go:1450: Skipping state reprocessing root=5f8cf5..52aabb
INFO [09-08|03:13:33.078] <q2aTwKuyzgs8pynF7UXBZCU7DejbZbZ6EUyHr3JQzYgwNPUPi Chain> github.com/ava-labs/subnet-evm/core/blockchain.go:1424: Initializing snapshots async=true rebuild=true headHash=24c903..5bba0f headRoot=5f8cf5..52aabb
INFO [09-08|03:13:33.079] <q2aTwKuyzgs8pynF7UXBZCU7DejbZbZ6EUyHr3JQzYgwNPUPi Chain> github.com/ava-labs/subnet-evm/core/blockchain.go:388: Starting Acceptor queue length=64
WARN [09-08|03:13:33.084] <q2aTwKuyzgs8pynF7UXBZCU7DejbZbZ6EUyHr3JQzYgwNPUPi Chain> github.com/ava-labs/subnet-evm/internal/shutdowncheck/shutdown_tracker.go:63: Old unclean shutdowns found count=6
WARN [09-08|03:13:33.084] <q2aTwKuyzgs8pynF7UXBZCU7DejbZbZ6EUyHr3JQzYgwNPUPi Chain> github.com/ava-labs/subnet-evm/internal/shutdowncheck/shutdown_tracker.go:67: Unclean shutdown detected booted=2022-09-03T08:04:04+0000 age=4d19h9m
WARN [09-08|03:13:33.084] <q2aTwKuyzgs8pynF7UXBZCU7DejbZbZ6EUyHr3JQzYgwNPUPi Chain> github.com/ava-labs/subnet-evm/internal/shutdowncheck/shutdown_tracker.go:67: Unclean shutdown detected booted=2022-09-04T22:39:26+0000 age=3d4h34m
WARN [09-08|03:13:33.084] <q2aTwKuyzgs8pynF7UXBZCU7DejbZbZ6EUyHr3JQzYgwNPUPi Chain> github.com/ava-labs/subnet-evm/internal/shutdowncheck/shutdown_tracker.go:67: Unclean shutdown detected booted=2022-09-06T15:25:33+0000 age=1d11h48m
WARN [09-08|03:13:33.084] <q2aTwKuyzgs8pynF7UXBZCU7DejbZbZ6EUyHr3JQzYgwNPUPi Chain> github.com/ava-labs/subnet-evm/internal/shutdowncheck/shutdown_tracker.go:67: Unclean shutdown detected booted=2022-09-07T01:38:40+0000 age=1d1h34m
WARN [09-08|03:13:33.084] <q2aTwKuyzgs8pynF7UXBZCU7DejbZbZ6EUyHr3JQzYgwNPUPi Chain> github.com/ava-labs/subnet-evm/internal/shutdowncheck/shutdown_tracker.go:67: Unclean shutdown detected booted=2022-09-07T10:34:09+0000 age=16h39m24s
WARN [09-08|03:13:33.084] <q2aTwKuyzgs8pynF7UXBZCU7DejbZbZ6EUyHr3JQzYgwNPUPi Chain> github.com/ava-labs/subnet-evm/internal/shutdowncheck/shutdown_tracker.go:67: Unclean shutdown detected booted=2022-09-07T10:56:20+0000 age=16h17m13s
WARN [09-08|03:13:33.084] <q2aTwKuyzgs8pynF7UXBZCU7DejbZbZ6EUyHr3JQzYgwNPUPi Chain> github.com/ava-labs/subnet-evm/internal/shutdowncheck/shutdown_tracker.go:67: Unclean shutdown detected booted=2022-09-07T15:25:31+0000 age=11h48m2s
WARN [09-08|03:13:33.084] <q2aTwKuyzgs8pynF7UXBZCU7DejbZbZ6EUyHr3JQzYgwNPUPi Chain> github.com/ava-labs/subnet-evm/internal/shutdowncheck/shutdown_tracker.go:67: Unclean shutdown detected booted=2022-09-08T00:14:09+0000 age=2h59m24s
WARN [09-08|03:13:33.084] <q2aTwKuyzgs8pynF7UXBZCU7DejbZbZ6EUyHr3JQzYgwNPUPi Chain> github.com/ava-labs/subnet-evm/internal/shutdowncheck/shutdown_tracker.go:67: Unclean shutdown detected booted=2022-09-08T01:09:13+0000 age=2h4m20s
WARN [09-08|03:13:33.084] <q2aTwKuyzgs8pynF7UXBZCU7DejbZbZ6EUyHr3JQzYgwNPUPi Chain> github.com/ava-labs/subnet-evm/internal/shutdowncheck/shutdown_tracker.go:67: Unclean shutdown detected booted=2022-09-08T02:26:12+0000 age=47m21s
Metrics If applicable, please include any metrics gathered from your node to assist us in diagnosing the problem.
Operating System Ubuntu 20.04 and Kubernetes env.
Additional context Add any other context about the problem here.
Avalanche Bug Bounty program can be found here.
This is a known issue. If the node does not shut down cleanly, this means that it does not have time to write the last accepted block's state to disk and will need to re-generate the last accepted block state from the most recent state that it does have on disk previously.
Could you provide more details about the state of your node before shutting it down as well as how you are terminating it?
This is a known issue. If the node does not shut down cleanly, this means that it does not have time to write the last accepted block's state to disk and will need to re-generate the last accepted block state from the most recent state that it does have on disk previously.
Could you provide more details about the state of your node before shutting it down as well as how you are terminating it?
Here is my startup.sh:
#!/usr/bin/env sh
BASE_DIR=/opt/apps/dfk
DB_DIR=/dfk/db
CHAIN_CONFIGS_DIR=${BASE_DIR}/configs/chains
BUILD_DIR=${BASE_DIR}/bin
LOG_DIR=/var/log/dfk
${BUILD_DIR}/avalanchego --db-dir $DB_DIR \
--whitelisted-subnets Vn3aX6hNRstj5VHHm63TCgPNaeGnRSqCYXQqemSqDd2TQH4qJ \
--http-host 0.0.0.0 --http-port 8545 \
--chain-config-dir $CHAIN_CONFIGS_DIR \
--build-dir $BUILD_DIR \
--log-dir $LOG_DIR \
--dynamic-public-ip opendns
And I use systemd to control the process life-cycle.
[Unit]
Description=avalanchego
StartLimitIntervalSec=500
StartLimitBurst=5
[Service]
Restart=on-failure
RestartSec=5s
WorkingDirectory=/opt/apps/dfk
ExecStartPre=/bin/chmod +x /opt/apps/dfk/start.sh
ExecStart=/bin/bash /opt/apps/dfk/start.sh
Type=simple
User=root
[Install]
WantedBy=multi-user.target
This can happen if the node times out while shutting down as mentioned above. Does this happen consistently for you?
cc @hexfusion looks like the signal terminated log is a result of the shutdown process timing out, such that AvalancheGo sends a kill signal. Could you confirm this is not an issue there?
This can happen if the node times out while shutting down as mentioned above. Does this happen consistently for you?
cc @hexfusion looks like the
signal terminatedlog is a result of the shutdown process timing out, such that AvalancheGo sends a kill signal. Could you confirm this is not an issue there?
Yep, it happens everytime we restart the node.
Following up here, are you still running into this issue? We've fixed a few bugs since this was last updated, please comment if this is still impacting you otherwise will close this ticket.
Hi @aaronbuchwald We have the exact same issue. Avalanchego version: 1.9.16 OS: Ubuntu 20.04 LTS Go: 1.19.7
Systemd unit file:
[Unit] Description=AVAX Node Documentation=https://docs.avax.network/build/tutorials/nodes-and-staking/run-avalanche-node Wants=network-online.target After=network-online.target
[Service]
Type=simple
User=avax
Group=avax
ExecReload=/bin/kill -HUP $MAINPID
ExecStart=
[Install] WantedBy=multi-user.target
systemctl stop avax; systemctl start avax Relevant q2aTwKuyzgs8pynF7UXBZCU7DejbZbZ6EUyHr3JQzYgwNPUPi.log entries: vm server: graceful termination success INFO [04-19|07:12:43.462] <q2aTwKuyzgs8pynF7UXBZCU7DejbZbZ6EUyHr3JQzYgwNPUPi Chain> github.com/ava-labs/subnet-evm/eth/backend.go:184: Initialising Ethereum protocol network=53935 dbversion=8 INFO [04-19|07:12:43.476] <q2aTwKuyzgs8pynF7UXBZCU7DejbZbZ6EUyHr3JQzYgwNPUPi Chain> github.com/ava-labs/subnet-evm/core/blockchain.go:715: Loaded most recent local header number=16,205,802 hash=6038f3..580bc2 age=2m19s INFO [04-19|07:12:43.476] <q2aTwKuyzgs8pynF7UXBZCU7DejbZbZ6EUyHr3JQzYgwNPUPi Chain> github.com/ava-labs/subnet-evm/core/blockchain.go:716: Loaded most recent local full block number=16,205,802 hash=6038f3..580bc2 age=2m19s INFO [04-19|07:12:43.484] <q2aTwKuyzgs8pynF7UXBZCU7DejbZbZ6EUyHr3JQzYgwNPUPi Chain> github.com/ava-labs/subnet-evm/core/blockchain.go:1744: Loaded Acceptor tip hash=261be6..71823a INFO [04-19|07:12:46.041] <q2aTwKuyzgs8pynF7UXBZCU7DejbZbZ6EUyHr3JQzYgwNPUPi Chain> github.com/ava-labs/subnet-evm/core/blockchain.go:1804: Re-executing blocks to generate state for last accepted block from=16,203,777 to=16,205,801 .... WARN [04-19|07:46:20.443] <q2aTwKuyzgs8pynF7UXBZCU7DejbZbZ6EUyHr3JQzYgwNPUPi Chain> github.com/ava-labs/subnet-evm/internal/shutdowncheck/shutdown_tracker.go:67: Unclean shutdown detected booted=2023-04-19T07:10:20+0200 age=36m .... PS: same for Swimmer subnet
Same issues on avalanchego 1.20.8 & subnet 0.5.3. @aaronbuchwald
Can you share the full logs directory of your node during shutdown?
Hi @aaronbuchwald ,
Most relevant line: Aug 28 18:58:49 <node_name> avax[1559572]: [08-28|18:58:49.035] WARN <q2aTwKuyzgs8pynF7UXBZCU7DejbZbZ6EUyHr3JQzYgwNPUPi Chain> router/chain_router.go:379 timed out while shutting down {"error": "context deadline exceeded"}
FIle attached with full logs. avax.log
PS: same issues on all 7 nodes with DFK. OS: Ubuntu 20.04 & 22.04. GO: 1.20.6 AVAX releases: 1.10.3 & 1.10.8 Subnet EVM releases: 0.5.1 & 0.5.3
Thanks for sharing, I see the interesting logs here:
Aug 28 18:57:49 <node_name> systemd[1]: Stopping AVAX Node...
Aug 28 18:57:49 <node_name> avax[1559572]: [08-28|18:57:49.034] INFO node/node.go:1490 shutting down node {"exitCode": 0}
Aug 28 18:57:49 <node_name> avax[1559572]: runtime engine: received shutdown signal: SIGTERM
Aug 28 18:57:49 <node_name> avax[1559572]: runtime engine: received shutdown signal: SIGTERM
Aug 28 18:57:49 <node_name> avax[1559572]: [08-28|18:57:49.034] INFO health/worker.go:111 registered new check and initialized its state to failing {"namespace": "health", "name": "shuttingDown", "tags": ["application"]}
Aug 28 18:57:49 <node_name> avax[1559572]: [08-28|18:57:49.035] INFO ipcs/chainipc.go:123 shutting down chain IPCs
Aug 28 18:57:49 <node_name> avax[1559572]: [08-28|18:57:49.035] INFO chains/manager.go:1412 shutting down chain manager
Aug 28 18:57:49 <node_name> avax[1559572]: [08-28|18:57:49.035] INFO chains/manager.go:1405 stopping chain creator
Aug 28 18:57:49 <node_name> avax[1559572]: [08-28|18:57:49.035] INFO router/chain_router.go:361 shutting down chain router
Aug 28 18:57:49 <node_name> avax[1559572]: [08-28|18:57:49.035] INFO <C Chain> snowman/transitive.go:339 shutting down consensus engine
Aug 28 18:57:49 <node_name> avax[1559572]: [08-28|18:57:49.035] INFO <X Chain> snowman/transitive.go:339 shutting down consensus engine
Aug 28 18:57:49 <node_name> avax[1559572]: [08-28|18:57:49.035] INFO <P Chain> snowman/transitive.go:339 shutting down consensus engine
Aug 28 18:57:49 <node_name> avax[1559572]: INFO [08-28|18:57:49.036] <C Chain> github.com/ava-labs/coreth/core/txpool/txpool.go:469: Transaction pool stopped
Aug 28 18:57:49 <node_name> avax[1559572]: INFO [08-28|18:57:49.036] <C Chain> github.com/ava-labs/coreth/core/blockchain.go:925: Closing quit channel
Aug 28 18:57:49 <node_name> avax[1559572]: INFO [08-28|18:57:49.036] <C Chain> github.com/ava-labs/coreth/core/blockchain.go:928: Stopping Acceptor
Aug 28 18:57:49 <node_name> avax[1559572]: INFO [08-28|18:57:49.036] <C Chain> github.com/ava-labs/coreth/core/blockchain.go:931: Acceptor queue drained t="5.35µs"
Aug 28 18:57:49 <node_name> avax[1559572]: INFO [08-28|18:57:49.036] <C Chain> github.com/ava-labs/coreth/core/blockchain.go:934: Shutting down sender cacher
Aug 28 18:57:49 <node_name> avax[1559572]: INFO [08-28|18:57:49.036] <C Chain> github.com/ava-labs/coreth/core/blockchain.go:938: Closing scope
Aug 28 18:57:49 <node_name> avax[1559572]: INFO [08-28|18:57:49.036] <C Chain> github.com/ava-labs/coreth/core/blockchain.go:942: Waiting for background processes to complete
Aug 28 18:57:49 <node_name> avax[1559572]: INFO [08-28|18:57:49.036] <C Chain> github.com/ava-labs/coreth/core/blockchain.go:951: Shutting down state manager
Aug 28 18:57:49 <node_name> avax[1559572]: runtime engine: received shutdown signal: SIGTERM
Aug 28 18:57:49 <node_name> avax[1559572]: vm server: graceful termination success
Aug 28 18:57:49 <node_name> avax[1559572]: [08-28|18:57:49.039] INFO subprocess/runtime.go:111 stdout collector shutdown
Aug 28 18:57:49 <node_name> avax[1559572]: [08-28|18:57:49.039] INFO subprocess/runtime.go:124 stderr collector shutdown
Aug 28 18:57:49 <node_name> avax[1559572]: vm server: graceful termination success
Aug 28 18:57:49 <node_name> avax[1559572]: [08-28|18:57:49.042] INFO subprocess/runtime.go:111 stdout collector shutdown
Aug 28 18:57:49 <node_name> avax[1559572]: [08-28|18:57:49.042] INFO subprocess/runtime.go:124 stderr collector shutdown
Aug 28 18:57:49 <node_name> avax[1559572]: [08-28|18:57:49.064] INFO <P Chain> router/chain_router.go:383 chain shutdown {"shutdownDuration": "28.740177ms"}
Aug 28 18:57:50 <node_name> avax[1559572]: INFO [08-28|18:57:50.592] <q2aTwKuyzgs8pynF7UXBZCU7DejbZbZ6EUyHr3JQzYgwNPUPi Chain> github.com/ava-labs/subnet-evm/rpc/handler.go:526: Served eth_estimateGas reqid=512988 execTime=238.072141ms procTime=238.0772ms totalTime=238.077452ms err="execution reverted: not enough stamina" errdata=0x08c379a0000000000000000000000000000000000000000000000000000000000000002000000000000000000000000000000000000000000000000000000000000000126e6f7420656e6f756768207374616d696e610000000000000000000000000000
Aug 28 18:57:51 <node_name> avax[1559572]: vm server: graceful termination success
Aug 28 18:57:51 <node_name> avax[1559572]: [08-28|18:57:51.582] INFO <q2aTwKuyzgs8pynF7UXBZCU7DejbZbZ6EUyHr3JQzYgwNPUPi Chain> subprocess/runtime.go:111 stdout collector shutdown
Aug 28 18:57:51 <node_name> avax[1559572]: [08-28|18:57:51.582] INFO <q2aTwKuyzgs8pynF7UXBZCU7DejbZbZ6EUyHr3JQzYgwNPUPi Chain> subprocess/runtime.go:124 stderr collector shutdown
Aug 28 18:57:51 <node_name> avax[1559572]: INFO [08-28|18:57:51.715] <C Chain> github.com/ava-labs/coreth/trie/triedb/hashdb/database.go:519: Persisted trie from memory database nodes=197,463 size=58.60MiB time=2.67884263s gcnodes=591,688 gcsize=229.20MiB gctime=2.723029347s livenodes=6707 livesize=2.64MiB
Aug 28 18:57:51 <node_name> avax[1559572]: INFO [08-28|18:57:51.715] <C Chain> github.com/ava-labs/coreth/core/blockchain.go:956: State manager shut down t=2.678934081s
Aug 28 18:57:51 <node_name> avax[1559572]: INFO [08-28|18:57:51.715] <C Chain> github.com/ava-labs/coreth/core/blockchain.go:962: Blockchain stopped
Aug 28 18:57:51 <node_name> avax[1559572]: [08-28|18:57:51.718] INFO <C Chain> router/chain_router.go:383 chain shutdown {"shutdownDuration": "2.683030355s"}
Aug 28 18:57:51 <node_name> avax[1559572]: [08-28|18:57:51.718] INFO <X Chain> router/chain_router.go:383 chain shutdown {"shutdownDuration": "85.179µs"}
Aug 28 18:58:49 <node_name> avax[1559572]: [08-28|18:58:49.035] WARN <q2aTwKuyzgs8pynF7UXBZCU7DejbZbZ6EUyHr3JQzYgwNPUPi Chain> router/chain_router.go:379 timed out while shutting down {"error": "context deadline exceeded"}
Aug 28 18:58:49 <node_name> avax[1559572]: [08-28|18:58:49.036] INFO network/network.go:1306 shutting down the p2p networking
Aug 28 18:58:59 <node_name> avax[1559572]: [08-28|18:58:59.066] INFO node/node.go:1543 cleaning up plugin runtimes
Aug 28 18:58:59 <node_name> avax[1559572]: [08-28|18:58:59.069] INFO node/node.go:1565 finished node shutdown
Aug 28 18:58:59 <node_name> avax[1559572]: [08-28|18:58:59.070] INFO nat/nat.go:178 Unmapped all ports
Aug 28 18:58:59 <node_name> systemd[1]: avax.service: Succeeded.
Aug 28 18:58:59 <node_name> systemd[1]: Stopped AVAX Node.
After the signal to shutdown is received, the X, P, and C chains all shut down, but there are no further logs from the VM.
It should be the case that Shutdown is called on the VM, which emits further logs: https://github.com/ava-labs/subnet-evm/blob/master/plugin/evm/vm.go#L642. @hexfusion do the RPCChainVM logs look as expected to you here? I'm curious if shutdown is called on the underlying Subnet-EVM instance.
Hi we have reproduced the issue on the ubuntu machine using systemctl and see the same results as you. It seems that systemctl sends a SIGTERM signal rather than a SIGINT signal. We are putting in a fix for this right now in the mean time you can edit your unit file to have this line KillSignal=SIGINT to look something like this @andreicupaciu
[Unit]
Description=Avalanche go client
After=syslog.target network.target
[Service]
User=avalanche
Type=simple
Environment=HOME=/home/avalanche
ExecStart=/usr/local/bin/avalanchego --config-file /etc/avalanchego/conf.json
KillMode=process
KillSignal=SIGINT
TimeoutStopSec=90
Restart=on-failure
RestartSec=10s
LimitNOFILE=65000
[Install]
WantedBy=multi-user.target
This was fixed in #853 . Please let us know if you see a similar issue and thanks for reporting!