cosmos-sdk
cosmos-sdk copied to clipboard
[Bug]: [Cosmovisor] Upgrade is applied immediately when using the --upgrade-height flag
Is there an existing issue for this?
- [X] I have searched the existing issues
What happened?
Cosmovisor restarts with the new binary immediately after running add-upgrade with the --upgrade-height flag.
Cosmos SDK Version
Cosmovisor v1.5.0
How to reproduce?
- Start a Gaia v13 chain with Cosmovisor v1.5.0
- Download a Gaia v14 binary to
gaiad - Run
cosmovisor add-upgrade v14 gaiad --upgrade-height 300at any point before reaching block 300. - The node will start right away with the downloaded binary.
I used this script to run steps 2 and 3:
#!/bin/bash
export DAEMON_NAME=gaiad
export DAEMON_HOME=/home/gaia/.gaia
export DAEMON_DATA_BACKUP_DIR=/home/gaia/backup
height=$(curl -s http://localhost:26657/abci_info | jq -r '.result.response.last_block_height')
version=$(curl -s http://localhost:26657/abci_info | jq -r '.result.response.version')
echo "The node is running $version at height $height."
wget https://github.com/cosmos/gaia/releases/download/v14.0.0-rc0/gaiad-v14.0.0-rc0-linux-amd64 -q -O gaiad
chmod +x gaiad
command="cosmovisor add-upgrade v14 gaiad --upgrade-height 300"
echo "$command"
$command
echo "Waiting ten seconds..."
sleep 10
height=$(curl -s http://localhost:26657/abci_info | jq -r '.result.response.last_block_height')
version=$(curl -s http://localhost:26657/abci_info | jq -r '.result.response.version')
echo "The node is running $version at height $height."
Upgrade info:
cat ~/.gaia/cosmovisor/upgrades/v14/upgrade-info.json
{"name":"v14","time":"0001-01-01T00:00:00Z","height":300}
Pre-upgrade cosmos_sdk_version: v0.45.16 (Gaia v13.0.2)
Post-upgrade cosmos_sdk_version: v0.45.16 (Gaia v14.0.0-rc0)
Can you dump the upgrade info here too? And the sdk versions as well.
@julienrbrt Upgrade info:
cat ~/.gaia/cosmovisor/upgrades/v14/upgrade-info.json
{"name":"v14","time":"0001-01-01T00:00:00Z","height":300}
Pre-upgrade cosmos_sdk_version: v0.45.16 (Gaia v13.0.2)
Post-upgrade cosmos_sdk_version: v0.45.16 (Gaia v14.0.0-rc0)
Thanks, last question, what gives gaiad status just before add-upgrade ?
@julienrbrt gaiad status returns this before running add-upgrade:
{"NodeInfo":{"protocol_version":{"p2p":"8","block":"11","app":"0"},"id":"af3e918c3e2533ceb3be8c48c3068bf6682fb67c","listen_addr":"tcp://0.0.0.0:26656","network":"testnet","version":"0.34.29","channels":"40202122233038606100","moniker":"cosmos-node","other":{"tx_index":"on","rpc_address":"tcp://0.0.0.0:26657"}},"SyncInfo":{"latest_block_hash":"760D16D9D591C769184A2A4F659AB730C80494BD84633FC0451EFABE524307F4","latest_app_hash":"67F695FF4B37D5D12F9A64B83F32679DE289F3AC3A62427CC2EEFC84B61C1634","latest_block_height":"8","latest_block_time":"2024-01-24T19:12:39.769313361Z","earliest_block_hash":"067188F2ADF5931924FD7562506F7DF457C2329F2A1AF00BB777E12FEA273A29","earliest_app_hash":"E3B0C44298FC1C149AFBF4C8996FB92427AE41E4649B934CA495991B7852B855","earliest_block_height":"1","earliest_block_time":"2024-01-24T17:08:52.990568831Z","catching_up":false},"ValidatorInfo":{"Address":"51A4AFE4B661E141691029CABF023E3C976FF28E","PubKey":{"type":"tendermint/PubKeyEd25519","value":"kkk4KLWvqb8AKcJJcO5VpBvbOqfds7XqqZ6HRxzZPz0="},"VotingPower":"8000"}}
Weird, then it should be parsed properly: https://github.com/cosmos/cosmos-sdk/blob/main/tools/cosmovisor/scanner.go#167
I also got bitten by that issue!
I also got bitten by that issue!
:/ I am re-prioritizing cosmovisor work. I'll re-investigate this.
I am observing the same behaviour with dydxv4 cosmos scheduled for Jul 25th 2024, 07:50:07+00:00 UTC (in 10 hours) Anyone happened to know if downgrade is an option?
I have encountered this issue with the Warden testnet as well. command used:
DAEMON_NAME=wardend
DAEMON_HOME=/home/warden/.warden
UPGRADE_HEIGHT=1534500
rm -rf ${HOME}/upgrade
mkdir ${HOME}/upgrade
cd ${HOME}/upgrade
wget https://github.com/warden-protocol/wardenprotocol/releases/download/v0.4.0/wardend_Linux_x86_64.zip
unzip wardend_Linux_x86_64.zip
rm wardend_Linux_x86_64.zip CHANGELOG.md LICENSE README.md
chmod +x ${HOME}/upgrade/wardend
cosmovisor add-upgrade "v040" ${HOME}/upgrade/wardend --upgrade-height ${UPGRADE_HEIGHT}
After executing cosmovisor add-upgrade, the node immediately restarts and uses the new binary, resulting in a crash.
I faced this yesterday on Neutron emergency upgrade on 2 nodes. Might be because I used --force flag as there was already an upgrade-info.json file there.
@julienrbrt can we get this prioritised please? as this is quite dangerous
Seeing this same issue on Cosmos Provider testnet, the issue seems to be random as I have an identical node both used with horcrux one worked fine the other tried applying the upgrade immediately.
experiencing the same issue on current theta and provider testnet.
Looking at this again right now!
@julienrbrt I see a potential problem here, in scanner.go
result, err := exec.Command(fw.currentBin, "status").Output() //nolint:gosec // we want to execute the status command
if err != nil {
return 0, err
}
// file exist but too early in height
currentHeight, _ := fw.checkHeight()
if currentHeight != 0 && currentHeight < info.Height {
return false
}
Avoid launching the node binary like that as much as possible, especially without any command-line options. A more thorough way to get data from the running node is to parse config/config.toml once, find its current RPC endpoint, and send an RPC request each time you need to get the current block height. E.g. some of my shell code:
# Function to get the local HTTP RPC endpoint from the node's configuration
# TODO: This is currently limited to Neutron. Cosmos SDK exposes a different "config" command (`config get config rpc.laddr`), will need to change this function to recognize that
get_local_rpc_endpoint() { "$NODE_BINARY" config --home "$NODE_DIR" | jq -er .node | sed s/tcp/http/; }
# Function to send a call to the HTTP RPC endpoint
call_rpc_endpoint() { curl -s --connect-timeout 5 --max-time 20 --retry 5 --retry-delay 0 --retry-max-time 5 "$1/$2" | jq -er .result; }
# Function to get the latest block height via HTTP RPC
get_latest_block_height() { call_rpc_endpoint "$1" "status?" | jq -er .sync_info.latest_block_height; }
# Function to check if we have proper access to the local node
validate_local_node() {
LOCAL_RPC_ENDPOINT=""
# Check if the node's binary is readable and executable and if the node's directory exists and is readable
if [ -d "$NODE_DIR" ] && [ -r "$NODE_DIR" ] && [ -x "$NODE_BINARY" ] && [ -r "$NODE_BINARY" ]; then
LOCAL_RPC_ENDPOINT=$(get_local_rpc_endpoint); [ -z "$LOCAL_RPC_ENDPOINT" ] && print_error_and_exit "node_error"
return 0;
else return 1; fi
}
# Main function to compare heights between the local node and the remote RPC
compare_heights() {
local_height=$(get_latest_block_height "$LOCAL_RPC_ENDPOINT")
remote_height=$(get_latest_block_height "$REMOTE_RPC_ENDPOINT")
{ [ -z "$local_height" ] || [ -z "$remote_height" ]; } && print_error_and_exit "curl_error"
delta=$((remote_height - local_height))
[ "$delta" -le "$NODE_LAG" ] && \
{ printf "Node %s is not lagging, delta with %s is %s, local height is %s, remote height is %s\n" \
"$LOCAL_RPC_ENDPOINT" "$REMOTE_RPC_ENDPOINT" "$delta" "$local_height" "$remote_height"; exit 0; } || \
{ printf "Node %s is lagging, delta with %s is %s, local height is %, remote height is %s!\n" \
"$LOCAL_RPC_ENDPOINT" "$REMOTE_RPC_ENDPOINT" "$delta" "$local_height" "$remote_height"; exit 1; }
}
This is significantly faster because it avoids launching another executable, even if for a moment (I'm still doing that but you can use something like a mandatory DAEMON_RPC_ENDPOINT variable instead). It is compatible with nodes that run the RPC server on a different port, which gaiad status won't pick up by itself without --node. If there are multiple nodes on one machine, this code can end up requesting the status of the node that runs on the default RPC port and using that value of latest_block_height to check if the upgrade should be applied immediately to a different node. It also avoids possible side effects related to outputs and exit codes of the status command. For example, in some earlier versions of CosmosSDK I noticed that this command was printing its output to stderr, not stdout, as did the main logger. Finally, it could resolve your TODO problem related to test binaries :) Of course, I don't know if either of these situations is the culprit behind this specific problem, but refactoring this part of the code to use an RPC call is a good idea anyway and will aid in any further debugging.
@julienrbrt when can we expect cosmovisor v1.6.0 which I assume will include that fix?
@julienrbrt when can we expect cosmovisor v1.6.0 which I assume will include that fix?
Hey, now: https://github.com/cosmos/cosmos-sdk/releases/tag/tools%2Fcosmovisor%2Fv1.6.0