cosmos-sdk icon indicating copy to clipboard operation
cosmos-sdk copied to clipboard

[Bug]: [Cosmovisor] Upgrade is applied immediately when using the --upgrade-height flag

Open dasanchez opened this issue 1 year ago • 6 comments

Is there an existing issue for this?

  • [X] I have searched the existing issues

What happened?

Cosmovisor restarts with the new binary immediately after running add-upgrade with the --upgrade-height flag.

Cosmos SDK Version

Cosmovisor v1.5.0

How to reproduce?

  1. Start a Gaia v13 chain with Cosmovisor v1.5.0
  2. Download a Gaia v14 binary to gaiad
  3. Run cosmovisor add-upgrade v14 gaiad --upgrade-height 300 at any point before reaching block 300.
  4. The node will start right away with the downloaded binary.

I used this script to run steps 2 and 3:

#!/bin/bash

export DAEMON_NAME=gaiad
export DAEMON_HOME=/home/gaia/.gaia
export DAEMON_DATA_BACKUP_DIR=/home/gaia/backup
height=$(curl -s http://localhost:26657/abci_info | jq -r '.result.response.last_block_height')
version=$(curl -s http://localhost:26657/abci_info | jq -r '.result.response.version')
echo "The node is running $version at height $height."
wget https://github.com/cosmos/gaia/releases/download/v14.0.0-rc0/gaiad-v14.0.0-rc0-linux-amd64 -q -O gaiad
chmod +x gaiad
command="cosmovisor add-upgrade v14 gaiad --upgrade-height 300"
echo "$command"
$command
echo "Waiting ten seconds..."
sleep 10
height=$(curl -s http://localhost:26657/abci_info | jq -r '.result.response.last_block_height')
version=$(curl -s http://localhost:26657/abci_info | jq -r '.result.response.version')
echo "The node is running $version at height $height."

Upgrade info:

cat ~/.gaia/cosmovisor/upgrades/v14/upgrade-info.json 
{"name":"v14","time":"0001-01-01T00:00:00Z","height":300}

Pre-upgrade cosmos_sdk_version: v0.45.16 (Gaia v13.0.2) Post-upgrade cosmos_sdk_version: v0.45.16 (Gaia v14.0.0-rc0)

dasanchez avatar Jan 24 '24 18:01 dasanchez

Can you dump the upgrade info here too? And the sdk versions as well.

julienrbrt avatar Jan 24 '24 18:01 julienrbrt

@julienrbrt Upgrade info:

cat ~/.gaia/cosmovisor/upgrades/v14/upgrade-info.json 
{"name":"v14","time":"0001-01-01T00:00:00Z","height":300}

Pre-upgrade cosmos_sdk_version: v0.45.16 (Gaia v13.0.2) Post-upgrade cosmos_sdk_version: v0.45.16 (Gaia v14.0.0-rc0)

dasanchez avatar Jan 24 '24 18:01 dasanchez

Thanks, last question, what gives gaiad status just before add-upgrade ?

julienrbrt avatar Jan 24 '24 19:01 julienrbrt

@julienrbrt gaiad status returns this before running add-upgrade:

{"NodeInfo":{"protocol_version":{"p2p":"8","block":"11","app":"0"},"id":"af3e918c3e2533ceb3be8c48c3068bf6682fb67c","listen_addr":"tcp://0.0.0.0:26656","network":"testnet","version":"0.34.29","channels":"40202122233038606100","moniker":"cosmos-node","other":{"tx_index":"on","rpc_address":"tcp://0.0.0.0:26657"}},"SyncInfo":{"latest_block_hash":"760D16D9D591C769184A2A4F659AB730C80494BD84633FC0451EFABE524307F4","latest_app_hash":"67F695FF4B37D5D12F9A64B83F32679DE289F3AC3A62427CC2EEFC84B61C1634","latest_block_height":"8","latest_block_time":"2024-01-24T19:12:39.769313361Z","earliest_block_hash":"067188F2ADF5931924FD7562506F7DF457C2329F2A1AF00BB777E12FEA273A29","earliest_app_hash":"E3B0C44298FC1C149AFBF4C8996FB92427AE41E4649B934CA495991B7852B855","earliest_block_height":"1","earliest_block_time":"2024-01-24T17:08:52.990568831Z","catching_up":false},"ValidatorInfo":{"Address":"51A4AFE4B661E141691029CABF023E3C976FF28E","PubKey":{"type":"tendermint/PubKeyEd25519","value":"kkk4KLWvqb8AKcJJcO5VpBvbOqfds7XqqZ6HRxzZPz0="},"VotingPower":"8000"}}

dasanchez avatar Jan 24 '24 19:01 dasanchez

Weird, then it should be parsed properly: https://github.com/cosmos/cosmos-sdk/blob/main/tools/cosmovisor/scanner.go#167

julienrbrt avatar Jan 24 '24 19:01 julienrbrt

I also got bitten by that issue!

fmorency avatar May 15 '24 15:05 fmorency

I also got bitten by that issue!

:/ I am re-prioritizing cosmovisor work. I'll re-investigate this.

julienrbrt avatar May 30 '24 14:05 julienrbrt

I am observing the same behaviour with dydxv4 cosmos scheduled for Jul 25th 2024, 07:50:07+00:00 UTC (in 10 hours) Anyone happened to know if downgrade is an option?

steven-varga avatar Jul 24 '24 21:07 steven-varga

I have encountered this issue with the Warden testnet as well. command used:

DAEMON_NAME=wardend
DAEMON_HOME=/home/warden/.warden
UPGRADE_HEIGHT=1534500

rm -rf ${HOME}/upgrade
mkdir ${HOME}/upgrade
cd ${HOME}/upgrade
wget https://github.com/warden-protocol/wardenprotocol/releases/download/v0.4.0/wardend_Linux_x86_64.zip
unzip wardend_Linux_x86_64.zip
rm wardend_Linux_x86_64.zip CHANGELOG.md LICENSE README.md
chmod +x ${HOME}/upgrade/wardend
cosmovisor add-upgrade "v040" ${HOME}/upgrade/wardend --upgrade-height ${UPGRADE_HEIGHT}

After executing cosmovisor add-upgrade, the node immediately restarts and uses the new binary, resulting in a crash.

BillyJunID avatar Jul 29 '24 08:07 BillyJunID

I faced this yesterday on Neutron emergency upgrade on 2 nodes. Might be because I used --force flag as there was already an upgrade-info.json file there.

@julienrbrt can we get this prioritised please? as this is quite dangerous

freak12techno avatar Jul 31 '24 09:07 freak12techno

Seeing this same issue on Cosmos Provider testnet, the issue seems to be random as I have an identical node both used with horcrux one worked fine the other tried applying the upgrade immediately.

Bosco-2019 avatar Jul 31 '24 13:07 Bosco-2019

experiencing the same issue on current theta and provider testnet.

fish2plain avatar Aug 07 '24 13:08 fish2plain

Looking at this again right now!

julienrbrt avatar Aug 07 '24 15:08 julienrbrt

@julienrbrt I see a potential problem here, in scanner.go

	result, err := exec.Command(fw.currentBin, "status").Output() //nolint:gosec // we want to execute the status command
	if err != nil {
		return 0, err
	}
	// file exist but too early in height
	currentHeight, _ := fw.checkHeight()
	if currentHeight != 0 && currentHeight < info.Height {
		return false
	}

Avoid launching the node binary like that as much as possible, especially without any command-line options. A more thorough way to get data from the running node is to parse config/config.toml once, find its current RPC endpoint, and send an RPC request each time you need to get the current block height. E.g. some of my shell code:

# Function to get the local HTTP RPC endpoint from the node's configuration
# TODO: This is currently limited to Neutron. Cosmos SDK exposes a different "config" command (`config get config rpc.laddr`), will need to change this function to recognize that
get_local_rpc_endpoint() { "$NODE_BINARY" config --home "$NODE_DIR" | jq -er .node | sed s/tcp/http/; }

# Function to send a call to the HTTP RPC endpoint
call_rpc_endpoint() { curl -s --connect-timeout 5 --max-time 20 --retry 5 --retry-delay 0 --retry-max-time 5 "$1/$2" | jq -er .result; }

# Function to get the latest block height via HTTP RPC
get_latest_block_height() { call_rpc_endpoint "$1" "status?" | jq -er .sync_info.latest_block_height; }

# Function to check if we have proper access to the local node
validate_local_node() {
  LOCAL_RPC_ENDPOINT=""
  # Check if the node's binary is readable and executable and if the node's directory exists and is readable
  if [ -d "$NODE_DIR" ] && [ -r "$NODE_DIR" ] && [ -x "$NODE_BINARY" ] && [ -r "$NODE_BINARY" ]; then
    LOCAL_RPC_ENDPOINT=$(get_local_rpc_endpoint); [ -z "$LOCAL_RPC_ENDPOINT" ] && print_error_and_exit "node_error"
    return 0;
  else return 1; fi
}

# Main function to compare heights between the local node and the remote RPC
compare_heights() {
  local_height=$(get_latest_block_height "$LOCAL_RPC_ENDPOINT")
  remote_height=$(get_latest_block_height "$REMOTE_RPC_ENDPOINT")
  { [ -z "$local_height" ] || [ -z "$remote_height" ]; } && print_error_and_exit "curl_error"
  delta=$((remote_height - local_height))
  [ "$delta" -le "$NODE_LAG" ] && \
  { printf "Node %s is not lagging, delta with %s is %s, local height is %s, remote height is %s\n" \
  "$LOCAL_RPC_ENDPOINT" "$REMOTE_RPC_ENDPOINT" "$delta" "$local_height" "$remote_height"; exit 0; } || \
  { printf "Node %s is lagging, delta with %s is %s, local height is %, remote height is %s!\n" \
  "$LOCAL_RPC_ENDPOINT" "$REMOTE_RPC_ENDPOINT" "$delta" "$local_height" "$remote_height"; exit 1; }
}

This is significantly faster because it avoids launching another executable, even if for a moment (I'm still doing that but you can use something like a mandatory DAEMON_RPC_ENDPOINT variable instead). It is compatible with nodes that run the RPC server on a different port, which gaiad status won't pick up by itself without --node. If there are multiple nodes on one machine, this code can end up requesting the status of the node that runs on the default RPC port and using that value of latest_block_height to check if the upgrade should be applied immediately to a different node. It also avoids possible side effects related to outputs and exit codes of the status command. For example, in some earlier versions of CosmosSDK I noticed that this command was printing its output to stderr, not stdout, as did the main logger. Finally, it could resolve your TODO problem related to test binaries :) Of course, I don't know if either of these situations is the culprit behind this specific problem, but refactoring this part of the code to use an RPC call is a good idea anyway and will aid in any further debugging.

pserdiuk-everstake avatar Aug 08 '24 15:08 pserdiuk-everstake

@julienrbrt when can we expect cosmovisor v1.6.0 which I assume will include that fix?

freak12techno avatar Aug 12 '24 20:08 freak12techno

@julienrbrt when can we expect cosmovisor v1.6.0 which I assume will include that fix?

Hey, now: https://github.com/cosmos/cosmos-sdk/releases/tag/tools%2Fcosmovisor%2Fv1.6.0

julienrbrt avatar Aug 12 '24 20:08 julienrbrt