
IBC client suddenly gets frozen despite being constantly updated

Open McDaan opened this issue 10 months ago • 4 comments

I already reported a similar (or perhaps related) behavior internally a month or so ago. It looks like something is broken with IBC client statuses on Namada's side. The frozen height isn't even reported correctly, so I assume it's somehow related to the IBC client suddenly no longer being updated properly at some point. I have no clue what may cause this, though.

hermes query client status --chain shielded-expedition.88f17d1d14 --client 07-tendermint-3546
SUCCESS Frozen
hermes query client state --chain shielded-expedition.88f17d1d14 --client 07-tendermint-3546
SUCCESS Tendermint(
    ClientState {
        chain_id: ChainId {
            id: "theta-testnet-001",
            version: 0,
        },
        trust_threshold: TrustThreshold(
            Ratio {
                numer: 2,
                denom: 3,
            },
        ),
        trusting_period: 115200s,
        unbonding_period: 172800s,
        max_clock_drift: 50s,
        latest_height: Height {
            revision: 0,
            height: 21020900,
        },
        proof_specs: ProofSpecs(
            [
                ProofSpec {
                    leaf_spec: Some(
                        LeafOp {
                            hash: Sha256,
                            prehash_key: NoHash,
                            prehash_value: Sha256,
                            length: VarProto,
                            prefix: [
                                0,
                            ],
                        },
                    ),
                    inner_spec: Some(
                        InnerSpec {
                            child_order: [
                                0,
                                1,
                            ],
                            child_size: 33,
                            min_prefix_length: 4,
                            max_prefix_length: 12,
                            empty_child: [],
                            hash: Sha256,
                        },
                    ),
                    max_depth: 0,
                    min_depth: 0,
                    prehash_key_before_comparison: false,
                },
                ProofSpec {
                    leaf_spec: Some(
                        LeafOp {
                            hash: Sha256,
                            prehash_key: NoHash,
                            prehash_value: Sha256,
                            length: VarProto,
                            prefix: [
                                0,
                            ],
                        },
                    ),
                    inner_spec: Some(
                        InnerSpec {
                            child_order: [
                                0,
                                1,
                            ],
                            child_size: 32,
                            min_prefix_length: 1,
                            max_prefix_length: 1,
                            empty_child: [],
                            hash: Sha256,
                        },
                    ),
                    max_depth: 0,
                    min_depth: 0,
                    prehash_key_before_comparison: false,
                },
            ],
        ),
        upgrade_path: [
            "upgrade",
            "upgradedIBCState",
        ],
        allow_update: AllowUpdate {
            after_expiry: true,
            after_misbehaviour: true,
        },
        frozen_height: Some(
            Height {
                revision: 0,
                height: 1,
            },
        ),
    },
)

Opening this issue on both the Heliax Hermes fork and the Namada repository, since I'm not sure which is the better place to report this bug.

McDaan avatar Apr 02 '24 16:04 McDaan

See https://github.com/heliaxdev/hermes/issues/27#issuecomment-2035442593

zenodeapp avatar Apr 03 '24 19:04 zenodeapp

Thank you so much for your report and investigation; they're super helpful. This is caused by an ibc-rs bug, ibc-rs/#1080 (fixed in v0.51.0).

An IBC client can be updated by multiple messages. In Namada, each message is executed one by one, but not necessarily in order of height, so a client can be updated to a lower height after a higher one. Due to the bug, the IBC consensus state at the client's latest height is overwritten by the one for the lower height. After the overwrite, a subsequent update detects that the consensus state's timestamp is unexpected. For example: the client is updated to height 900, then to height 896. The bug overwrites the consensus state at height 900 instead of writing a new one at 896. The next update, to height 898, checks the "next" consensus state, i.e. the one stored at height 900, whose timestamp should be later than the header at 898. However, since that slot now holds the height-896 state, its timestamp is earlier than the header at 898. This is detected as misbehaviour and the client is frozen.
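The sequence above can be sketched in Rust. This is a minimal, hypothetical simplification (heights and timestamps are plain integers, the store is a `BTreeMap`), not the actual ibc-rs code:

```rust
use std::collections::BTreeMap;

/// Sketch of the buggy update path (simplified model of ibc-rs/#1080):
/// the store maps height -> consensus-state timestamp.
fn simulate_out_of_order_updates() -> bool {
    let mut store: BTreeMap<u64, u64> = BTreeMap::new();

    // 1. Client updated to height 900 (timestamp 9000): stored correctly.
    store.insert(900, 9000);

    // 2. Client updated to height 896 (timestamp 8960). The bug writes the
    //    consensus state under the client's latest height (900), not 896.
    store.insert(900, 8960);

    // 3. Client updated to height 898 (timestamp 8980). The misbehaviour
    //    check fetches the "next" consensus state above 898 -- the entry
    //    at height 900 -- and expects its timestamp to be later.
    let (_next_height, next_ts) = store
        .range(899..)
        .next()
        .map(|(h, t)| (*h, *t))
        .unwrap();

    // Because height 900 now holds the height-896 timestamp (8960 < 8980),
    // the check fails: the update looks like misbehaviour and the client
    // is frozen.
    next_ts <= 8980
}

fn main() {
    println!("client frozen: {}", simulate_out_of_order_updates());
}
```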

Hermes retries transactions when they're rejected by the destination chain (e.g. trying to receive a token at a znam address without a memo). Before each retry, a client update is requested. I think these many client updates contribute to triggering the issue.

We need to update ibc-rs in Namada to resolve the issue. In the meantime, I think there is a workaround:

  • Set trusted_node to true for the counterparty chain in the Hermes config

This workaround works for a single Hermes instance: with this setting, only one update message is submitted per request. It does not fully avoid the issue when multiple Hermes instances serve a single channel (and note that this is an experimental parameter).
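For reference, the workaround corresponds to a per-chain entry in Hermes's `config.toml`; a sketch, assuming a typical chain section (the chain id matches the counterparty shown in the client state above):

```toml
[[chains]]
id = "theta-testnet-001"
# ... other existing chain settings ...

# Experimental: treat the node as trusted so Hermes submits only one
# update message per client-update request.
trusted_node = true
```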

yito88 avatar Apr 05 '24 15:04 yito88

Awesome! That makes sense! Glad it got solved!

Reading your explanation, there are probably more ways to recreate this issue, but it's best for us to try to mitigate it as much as possible, starting with the experimental parameter. I'll also prevent users on my frontend from accidentally sending to a znam address without a memo.

Thanks for getting back to us and your detailed explanation!

zenodeapp avatar Apr 05 '24 20:04 zenodeapp

FYI: We won't use the ICS-20 memo field for shielding transfers after the next breaking change. The change was made in #2631. Our updated Hermes will request the shielding transfer info from the destination Namada chain and set it in a Namada transaction, i.e. a user doesn't need to care about the shielding info and just sets the znam address as the receiver. (This is not a solution for this issue, since anyone can still submit invalid transactions.)

yito88 avatar Apr 05 '24 20:04 yito88