RLY fails if one node is unavailable
When a node is unreachable, the entire rly process restarts even though other channels are still being covered. It seems like the affected channel should be skipped, rather than the entire service being stopped.
2023-08-28T03:11:56.839633Z error Failed to query node status {"chain_name": "lumnetwork", "chain_id": "lum-network-1", "attempt": 5, "max_attempts": 5, "error": "failed to query node status: post failed: Post \"http://ip_here\": context deadline exceeded"}
2023-08-28T03:11:56.839657Z error Failed to query latest height after max attempts {"chain_name": "lumnetwork", "chain_id": "lum-network-1", "attempts": 5, "error": "failed to query node status: post failed: Post \"http://ip_here\": context deadline exceeded"}
2023-08-28T03:11:56.840208Z error Failed to query latest height after max attempts {"chain_name": "nolus", "chain_id": "pirin-1", "attempts": 5, "error": "context canceled"}
rly.service: Deactivated successfully.
rly.service: Consumed 1min 8.125s CPU time.
rly.service: Scheduled restart job, restart counter is at 1.
Stopped RLY IBC relayer for mainnet.
rly.service: Consumed 1min 8.125s CPU time.
Started RLY IBC relayer for mainnet.
Thanks for opening this issue. I agree that restarting because one node is unreachable doesn't seem like desirable behavior. I'll discuss this internally and see how the team wants to prioritize it; I may be able to take this on in our next sprint.
Is there any update on this? Or is there any config that can be set to prevent it from restarting?
I started working on a PoC for this a while back but got pulled away to work on some other stuff.
Recently, one of the engineers on our team revisited this issue but struggled to get the rly process to crash due to one node being unavailable. He said, "I am having a very difficult time figuring out how to make the Chain Processor error out and crash the application. Even if the chain is configured with an invalid node endpoint, it will just keep trying and trying; it never crashes. I've looked into the code, and as of right now, the only time it will fully error out is when there is a stuck packet that doesn't get resolved: https://github.com/cosmos/relayer/blob/df42391dd3ab04fce238adb7b4112d7bd10fa63c/relayer/chains/cosmos/cosmos_chain_processor.go#L486-L496"
@joelsmith-2019 does this sound correct?
If we can confirm that this behavior is still present, and we can replicate it locally in testing, then we should be able to find someone to take this on sooner rather than later and refactor things into a state with more desirable behavior.
@jtieri - Yes, that does sound correct.
This could be related to the number of chains and paths we relay.
We faced the issue with 25 chains and 88 paths.