
RLY fails if one node is unavailable

Open · dylanschultzie opened this issue 2 years ago · 5 comments

When a node is unreachable, the entire rly process restarts, even though other channels are still being covered. It seems like that channel should be skipped rather than the entire service being stopped.

```
2023-08-28T03:11:56.839633Z	error	Failed to query node status	{"chain_name": "lumnetwork", "chain_id": "lum-network-1", "attempt": 5, "max_attempts": 5, "error": "failed to query node status: post failed: Post \"http://ip_here\": context deadline exceeded"}
2023-08-28T03:11:56.839657Z	error	Failed to query latest height after max attempts	{"chain_name": "lumnetwork", "chain_id": "lum-network-1", "attempts": 5, "error": "failed to query node status: post failed: Post \"http://ip_here\": context deadline exceeded"}
2023-08-28T03:11:56.840208Z	error	Failed to query latest height after max attempts	{"chain_name": "nolus", "chain_id": "pirin-1", "attempts": 5, "error": "context canceled"}
rly.service: Deactivated successfully.
rly.service: Consumed 1min 8.125s CPU time.
rly.service: Scheduled restart job, restart counter is at 1.
Stopped RLY IBC relayer for mainnet.
rly.service: Consumed 1min 8.125s CPU time.
Started RLY IBC relayer for mainnet.
```
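
For what it's worth, the log pattern above (one chain failing after max attempts, the other exiting with "context canceled") is what you'd expect if all chain processors run under a shared errgroup-style supervisor: the first processor to return an error cancels the common context and takes every healthy chain down with it. A minimal Go sketch of that failure mode, with hypothetical names (not the relayer's actual wiring):

```go
package main

import (
	"context"
	"errors"
	"fmt"

	"golang.org/x/sync/errgroup"
)

// runChain stands in for a per-chain processor loop. Hypothetical,
// not the relayer's actual code.
func runChain(ctx context.Context, name string, healthy bool) error {
	if !healthy {
		// One unreachable node surfaces as an error from its processor.
		return errors.New(name + ": failed to query node status")
	}
	// Healthy chains block until the shared context is canceled.
	<-ctx.Done()
	return ctx.Err()
}

func main() {
	eg, ctx := errgroup.WithContext(context.Background())
	eg.Go(func() error { return runChain(ctx, "lum-network-1", false) })
	eg.Go(func() error { return runChain(ctx, "pirin-1", true) })

	// The first non-nil error cancels ctx, so the healthy chain exits
	// with "context canceled" -- matching the pirin-1 line in the logs.
	fmt.Println(eg.Wait())
}
```

Note how the "context canceled" error for pirin-1 above is exactly what a healthy processor would report once a failing one cancels the group.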

dylanschultzie avatar Aug 28 '23 03:08 dylanschultzie

Thanks for opening this issue. I agree that restarting because one node is unreachable doesn't seem like desirable behavior. I'll discuss this internally and see how the team wants to prioritize it; I may be able to take this on in our next sprint.

jtieri avatar Oct 11 '23 19:10 jtieri

Is there any update on this? Or is there any config that can be set to prevent it from restarting?

tiagocmachado avatar Jul 11 '24 14:07 tiagocmachado

> Is there any update on this? Or is there any config that can be set to prevent it from restarting?

I started working on a PoC for this a while back but got pulled away to work on some other things.

Recently one of the engineers on our team revisited this issue but was struggling to get the rly process to crash due to one node being unavailable. He said, "I am having a very difficult time figuring out how to make the Chain Processor error out and crash the application. Even if the chain is configured with an invalid node endpoint, it will just keep trying and trying; it never crashes. I've looked into the code, and as of right now, the only time it will fully error out is when there is a stuck packet that doesn't get resolved: https://github.com/cosmos/relayer/blob/df42391dd3ab04fce238adb7b4112d7bd10fa63c/relayer/chains/cosmos/cosmos_chain_processor.go#L486-L496"

@joelsmith-2019 does this sound correct?
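
To make the two behaviors he describes concrete, here is a rough Go sketch (hypothetical helpers, not the code at the link above) of the difference between a query loop that retries forever and one that gives up after a fixed number of attempts and propagates the error. If a caller treats that returned error as fatal, one bad node can take the whole process down:

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// queryStatus stands in for an RPC call to the chain's node. Hypothetical;
// here it always fails, simulating an unreachable endpoint.
func queryStatus(ctx context.Context) error {
	return errors.New("post failed: context deadline exceeded")
}

// retryForever matches the behavior described above: keep trying and never
// surface an error to the caller, so the process cannot crash from it.
func retryForever(ctx context.Context) error {
	for {
		if err := queryStatus(ctx); err == nil {
			return nil
		}
		select {
		case <-ctx.Done():
			return ctx.Err() // stops only on whole-app shutdown
		case <-time.After(5 * time.Second):
		}
	}
}

// retryWithLimit gives up after maxAttempts and propagates the error --
// the kind of path that becomes fatal if the caller treats it that way.
func retryWithLimit(ctx context.Context, maxAttempts int) error {
	var err error
	for i := 0; i < maxAttempts; i++ {
		if err = queryStatus(ctx); err == nil {
			return nil
		}
	}
	return fmt.Errorf("failed to query node status after %d attempts: %w", maxAttempts, err)
}

func main() {
	fmt.Println(retryWithLimit(context.Background(), 5))
}
```

The `"attempt": 5, "max_attempts": 5` fields in the original logs suggest the height query takes the second shape, which may be why the reporter saw a crash while a chain configured with an invalid endpoint from the start did not produce one.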


If we can confirm that this behavior is still present and we can replicate it locally in testing, then we should be able to find someone who can take this on sooner rather than later and refactor things into a more desirable state.
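
One possible shape for that refactor, sketched under the assumption that the processors currently share a fail-fast supervisor (all names hypothetical; this is not a patch against the actual code): supervise each chain independently and restart only the failing one with backoff, so a dead node degrades one path instead of ending the service.

```go
package main

import (
	"context"
	"errors"
	"log"
	"sync"
	"time"
)

// superviseChain restarts one chain's processor with backoff rather than
// letting its error take down the other chains. Hypothetical sketch only.
func superviseChain(ctx context.Context, name string, run func(context.Context) error) {
	backoff := time.Second
	for {
		err := run(ctx)
		if err == nil || ctx.Err() != nil {
			return // clean exit, or the whole app is shutting down
		}
		log.Printf("chain %s errored, restarting in %s: %v", name, backoff, err)
		select {
		case <-ctx.Done():
			return
		case <-time.After(backoff):
		}
		if backoff < time.Minute {
			backoff *= 2 // exponential backoff, capped at one minute
		}
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	var wg sync.WaitGroup
	for _, name := range []string{"lum-network-1", "pirin-1"} {
		name := name
		wg.Add(1)
		go func() {
			defer wg.Done()
			superviseChain(ctx, name, func(ctx context.Context) error {
				return errors.New("failed to query node status") // simulate a bad node
			})
		}()
	}
	wg.Wait() // each chain retries independently until shutdown
}
```

A supervisor like this keeps the healthy paths relaying while the unreachable chain cycles through reconnect attempts, which seems closer to the behavior the original report asks for.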

jtieri avatar Jul 17 '24 17:07 jtieri

@jtieri - Yes, that does sound correct.

joelsmith-2019 avatar Jul 17 '24 17:07 joelsmith-2019

> > Is there any update on this? Or is there any config that can be set to prevent it from restarting?
>
> I started working on a PoC for this a while back but got pulled away to work on some other things.
>
> Recently one of the engineers on our team revisited this issue but was struggling to get the rly process to crash due to one node being unavailable. He said, "I am having a very difficult time figuring out how to make the Chain Processor error out and crash the application. Even if the chain is configured with an invalid node endpoint, it will just keep trying and trying; it never crashes. I've looked into the code, and as of right now, the only time it will fully error out is when there is a stuck packet that doesn't get resolved: https://github.com/cosmos/relayer/blob/df42391dd3ab04fce238adb7b4112d7bd10fa63c/relayer/chains/cosmos/cosmos_chain_processor.go#L486-L496"
>
> @joelsmith-2019 does this sound correct?
>
> If we can confirm that this behavior is still present and we can replicate it locally in testing, then we should be able to find someone who can take this on sooner rather than later and refactor things into a more desirable state.

This could be related to the number of chains and paths we relay.

We faced the issue with 25 chains and 88 paths.

tiagocmachado avatar Jul 17 '24 21:07 tiagocmachado