Context deadline exceeded when starting go-relayer with an Osmosis node started more than 1 hr ago
This issue appears to be specific to the Osmosis node; we have not seen it with other nodes.
Steps to reproduce:
- Start an Osmosis node and keep it running for 1 hour
- Start the go relayer, connecting to the Osmosis node
Jan 29 16:40:01 vultr rly[191642]: ts=2024-01-29T16:40:01.948613Z lvl=warn msg="Relayer start error" error="error querying channels: post failed: Post \"http://127.0.0.1:26657\": context deadline exceeded"
Jan 29 16:40:01 vultr rly[191642]: Error: error querying channels: post failed: Post "http://127.0.0.1:26657": context deadline exceeded
The only fix was to restart the Osmosis node and then start go-relayer again. The problem is that once go-relayer stops, it cannot restart automatically without an Osmosis node restart.
$ df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/vda2       600G  279G  296G  49% /

$ free -m
               total        used        free      shared  buff/cache   available
Mem:           32086       11559         586           0       19940       20060
Swap:           8191        2967        5224

$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 40 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 8
On-line CPU(s) list: 0-7
Vendor ID: GenuineIntel
Model name: Intel Core Processor (Broadwell, no TSX, IBRS)
CPU family: 6
Model: 61
Thread(s) per core: 2
Core(s) per socket: 4
Socket(s): 1
Stepping: 2
BogoMIPS: 4788.90
Thanks for opening the issue @jununifi.
Can you share what version of Osmosis you are using, as well as the relayer version? It may also be helpful if we can see how much memory the Osmosis node is using at startup when the relayer seems to work fine, as well as the memory consumption of the Osmosis node at the time you start to see this issue.
In the past, I have had luck changing this setting in config.toml when dealing with context deadline exceeded:

[rpc]
timeout_broadcast_tx_commit = "120s"
At the moment, we are using Osmosis v22.0.1, but this used to happen on v20.x.x as well. Could the context deadline exceeded issue be related to memory consumption?
Thanks for the information, we will try updating the configuration. We are using go-relayer v2.3.1.
Thanks for the additional information! My coworker had the thought that since the issue only occurs after the Osmosis node has been running for some time, perhaps there is a memory leak in the Osmosis binary that causes a longer-than-usual response when the relayer attempts to query the channels on startup.
I experienced this same issue, but with cosmoshub rather than osmosis.

One thing I'd argue is that this shouldn't close down the rly process. Stop attempting to run against that node, sure, but why kill the entire process, which has other working channels?
Interesting 🤔

I do agree that the current behavior is a bit extreme. I think addressing #1268 by only killing the PathProcessor related to the problematic node is a much better design, and it should be something we address in the next release. Implementing support for configuring multiple RPC endpoints would help alleviate some of this as well.
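A minimal sketch of that idea, assuming hypothetical names (pathProcessor, runAll) rather than the relayer's actual types: each path runs in its own goroutine, and a failure on one path is logged instead of being returned as a fatal error, so an unreachable node does not take down paths that are still relaying.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
)

// pathProcessor is a hypothetical stand-in for the relayer's per-path worker.
type pathProcessor struct {
	name string
	run  func(ctx context.Context) error
}

// runAll keeps healthy paths relaying even if one path's node is unreachable:
// a failed path is logged rather than bubbled up as a process-killing error.
func runAll(ctx context.Context, procs []pathProcessor) {
	var wg sync.WaitGroup
	for _, p := range procs {
		wg.Add(1)
		go func(p pathProcessor) {
			defer wg.Done()
			if err := p.run(ctx); err != nil {
				fmt.Printf("path %s stopped: %v\n", p.name, err)
			}
		}(p)
	}
	wg.Wait()
}

func main() {
	procs := []pathProcessor{
		{name: "osmosis<->hub", run: func(ctx context.Context) error {
			return errors.New("context deadline exceeded") // simulated unreachable node
		}},
		{name: "juno<->hub", run: func(ctx context.Context) error {
			return nil // healthy path keeps relaying
		}},
	}
	runAll(context.Background(), procs)
}
```

The trade-off is that a stopped path then needs to be surfaced and restarted on its own, rather than relying on the whole process exiting and being restarted by a supervisor.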
One thing that we noticed is that we were using a single 60s timeout for the entire paginated query used for querying all channels. For chains like Osmosis and Cosmos Hub, that have a ton of channels in state, this could potentially be problematic. In #1395 I've made changes to use a 10s timeout per paginated query request, so this should alleviate some of the issues.
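For illustration, here is a minimal sketch of the per-page timeout idea, with a hypothetical queryChannelsPage standing in for a single paginated RPC request (the actual change lives in #1395): each page gets a fresh 10s deadline instead of one 60s deadline shared across the whole walk, so chains with many channels don't exhaust a single time budget.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

type channel struct{ ID string }

// queryChannelsPage is a hypothetical stand-in for one paginated RPC request;
// it must respect ctx and return the next page key (empty when done).
func queryChannelsPage(ctx context.Context, pageKey []byte) (chans []channel, nextKey []byte, err error) {
	// ... perform the RPC using ctx ...
	return nil, nil, nil
}

// queryAllChannels walks all pages, giving each request its own 10s deadline
// rather than sharing a single 60s deadline across the entire query.
func queryAllChannels() ([]channel, error) {
	var all []channel
	var pageKey []byte
	for {
		ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
		chans, nextKey, err := queryChannelsPage(ctx, pageKey)
		cancel()
		if err != nil {
			return nil, fmt.Errorf("error querying channels: %w", err)
		}
		all = append(all, chans...)
		if len(nextKey) == 0 {
			return all, nil
		}
		pageKey = nextKey
	}
}

func main() {
	chans, err := queryAllChannels()
	fmt.Println(len(chans), err)
}
```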
Will also prioritize only killing the PathProcessors related to a problematic node vs. killing the entire rly process.
This should be resolved in 2.5.1.
@akc2267 @jtieri we still have this constantly even on 2.5.1
Both larger-config and smaller-config instances just go into an infinite loop where they report:
2024-02-22T01:11:44.059393Z warn Relayer start error {"error": "error querying channels: rpc error: code = Unknown desc = Ibc Error: Invalid query path"}
Error: error querying channels: rpc error: code = Unknown desc = Ibc Error: Invalid query path
It reports this constantly for tens of channels and chains; it stalls after ~1-2s and restarts the process, running into the same error.
Our configs are correct, and we can curl the URLs from the server, proving it has access to the nodes.
The funny thing is that although it's in a restart loop, we are processing packets.
I would not expect to see these Ibc Error: Invalid query path logs for the originally described issue; this seems to be an entirely different issue altogether.
Do you know which chain is reporting these errors? I will likely need to do some local testing to suss out what the issue is with regard to that specific error.
@jtieri it is indeed a separate issue; smaller configs now start.
The error logging doesn't show which exact chain is causing this, and I can't seem to distill that without doing an elimination test on the config, which supports > 50 chains.
Happy to share the config privately if that helps.