Dial EC endpoints with a timeout in ExecutionClientManager
What: Use a 5-second context timeout when dialing primary and fallback EC URLs to avoid blocking indefinitely.
Why: Prevents startup hangs if an endpoint is down or unresponsive.
How:
Introduce dialWithTimeout(url string) helper that utilizes ethclient.DialContext(..) and apply to both primary and fallback dials in NewExecutionClientManager.
So, I talked to fornax a bit about this offline, but this is sort of a bandaid that won't actually fix the larger problem- that we abstracted away the context fields in all of these requests within the client.
When the rocketpool binary starts, we should create a cancelable parent context for all actions, and make children using WithTimeout. This would be an enormous refactor, but it would mean that context-aware timeouts would be possible.
5 seconds seems too short for a global dial timeout. ECs that are in the middle of syncing, for example, are still dialed to get sync status, but may take more than 5 seconds to reply.
I think it's a little too dangerous to apply a unilateral 5 second timeout here.
I agree that a global context would make a lot of sense. Would it be reasonable to have a (longer) timeout for the fallback client only? Feels a bit strange to have the UI not respond because the fallback EC has some issues.
I agree that a global context would make a lot of sense. Would it be reasonable to have a (longer) timeout for the fallback client only? Feels a bit strange to have the UI not respond because the fallback EC has some issues.
Sure, that would make sense. I think at some point I also want to add a feature that treats the fallback ec as the primary for queries in rocketpool_node and rocketpool_api so that things like collecting metrics don't use the primary ec if a fallback is available, but adding a timeout shouldn't really hurt that use-case.