smartnode icon indicating copy to clipboard operation
smartnode copied to clipboard

Dial EC endpoints with a timeout in ExecutionClientManager

Open 0xtrooper opened this issue 7 months ago • 3 comments

What: Use a 5-second context timeout when dialing primary and fallback EC URLs to avoid blocking indefinitely.

Why: Prevents startup hangs if an endpoint is down or unresponsive.

How: Introduce dialWithTimeout(url string) helper that utilizes ethclient.DialContext(..) and apply to both primary and fallback dials in NewExecutionClientManager.

0xtrooper avatar May 21 '25 11:05 0xtrooper

So, I talked to fornax a bit about this offline, but this is sort of a bandaid that won't actually fix the larger problem- that we abstracted away the context fields in all of these requests within the client.

When the rocketpool binary starts, we should create a cancelable parent context for all actions, and make children using WithTimeout. This would be an enormous refactor, but it would mean that context-aware timeouts would be possible.

5 seconds seems too short for a global dial timeout. ECs that are in the middle of syncing, for example, are still dialed to get sync status, but may take more than 5 seconds to reply.

I think it's a little too dangerous to apply a unilateral 5 second timeout here.

jshufro avatar May 27 '25 18:05 jshufro

I agree that a global context would make a lot of sense. Would it be reasonable to have a (longer) timeout for the fallback client only? Feels a bit strange to have the UI not respond because the fallback EC has some issues.

0xtrooper avatar May 28 '25 18:05 0xtrooper

I agree that a global context would make a lot of sense. Would it be reasonable to have a (longer) timeout for the fallback client only? Feels a bit strange to have the UI not respond because the fallback EC has some issues.

Sure, that would make sense. I think at some point I also want to add a feature that treats the fallback ec as the primary for queries in rocketpool_node and rocketpool_api so that things like collecting metrics don't use the primary ec if a fallback is available, but adding a timeout shouldn't really hurt that use-case.

jshufro avatar May 28 '25 18:05 jshufro