Fixed bug where pending requests are lost on disconnect
Keep track of pending requests and make sure they are handled properly on connection disconnect; otherwise the client making the request hangs until the timeout.
We hit this during a deploy: one of our web services calls into a grain to load settings at the same time we are rolling silos, so the call gets disconnected and has to wait for the 30-second timeout instead of retrying straight away.
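A rough sketch of the idea, in illustrative C# (the names here are hypothetical, not the actual Orleans connection internals): track the requests in flight on a connection and fail them as soon as that connection closes, so the caller observes the disconnect immediately and can retry instead of waiting out the response timeout.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

sealed class PendingRequestTracker
{
    private readonly ConcurrentDictionary<long, TaskCompletionSource<object>> _pending = new();

    // Register a request before sending it; the returned task completes with the reply.
    public Task<object> Register(long requestId)
    {
        var tcs = new TaskCompletionSource<object>(TaskCreationOptions.RunContinuationsAsynchronously);
        _pending[requestId] = tcs;
        return tcs.Task;
    }

    // Called when a reply arrives for a previously registered request.
    public void Complete(long requestId, object response)
    {
        if (_pending.TryRemove(requestId, out var tcs)) tcs.TrySetResult(response);
    }

    // Called when the underlying connection is torn down: fail everything still
    // outstanding so callers see the disconnect and can retry right away.
    public void OnConnectionClosed(Exception reason)
    {
        foreach (var id in _pending.Keys)
        {
            if (_pending.TryRemove(id, out var tcs)) tcs.TrySetException(reason);
        }
    }
}
```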
Does this PR fix your issue? I believe it will be unreliable in cases where there is more than one connection between each endpoint, or cases where a response is returned via a different gateway (client -req-> A -req-> B -resp-> client).
A more general approach might be to operate at a higher level and catch disconnection events in GatewayManager or similar, though that still suffers from the CABC flaw above (e.g., where the connection to A drops after the request has already been successfully sent to B).
@benjaminpetit may have some thoughts.
So in the case of rolling restarts and a singleton grain it does. From what I know, the non-silo client sends the request directly to the silo hosting the grain, and that silo is restarting (thus no response ever), so the request currently sits there waiting until the timeout. However, I don't know how Orleans routes responses beyond this.
Yeah, this change is definitely not correct as-is. There is no guarantee that replies will arrive back through the same Connection object as the request. For silo-to-silo grain calls, that is actually not the normal case: requests and replies typically use different Connection objects, with the request arriving over a connection made inbound and the response leaving over one made outbound.
Also clients don't connect directly to the silo that hosts the grain. They connect to several gateway silos, and send messages to them randomly. The gateways then forward messages to the real hosting silo (which is sometimes itself, but not always).
An effort is made to route responses back through the same gateway as the request (although if the silo hosting the target grain is itself a gateway that the client is connected to, the response might return through it instead to save a hop).
When a silo shuts down gracefully, any requests being processed are given the chance to complete, and their responses get a chance to leave the silo. If there are additional messages queued up, a new activation is requested (which will be placed on a different silo), and all of the queued messages are forwarded to it.
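As a conceptual sketch of that shutdown sequence (the names are illustrative, not the actual Orleans deactivation code): in-flight work is drained first so replies can still leave the silo, and only then is the remaining queue forwarded for a new activation elsewhere to process.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

sealed class GracefulDeactivationSketch
{
    private readonly List<Task> _inFlightRequests = new();
    private readonly Queue<object> _queuedMessages = new();

    public async Task DeactivateAsync(Func<object, Task> forwardToNewActivation)
    {
        // 1. Let the requests currently being processed finish, so their responses
        //    have a chance to leave the silo before it goes away.
        await Task.WhenAll(_inFlightRequests);

        // 2. Forward whatever was still queued; a new activation on another silo
        //    will pick these messages up and process them.
        while (_queuedMessages.Count > 0)
        {
            await forwardToNewActivation(_queuedMessages.Dequeue());
        }
    }
}
```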
So assuming the silo has fully shut down gracefully, replies are not supposed to get lost. But there might be some edge cases that are not handled. Indeed, I think I found such a case.
If a grain on a non-gateway silo tries to reply to a client but in the interim the Gateway used for the request has shut down, I don't see any code that would enable the message to be re-routed to a different gateway. So I think this would result in a lost reply, and the client needing to time out. In theory this could be handled by looking up the ClientId in the directory to find another gateway the client is connected to (like when routing to a client addressable).
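A hypothetical sketch of that fallback (none of this is existing Orleans code; the directory and membership lookups are stand-ins): prefer the gateway the request came through, and if it has gone away, fall back to another gateway the directory says the client is connected to.

```csharp
using System.Collections.Generic;

sealed class ReplyRoutingSketch
{
    // Stand-in for the directory that maps a client id to the gateways it is connected to.
    private readonly Dictionary<string, List<string>> _clientGateways = new();

    // Stand-in for cluster membership: the set of silos currently known to be alive.
    private readonly HashSet<string> _liveSilos = new();

    public string? ResolveGatewayForReply(string clientId, string gatewayUsedForRequest)
    {
        // Prefer returning the reply through the same gateway the request arrived on.
        if (_liveSilos.Contains(gatewayUsedForRequest))
            return gatewayUsedForRequest;

        // That gateway has shut down: look the client up in the directory and pick
        // another gateway it is connected to, much like routing to a client addressable.
        if (_clientGateways.TryGetValue(clientId, out var gateways))
        {
            foreach (var gateway in gateways)
            {
                if (_liveSilos.Contains(gateway)) return gateway;
            }
        }

        // No live gateway known for this client: the reply is effectively lost and
        // the caller will hit its timeout.
        return null;
    }
}
```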
We did have a problem with the silos not being shut down correctly, so that might be the root of the issue. Good to know more of the internal details of how this is meant to work, and I'm happy to close out this PR.