tower::Balance does not notice if multiple become ready at once
tower-0.5.2
Platform: linux
In the process of using tonic, I found that its dynamic endpoints use tower, and I ran into a problem in actual use. When <Balance as Service>::poll_ready is triggered, the chain Balance::promote_pending_to_ready => ReadyCache::poll_pending => futures_util::stream::futures_unordered::FuturesUnordered::poll_next has an issue: assuming there are multiple endpoints, only one Poll::Ready(Some(Ok((key, svc, cancel_rx)))) is returned here while the others stay Poll::Pending, which defeats the load balancing.
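For context, here is a minimal, standalone sketch of the FuturesUnordered polling semantics in question (this is just the futures crate on its own, not tower's code): a single poll_next call yields at most one completed future, so whatever drives it has to keep polling until Pending to drain everything that became ready at the same time.

```rust
use futures::stream::{FuturesUnordered, StreamExt};
use futures::task::noop_waker;
use std::task::{Context, Poll};

fn main() {
    // Two futures that are both already complete.
    let mut pending = FuturesUnordered::new();
    pending.push(futures::future::ready("a"));
    pending.push(futures::future::ready("b"));

    let waker = noop_waker();
    let mut cx = Context::from_waker(&waker);

    // Each poll_next call yields at most one completed future, even though
    // both are ready; draining everything that is currently ready requires
    // looping until Pending (or Ready(None)) comes back.
    loop {
        match pending.poll_next_unpin(&mut cx) {
            Poll::Ready(Some(v)) => println!("ready: {v}"),
            Poll::Ready(None) => break, // stream exhausted
            Poll::Pending => break,     // nothing else ready right now
        }
    }
}
```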
Hey there! Thanks for the report. Hmm, let's try to improve it some, so we can better find what the root cause is.
What exactly doesn't work? It certainly works in many cases. What you describe about the ready cache sounds fine to me... Are you saying that the problem is that it can only return 1, even if multiple might have been ready at the same time?
Yes, as you said, multiple ready endpoints should be returned, but in fact only one can be returned, and the others are pending.
It seems that when the service added via Change::Insert(Key, Svc) does some I/O or other polling work in its poll_ready phase, FuturesUnordered::poll_next does not behave as expected.
I think I know the reason now.
I'm using tonic, and when the client initiates a request, Buffer's Worker::poll calls poll_next_msg followed by self.service.poll_ready, which in turn calls Balance::poll_ready. Say there are multiple endpoints, one of which is ready while the others are still pending; self.service.poll_ready still returns Poll::Ready(Ok(())). The problem is that if no further request arrives, the pending endpoints never get polled again, even once they are able to become ready.
This is fine most of the time, but if there is complex I/O logic in the service's poll_ready (similar to a server-side active disconnection), the endpoint stuck in pending will not respond for a long time and will eventually be disconnected. Of course, this kind of case can be handled by implementing the corresponding logic inside the service's poll_ready itself, but that may not be the part tower is interested in. A rough sketch of such a service is below.
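To make the scenario concrete, here is a hypothetical endpoint service (the names are illustrative, not tonic's actual types) whose readiness is only reached by driving a handshake future inside poll_ready. If the balancer's caller stops polling poll_ready once some other endpoint is ready, this handshake never makes progress again:

```rust
use std::future::Future;
use std::pin::Pin;
use std::task::{Context, Poll};
use tower::Service;

// Hypothetical endpoint: the connection/TLS handshake is driven inside
// poll_ready, roughly like a reconnecting HTTP/2 endpoint would do.
struct HandshakingEndpoint {
    handshake: Option<Pin<Box<dyn Future<Output = ()> + Send>>>,
}

impl Service<String> for HandshakingEndpoint {
    type Response = String;
    type Error = std::convert::Infallible;
    type Future = std::future::Ready<Result<String, Self::Error>>;

    fn poll_ready(&mut self, cx: &mut Context<'_>) -> Poll<Result<(), Self::Error>> {
        if let Some(fut) = self.handshake.as_mut() {
            // The handshake only makes progress while somebody keeps calling
            // poll_ready. If the caller stops polling this endpoint once a
            // sibling endpoint became ready, this future is never driven
            // again and the server eventually drops the half-open connection.
            match fut.as_mut().poll(cx) {
                Poll::Pending => return Poll::Pending,
                Poll::Ready(()) => {}
            }
            self.handshake = None;
        }
        Poll::Ready(Ok(()))
    }

    fn call(&mut self, req: String) -> Self::Future {
        std::future::ready(Ok(req))
    }
}
```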
From my point of view, there are two solutions to fix this issue:
- tower handles the wake-up logic for multiple pending endpoints in buffer/worker or balance, so it is not tied to worker::poll_next_msg (see the rough caller-side sketch after this list);
- the user of tower (i.e. tonic) detects the error during the State::Connecting poll stage in the service's poll_ready (in tonic this should correspond to Reconnect). I have actually tested this, and unfortunately no Err is returned in this case, so it may be difficult to do.
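For the first option, I don't have a patch to tower itself, but here is an application-side sketch of the same idea, under the assumption that you own the Balance directly (before wrapping it in Buffer): a background task keeps re-polling readiness on a timer so that endpoints that are still connecting continue to be driven even when no request is in flight. drive_readiness is a name made up for illustration.

```rust
use std::time::Duration;
use tower::{Service, ServiceExt};

// Sketch only: periodically re-drive the balancer's readiness so pending
// endpoints keep making progress between requests. Assumes exclusive access
// to the Balance (i.e. before it is handed to Buffer).
async fn drive_readiness<S, R>(balance: &mut S)
where
    S: Service<R>,
{
    loop {
        // ready() resolves once poll_ready returns Ready, and in the meantime
        // re-polls whenever an endpoint's waker fires -- exactly the wake-up
        // that is otherwise lost once the Buffer worker stops polling.
        if balance.ready().await.is_err() {
            break;
        }
        // Back off so the loop does not spin once the balancer reports ready;
        // each iteration gives still-pending endpoints another poll.
        tokio::time::sleep(Duration::from_millis(100)).await;
    }
}
```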
I've seen similar issues to this in production that I've been trying to track down, and I believe this is the same issue.
In my case I have a gRPC client (tonic) that is using client-side load balancing (tower::Balance) with multiple endpoints. If there is very little traffic on the client, I run into an issue where the first request will start connecting to 2 new endpoints (due to the p2c load balancing). Once the first connection is successful, the request is sent and the 2nd just stalls. If the connections are using TLS, this stall seems to happen after the TCP handshake, and in the middle of the TLS handshake (ClientHello -> ServerHello -> nothing). Since the 2nd connection isn't fully established yet, this prevents keep-alives from being sent and the server eventually kills it. Later when another request is sent and selects the 2nd connection, it causes the request to fail with tls handshake eof (which caused a non-trivial amount of TLS config debugging that went nowhere 😅 ).
I've confirmed that the fix for this issue also fixes the issue I'm seeing, and I would like to encourage it to be revisited / merged in.