
Support grpc readiness probing use case

Open sfc-gh-kyurtsever opened this issue 10 months ago • 5 comments

Is your feature request related to a problem?

We have a scenario where we are starting a VM and would like to detect that a gRPC service on the VM is up. We want to have the communication to be directed from the starting host to the VM rather than the other way round, because many VMs are being started at the same time and we want to avoid overloading the host initiating the creations while parallelizing the VM creation as much as possible. Therefore we would like to have a way, using gRPC, to check whether the gRPC connection is possible, while minimizing both the resources spent on polling on the initiating side and the time between the service being up and the connection being established.

Describe the solution you'd like

We were expecting to rely on the underlying gRPC TCP reconnection framework. Using the channel.getState and channel.notifyWhenStateChanged we are able to implement the polling. However the 2 minute maximum backoff is slightly larger than we would like - effectively we would like a fairly steady rate of polling.
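The polling described above can be sketched roughly as follows. `getState(true)` and `notifyWhenStateChanged` are the actual `ManagedChannel` APIs; the target address and surrounding structure are illustrative assumptions, not the exact code we use:

```java
import io.grpc.ConnectivityState;
import io.grpc.ManagedChannel;
import io.grpc.ManagedChannelBuilder;
import java.util.concurrent.CountDownLatch;

public class ReadinessWatcher {
  // Re-registers a state-change callback until the channel reaches READY.
  public static void awaitReady(ManagedChannel channel, CountDownLatch ready) {
    // getState(true) also asks the channel to connect if it is currently IDLE.
    ConnectivityState state = channel.getState(true);
    if (state == ConnectivityState.READY) {
      ready.countDown();
      return;
    }
    // The callback fires once per state change, so we re-register each time.
    channel.notifyWhenStateChanged(state, () -> awaitReady(channel, ready));
  }

  public static void main(String[] args) throws InterruptedException {
    ManagedChannel channel = ManagedChannelBuilder
        .forTarget("vm-b1.example.com:50051") // hypothetical VM address
        .usePlaintext()
        .build();
    CountDownLatch ready = new CountDownLatch(1);
    awaitReady(channel, ready);
    ready.await(); // blocks until the channel first becomes READY
    channel.shutdownNow();
  }
}
```

The reconnect attempts driven by `getState(true)` are what the exponential backoff (and its 2-minute cap) applies to.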

Describe alternatives you've considered

We can mostly reimplement our own backoff policy by calling resetConnectBackoff() at appropriate times, but that feels like abusing the API. We can also implement the readiness check outside of gRPC, but most of the infrastructure is already there, so it would be nice to be able to reuse it.
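The resetConnectBackoff() workaround could look like the following sketch, assuming a fixed 5-second cadence (the interval and scheduler setup are illustrative; `resetConnectBackoff()` is the real, though experimental, `ManagedChannel` API):

```java
import io.grpc.ManagedChannel;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class SteadyPoller {
  // Forces a roughly steady reconnect cadence by clearing gRPC's
  // exponential backoff on a timer. Returns the scheduler so the
  // caller can shut it down once the channel is READY.
  public static ScheduledExecutorService pollEvery5s(ManagedChannel channel) {
    ScheduledExecutorService scheduler =
        Executors.newSingleThreadScheduledExecutor();
    scheduler.scheduleAtFixedRate(
        // resetConnectBackoff() makes the next reconnect attempt happen
        // immediately instead of waiting out the remaining backoff.
        channel::resetConnectBackoff,
        5, 5, TimeUnit.SECONDS);
    return scheduler;
  }
}
```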

Additional context

I'm aware of https://github.com/grpc/grpc-java/issues/9353 and https://github.com/grpc/grpc-java/issues/10932. That being said, the latter issue had a slightly different use case, and in the former the main blocker seemed to be the lack of a clear motivating problem. Here I'm presenting one, so I'd like feedback on whether this idea would make sense for this use case.

sfc-gh-kyurtsever avatar Jun 16 '25 15:06 sfc-gh-kyurtsever

You said you "want to have the communication to be directed from the starting host to the VM rather than the other way round" but also said "we want to avoid overloading the host initiating the creations". If the gRPC client is on the starting host, then the host is the entity that would be creating the connections. Can you clarify?

kannanjgithub avatar Jun 17 '25 15:06 kannanjgithub

Sorry for the confusion - "want to have the communication to be directed from the starting host to the VM rather than the other way round" - the "starting host" is the host starting the VM. I'll try to rephrase this more clearly.

We have a host A which starts a number of VMs - let's call them B1, B2, B3, and so on. Host A wants to detect when a gRPC service Foo is ready on each of the B1, B2, B3, ... hosts. We could solve this by having B1, B2, B3, ... call host A back once they are ready, but we want to avoid the risk of overloading host A. Therefore we would like a polling mechanism so that host A can quickly detect, for each of the B hosts, when service Foo on that host is ready.

Does that help?

sfc-gh-kyurtsever avatar Jun 18 '25 07:06 sfc-gh-kyurtsever

In your case the average amount of time (1 minute) is probably less useful, because you are waiting until N VMs are started which means the tail will dominate.

> However the 2 minute maximum backoff is slightly larger than we would like

"Slightly," like 1 minute? You're expecting the VMs to take longer than 6+ minutes to start? It takes that long before the 2 minute maximum makes a difference.

> effectively we would like a fairly steady rate of polling.

So would everyone else with a client, except the servers really wish they wouldn't. The problem with giving you that option is that others will use it as well. Given what people have already requested, whatever we offer here will be abused to its fullest extent.

I mentioned in the past that adjusting the multiplier is less of a concern. I don't think the maximum delay is the important part. It is easy to think about, but if you're expecting the connection to be unavailable for minutes, it seems you should be looking at this as a percentage of the overall delay. gRPC contributing a 2 minute delay is quite different if the host has been unavailable for 10 seconds vs a week. So I'll probably be looking at the delay as a percentage of how long it takes a VM to start.

ejona86 avatar Jun 20 '25 17:06 ejona86

> In your case the average amount of time (1 minute) is probably less useful, because you are waiting until N VMs are started which means the tail will dominate.

The tail doesn’t dominate because every VM that we deem ready immediately goes into service.

"Slightly," like 1 minute? You're expecting the VMs to take longer than 6+ minutes to start? It takes that long before the 2 minute maximum makes a difference.

We are capping the wait at 5 minutes. A lower maximum backoff would help if it were low enough, but indeed it's not the only option; modifying the multiplier also works.

The problem we are facing is that, with the current configuration, in the range where the VM is most likely to become available (depending on the environment, between 15s and 2-3 minutes), the delays caused by the backoff would also be quite large as a percentage.

I’ve compiled a comparison of wait times for the current values (based on the code, ignoring jitter for simplicity) and for the values if the multiplier were lowered to e.g. 1.1:

| Round | Effective backoff (s) | Cumulative wait (s) | Backoff as % of cumulative | Effective backoff (s), multiplier 1.1 | Cumulative wait (s) | Backoff as % of cumulative |
|---|---|---|---|---|---|---|
| 1 | 1 | 0 | - | 1 | 0 | - |
| 2 | 1.6 | 1 | 160.00% | 1.1 | 1 | 110.00% |
| 3 | 2.56 | 2.6 | 98.46% | 1.21 | 2.1 | 57.62% |
| 4 | 4.096 | 5.16 | 79.38% | 1.331 | 3.31 | 40.21% |
| 5 | 6.5536 | 9.256 | 70.80% | 1.4641 | 4.641 | 31.55% |
| 6 | 10.48576 | 15.8096 | 66.33% | 1.61051 | 6.1051 | 26.38% |
| 7 | 16.777216 | 26.29536 | 63.80% | 1.771561 | 7.71561 | 22.96% |
| 8 | 26.8435456 | 43.072576 | 62.32% | 1.9487171 | 9.487171 | 20.54% |
| 9 | 42.94967296 | 69.9161216 | 61.43% | 2.14358881 | 11.4358881 | 18.74% |
| 10 | 68.71947674 | 112.8657946 | 60.89% | 2.357947691 | 13.57947691 | 17.36% |
| 11 | 109.9511628 | 181.5852713 | 60.55% | 2.59374246 | 15.9374246 | 16.27% |
| 12 | 120 | 291.5364341 | 41.16% | 2.853116706 | 18.53116706 | 15.40% |
| 13 | 120 | 411.5364341 | 29.16% | 3.138428377 | 21.38428377 | 14.68% |
| 14 | 120 | 531.5364341 | 22.58% | 3.452271214 | 24.52271214 | 14.08% |
| 15 | 120 | 651.5364341 | 18.42% | 3.797498336 | 27.97498336 | 13.57% |
| 16 | 120 | 771.5364341 | 15.55% | 4.177248169 | 31.77248169 | 13.15% |
| 17 | 120 | 891.5364341 | 13.46% | 4.594972986 | 35.94972986 | 12.78% |
| 18 | 120 | 1011.536434 | 11.86% | 5.054470285 | 40.54470285 | 12.47% |
| 19 | 120 | 1131.536434 | 10.61% | 5.559917313 | 45.59917313 | 12.19% |
| 20 | 120 | 1251.536434 | 9.59% | 6.115909045 | 51.15909045 | 11.95% |
| 21 | 120 | 1371.536434 | 8.75% | 6.727499949 | 57.27499949 | 11.75% |
| 22 | 120 | 1491.536434 | 8.05% | 7.400249944 | 64.00249944 | 11.56% |
| 23 | 120 | 1611.536434 | 7.45% | 8.140274939 | 71.40274939 | 11.40% |
| 24 | 120 | 1731.536434 | 6.93% | 8.954302433 | 79.54302433 | 11.26% |
| 25 | 120 | 1851.536434 | 6.48% | 9.849732676 | 88.49732676 | 11.13% |
| 26 | 120 | 1971.536434 | 6.09% | 10.83470594 | 98.34705943 | 11.02% |
| 27 | 120 | 2091.536434 | 5.74% | 11.91817654 | 109.1817654 | 10.92% |
| 28 | 120 | 2211.536434 | 5.43% | 13.10999419 | 121.0999419 | 10.83% |
| 29 | 120 | 2331.536434 | 5.15% | 14.42099361 | 134.2099361 | 10.75% |
| 30 | 120 | 2451.536434 | 4.89% | 15.86309297 | 148.6309297 | 10.67% |
| 31 | 120 | 2571.536434 | 4.67% | 17.44940227 | 164.4940227 | 10.61% |
| 32 | 120 | 2691.536434 | 4.46% | 19.1943425 | 181.943425 | 10.55% |
| 33 | 120 | 2811.536434 | 4.27% | 21.11377675 | 201.1377675 | 10.50% |
| 34 | 120 | 2931.536434 | 4.09% | 23.22515442 | 222.2515442 | 10.45% |
| 35 | 120 | 3051.536434 | 3.93% | 25.54766986 | 245.4766986 | 10.41% |
| 36 | 120 | 3171.536434 | 3.78% | 28.10243685 | 271.0243685 | 10.37% |
| 37 | 120 | 3291.536434 | 3.65% | 30.91268053 | 299.1268053 | 10.33% |

The percentage of wait time is really high in the 1-300s range for the 1.6 multiplier. We would probably like to be able to go even lower, to e.g. 1.05.
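The table above can be reproduced with a short sketch. The formula assumed here matches the table's construction: effective backoff for round n is min(initialBackoff × multiplier^(n-1), maxBackoff), with initial backoff 1s and a 120s (2-minute) cap, jitter ignored:

```java
public class BackoffTable {
  static final double INITIAL = 1.0; // initial backoff, seconds
  static final double MAX = 120.0;   // the 2-minute cap, seconds

  // Effective backoff before round n: min(INITIAL * mult^(n-1), MAX).
  static double backoff(int round, double multiplier) {
    return Math.min(INITIAL * Math.pow(multiplier, round - 1), MAX);
  }

  // Cumulative wait before round n begins: sum of all earlier backoffs.
  static double cumulative(int round, double multiplier) {
    double sum = 0;
    for (int r = 1; r < round; r++) {
      sum += backoff(r, multiplier);
    }
    return sum;
  }

  public static void main(String[] args) {
    System.out.println("round  backoff(1.6)  cumulative(1.6)  backoff(1.1)  cumulative(1.1)");
    for (int round = 1; round <= 37; round++) {
      System.out.printf("%5d  %12.6f  %15.6f  %12.6f  %15.6f%n",
          round,
          backoff(round, 1.6), cumulative(round, 1.6),
          backoff(round, 1.1), cumulative(round, 1.1));
    }
  }
}
```

For example, round 11 with multiplier 1.6 gives 1.6^10 ≈ 109.95s, and the cap kicks in at round 12.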

I understand that having such a feature may lead to misuse. However, I suspect that without the ability to control the polling frequency, people will just hack together their own polling sidecars, which will benefit no one.

sfc-gh-kyurtsever avatar Jul 07 '25 21:07 sfc-gh-kyurtsever

I had played with the data myself earlier and was sort of thinking 1.1 or 1.05 were the likely lowest points. So the fact that you reached a similar conclusion seems good.

I'm tempted to go with 1.1, as it is close to 10% for most of the range before hitting 120. Earlier in the range, it is only ~5s before hitting ~10%, which seems quite fair.

I think 1.05 is the absolute bare minimum that I might consider. It takes over 7 minutes before the backoff exceeds the CONNECT_TIMEOUT=20s, which seems like a really long time.

One thing to understand is that connect() itself has an exponential backoff with a multiplier of 2 that you are likely to hit, since the VM can easily be a black hole during part of its booting. So you'll probably be seeing ~8s of backoffs even in the first connection attempt just within the kernel.
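As a rough illustration of that kernel-level backoff, assuming a Linux-style initial SYN retransmission timeout of about 1 second that doubles on each attempt (actual values vary by OS and tuning), the retransmits that fit inside a 20s connect timeout can be computed like this:

```java
import java.util.ArrayList;
import java.util.List;

public class SynBackoff {
  // Times (seconds after the first SYN) at which the kernel would
  // retransmit, given a doubling retransmission timeout, stopping
  // once the next retransmit would land past the connect timeout.
  static List<Double> retransmitTimes(double initialRto, double connectTimeout) {
    List<Double> times = new ArrayList<>();
    double t = 0;
    double rto = initialRto;
    while (t + rto < connectTimeout) {
      t += rto;        // next retransmit fires after the current RTO
      times.add(t);
      rto *= 2;        // exponential backoff with multiplier 2
    }
    return times;
  }

  public static void main(String[] args) {
    // With a 1s initial RTO and a 20s connect timeout:
    System.out.println(retransmitTimes(1.0, 20.0)); // [1.0, 3.0, 7.0, 15.0]
  }
}
```

Under these assumptions the kernel retries at roughly t = 1, 3, 7, and 15 seconds before the 20-second connect timeout fires, which is where the several seconds of in-kernel backoff per attempt come from.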

ejona86 avatar Jul 07 '25 22:07 ejona86