Support grpc readiness probing use case
Is your feature request related to a problem?
We have a scenario where we start a VM and want to detect when a gRPC service on that VM is up. We want the communication to be directed from the starting host to the VM rather than the other way round: many VMs are started at the same time, and we want to avoid overloading the host initiating the creations while parallelizing VM creation as much as possible. We would therefore like a way, using gRPC, to check whether a gRPC connection is possible, while minimizing both the resources spent on polling on the initiating side and the time between the service coming up and the connection being established.
Describe the solution you'd like
We were expecting to rely on gRPC's underlying TCP reconnection framework: using `channel.getState()` and `channel.notifyWhenStateChanged()` we are able to implement the polling. However the 2 minute maximum backoff is slightly larger than we would like - effectively we would like a fairly steady rate of polling.
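Roughly, the polling we have in mind looks like this (a minimal sketch against the `ManagedChannel` API; the target address is a placeholder):

```java
import io.grpc.ConnectivityState;
import io.grpc.ManagedChannel;
import io.grpc.ManagedChannelBuilder;
import java.util.concurrent.CountDownLatch;

public final class ReadinessWatch {
  // Re-arms a state-change callback until the channel reaches READY.
  static void awaitReady(ManagedChannel channel, CountDownLatch done) {
    // Passing true asks the channel to leave IDLE and start connecting.
    ConnectivityState state = channel.getState(true);
    if (state == ConnectivityState.READY) {
      done.countDown();
      return;
    }
    // The channel keeps retrying with its built-in exponential backoff
    // (1.6x multiplier, 120s cap by default); we just observe the state.
    channel.notifyWhenStateChanged(state, () -> awaitReady(channel, done));
  }

  public static void main(String[] args) throws InterruptedException {
    // "vm-b1.example:50051" stands in for one of the VMs being started.
    ManagedChannel channel =
        ManagedChannelBuilder.forTarget("vm-b1.example:50051").usePlaintext().build();
    CountDownLatch ready = new CountDownLatch(1);
    awaitReady(channel, ready);
    ready.await();
    System.out.println("gRPC service is reachable");
    channel.shutdownNow();
  }
}
```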
Describe alternatives you've considered
We can mostly reimplement our own backoff policy by calling `resetConnectBackoff()` at appropriate times (sketched below), but that feels like abusing the API. We could also implement the readiness checking outside of gRPC, but it seems that most of the infrastructure is already there, so it would be nice to be able to reuse it.
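For illustration, the workaround would look something like this sketch (the interval is arbitrary, and as far as I can tell `resetConnectBackoff()` is still marked experimental):

```java
import io.grpc.ManagedChannel;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

final class SteadyReconnects {
  // Defeats the channel's backoff by clearing it on a fixed schedule,
  // so the next connection attempt happens (nearly) immediately.
  static ScheduledExecutorService force(ManagedChannel channel, long intervalSeconds) {
    ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
    scheduler.scheduleAtFixedRate(
        channel::resetConnectBackoff, intervalSeconds, intervalSeconds, TimeUnit.SECONDS);
    return scheduler; // caller should shut this down once the channel is READY
  }
}
```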
Additional context
I'm aware of https://github.com/grpc/grpc-java/issues/9353 and https://github.com/grpc/grpc-java/issues/10932. That being said, the latter issue had a slightly different use case, and in the former the main blocker seemed to be the lack of a clear motivating problem - here I'm presenting one, so I'd like feedback on whether this idea makes sense for this use case.
You said you "want to have the communication to be directed from the starting host to the VM rather than the other way round" but also said "we want to avoid overloading the host initiating the creations". If the gRPC client is on the starting host, then the host is the entity that would be creating the connections. Can you clarify?
Sorry for the confusion - in "want to have the communication to be directed from the starting host to the VM rather than the other way round", the "starting host" is the host starting the VM. Let me rephrase this more clearly.
We have a host A which starts a number of VMs - let's call them B1, B2, B3, and so on. Host A wants to detect when a gRPC service Foo is ready on each of the B1, B2, B3, ... hosts. We could solve this by having B1, B2, B3, ... call back host A once they are ready, but we want to avoid the risk of overloading host A. We would therefore like a polling mechanism so that host A can quickly detect, for each of the B1, B2, B3, ... hosts, whether service Foo is ready.
Does that help?
In your case the average amount of time (1 minute) is probably less useful, because you are waiting until N VMs are started which means the tail will dominate.
> However the 2 minute maximum backoff is slightly larger than we would like
"Slightly," like 1 minute? You're expecting the VMs to take longer than 6+ minutes to start? It takes that long before the 2 minute maximum makes a difference.
> effectively we would like a fairly steady rate of polling.
So would everyone else with a client, except the servers really wish they wouldn't. The problem with giving you that option is that others will use it as well. Given what people have already requested, whatever we offer here will be abused to its fullest extent.
I mentioned in the past that adjusting the multiplier is less of a concern. I don't think the maximum delay is the important part. It is easy to think about, but if you're expecting the connection to be unavailable for minutes, it seems you should be looking at this as a percentage of the overall delay. gRPC contributing a 2 minute delay is quite different if the host has been unavailable for 10 seconds vs a week. So I'll probably be looking at the delay as a percentage of how long it takes a VM to start.
> In your case the average amount of time (1 minute) is probably less useful, because you are waiting until N VMs are started which means the tail will dominate.

The tail doesn't dominate because every VM that we deem ready immediately goes into service.

> "Slightly," like 1 minute? You're expecting the VMs to take longer than 6+ minutes to start? It takes that long before the 2 minute maximum makes a difference.
We are capping the wait at 5 minutes. A lower maximum backoff would help if it were low enough, but indeed it's not the only option, and modifying the multiplier also works.
The problem we are facing is that, with the current configuration, the delays caused by the backoff would be quite large as a percentage even in the range where the VM is most likely to become available (depending on the environment, we are looking at anywhere between 15s and 2-3 minutes).
I've compiled a comparison of wait times for the current values (based on the code, ignoring jitter for simplicity) against the values if the multiplier were lowered to e.g. 1.1:
| Round | Backoff (s, multiplier 1.6) | Cumulative wait (s) | Backoff as % of cumulative | Backoff (s, multiplier 1.1) | Cumulative wait (s) | Backoff as % of cumulative |
|---|---|---|---|---|---|---|
| 1 | 1 | 0 | - | 1 | 0 | - |
| 2 | 1.6 | 1 | 160.00% | 1.1 | 1 | 110.00% |
| 3 | 2.56 | 2.6 | 98.46% | 1.21 | 2.1 | 57.62% |
| 4 | 4.096 | 5.16 | 79.38% | 1.331 | 3.31 | 40.21% |
| 5 | 6.5536 | 9.256 | 70.80% | 1.4641 | 4.641 | 31.55% |
| 6 | 10.48576 | 15.8096 | 66.33% | 1.61051 | 6.1051 | 26.38% |
| 7 | 16.777216 | 26.29536 | 63.80% | 1.771561 | 7.71561 | 22.96% |
| 8 | 26.8435456 | 43.072576 | 62.32% | 1.9487171 | 9.487171 | 20.54% |
| 9 | 42.94967296 | 69.9161216 | 61.43% | 2.14358881 | 11.4358881 | 18.74% |
| 10 | 68.71947674 | 112.8657946 | 60.89% | 2.357947691 | 13.57947691 | 17.36% |
| 11 | 109.9511628 | 181.5852713 | 60.55% | 2.59374246 | 15.9374246 | 16.27% |
| 12 | 120 | 291.5364341 | 41.16% | 2.853116706 | 18.53116706 | 15.40% |
| 13 | 120 | 411.5364341 | 29.16% | 3.138428377 | 21.38428377 | 14.68% |
| 14 | 120 | 531.5364341 | 22.58% | 3.452271214 | 24.52271214 | 14.08% |
| 15 | 120 | 651.5364341 | 18.42% | 3.797498336 | 27.97498336 | 13.57% |
| 16 | 120 | 771.5364341 | 15.55% | 4.177248169 | 31.77248169 | 13.15% |
| 17 | 120 | 891.5364341 | 13.46% | 4.594972986 | 35.94972986 | 12.78% |
| 18 | 120 | 1011.536434 | 11.86% | 5.054470285 | 40.54470285 | 12.47% |
| 19 | 120 | 1131.536434 | 10.61% | 5.559917313 | 45.59917313 | 12.19% |
| 20 | 120 | 1251.536434 | 9.59% | 6.115909045 | 51.15909045 | 11.95% |
| 21 | 120 | 1371.536434 | 8.75% | 6.727499949 | 57.27499949 | 11.75% |
| 22 | 120 | 1491.536434 | 8.05% | 7.400249944 | 64.00249944 | 11.56% |
| 23 | 120 | 1611.536434 | 7.45% | 8.140274939 | 71.40274939 | 11.40% |
| 24 | 120 | 1731.536434 | 6.93% | 8.954302433 | 79.54302433 | 11.26% |
| 25 | 120 | 1851.536434 | 6.48% | 9.849732676 | 88.49732676 | 11.13% |
| 26 | 120 | 1971.536434 | 6.09% | 10.83470594 | 98.34705943 | 11.02% |
| 27 | 120 | 2091.536434 | 5.74% | 11.91817654 | 109.1817654 | 10.92% |
| 28 | 120 | 2211.536434 | 5.43% | 13.10999419 | 121.0999419 | 10.83% |
| 29 | 120 | 2331.536434 | 5.15% | 14.42099361 | 134.2099361 | 10.75% |
| 30 | 120 | 2451.536434 | 4.89% | 15.86309297 | 148.6309297 | 10.67% |
| 31 | 120 | 2571.536434 | 4.67% | 17.44940227 | 164.4940227 | 10.61% |
| 32 | 120 | 2691.536434 | 4.46% | 19.1943425 | 181.943425 | 10.55% |
| 33 | 120 | 2811.536434 | 4.27% | 21.11377675 | 201.1377675 | 10.50% |
| 34 | 120 | 2931.536434 | 4.09% | 23.22515442 | 222.2515442 | 10.45% |
| 35 | 120 | 3051.536434 | 3.93% | 25.54766986 | 245.4766986 | 10.41% |
| 36 | 120 | 3171.536434 | 3.78% | 28.10243685 | 271.0243685 | 10.37% |
| 37 | 120 | 3291.536434 | 3.65% | 30.91268053 | 299.1268053 | 10.33% |
The backoff as a percentage of wait time is really high in the 1-300s range for the 1.6 multiplier. We would probably like to be able to go even lower, e.g. to 1.05.
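For reference, the table above can be regenerated with a short sketch like this one (my own, using the 1s initial backoff and 120s cap defaults from the connection backoff spec, jitter ignored):

```java
// Regenerates the table above: per-round backoff, cumulative wait, and the
// backoff as a percentage of the cumulative wait, with jitter ignored.
final class BackoffTable {
  public static void main(String[] args) {
    final double initial = 1.0, cap = 120.0; // spec defaults: 1s initial, 120s max
    for (double multiplier : new double[] {1.6, 1.1}) {
      System.out.printf("multiplier %.2f%n", multiplier);
      double backoff = initial, cumulative = 0.0;
      for (int round = 1; round <= 37; round++) {
        double pct = cumulative > 0 ? 100.0 * backoff / cumulative : Double.NaN;
        System.out.printf("%2d  backoff=%10.4fs  cumulative=%11.4fs  %7.2f%%%n",
            round, backoff, cumulative, pct);
        cumulative += backoff;                       // wait incurred before the next attempt
        backoff = Math.min(backoff * multiplier, cap); // exponential growth, capped
      }
    }
  }
}
```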
I understand that having such a feature may lead to misuse. However, I suspect that without the ability to control the polling frequency people will just hack together their own polling sidecars, which will benefit no one.
I had played with the data myself earlier and was sort of thinking 1.1 or 1.05 were the likely lowest points. So the fact that you reached a similar conclusion seems good.
I'm tempted to go with 1.1, as it is close to 10% for most of the range before hitting the 120s cap. Earlier in the range, it is only ~5s before hitting ~10%, which seems quite fair.
I think 1.05 is the absolute bare minimum that I might consider. It takes over 7 minutes before the backoff exceeds the CONNECT_TIMEOUT=20s, which seems like a really long time.
One thing to understand is that connect() itself has an exponential backoff with a multiplier of 2 that you are likely to hit, since the VM can easily be a black hole during part of its booting. (TCP SYN retransmissions typically start around 1s and double each time, so 1+2+4 ≈ 7-8s of waiting fits within a 20s connect timeout.) So you'll probably be seeing ~8s of backoffs even in the first connection attempt, just within the kernel.