Support grpc readiness probing use case
Is your feature request related to a problem?
We have a scenario where we start a VM and want to detect when a gRPC service on that VM is up. We want the communication to be directed from the starting host to the VM rather than the other way round: many VMs are started at the same time, and we want to avoid overloading the host initiating the creations while parallelizing VM creation as much as possible. We would therefore like a way, using gRPC, to check whether a gRPC connection is possible, while minimizing both the resources spent on polling on the initiating side and the time between the service coming up and the connection being established.
Describe the solution you'd like
We were expecting to rely on gRPC's underlying TCP reconnection framework: using `channel.getState()` and `channel.notifyWhenStateChanged()` we are able to implement the polling. However the 2 minute maximum backoff is slightly larger than we would like - effectively we would like a fairly steady rate of polling.
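Roughly, the polling we have in mind looks like this (a minimal sketch against the `ManagedChannel` API; the target address is a placeholder):

```java
import io.grpc.ConnectivityState;
import io.grpc.ManagedChannel;
import io.grpc.ManagedChannelBuilder;
import java.util.concurrent.CountDownLatch;

public final class ReadinessWatch {
  // Re-arms a state-change callback until the channel reaches READY.
  static void awaitReady(ManagedChannel channel, CountDownLatch done) {
    // Passing true asks the channel to leave IDLE and start connecting.
    ConnectivityState state = channel.getState(true);
    if (state == ConnectivityState.READY) {
      done.countDown();
      return;
    }
    // The channel keeps retrying with its built-in exponential backoff
    // (1.6x multiplier, 120s cap by default); we just observe the state.
    channel.notifyWhenStateChanged(state, () -> awaitReady(channel, done));
  }

  public static void main(String[] args) throws InterruptedException {
    // "vm-b1.example:50051" stands in for one of the VMs being started.
    ManagedChannel channel =
        ManagedChannelBuilder.forTarget("vm-b1.example:50051").usePlaintext().build();
    CountDownLatch ready = new CountDownLatch(1);
    awaitReady(channel, ready);
    ready.await();
    System.out.println("gRPC service is reachable");
    channel.shutdownNow();
  }
}
```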
Describe alternatives you've considered
We can mostly reimplement our own backoff policy by calling `resetConnectBackoff()` at appropriate times (sketched below), but that feels like abusing the API. We could also implement the readiness checking outside of gRPC, but it seems that most of the infrastructure is already there, so it would be nice to be able to reuse it.
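For illustration, the workaround would look something like this sketch (the interval is arbitrary, and as far as I can tell `resetConnectBackoff()` is still marked experimental):

```java
import io.grpc.ManagedChannel;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

final class SteadyReconnects {
  // Defeats the channel's backoff by clearing it on a fixed schedule,
  // so the next connection attempt happens (nearly) immediately.
  static ScheduledExecutorService force(ManagedChannel channel, long intervalSeconds) {
    ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
    scheduler.scheduleAtFixedRate(
        channel::resetConnectBackoff, intervalSeconds, intervalSeconds, TimeUnit.SECONDS);
    return scheduler; // caller should shut this down once the channel is READY
  }
}
```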
Additional context
I'm aware of https://github.com/grpc/grpc-java/issues/9353 and https://github.com/grpc/grpc-java/issues/10932. That being said, the latter issue had a slightly different use case, and in the former the main blocker seemed to be the lack of a clear motivating problem - here I'm presenting one, so I'd like feedback on whether this idea makes sense for this use case.
You said you "want to have the communication to be directed from the starting host to the VM rather than the other way round" but also said "we want to avoid overloading the host initiating the creations". If the gRPC client is on the starting host, then the host is the entity that would be creating the connections. Can you clarify?
Sorry for the confusion - in "want to have the communication to be directed from the starting host to the VM rather than the other way round", the "starting host" is the host starting the VM. Let me rephrase this more clearly.
We have a host A which starts a number of VMs - let's call them B1, B2, B3, and so on. Host A wants to detect when a gRPC service Foo is ready on each of the B1, B2, B3, ... hosts. We could solve this by having B1, B2, B3, ... call back host A once they are ready, but we want to avoid the risk of overloading host A. We would therefore like a polling mechanism so that host A can quickly detect, for each of the B1, B2, B3, ... hosts, whether service Foo is ready.
Does that help?
In your case the average amount of time (1 minute) is probably less useful, because you are waiting until N VMs are started which means the tail will dominate.
> However the 2 minute maximum backoff is slightly larger than we would like
"Slightly," like 1 minute? You're expecting the VMs to take longer than 6+ minutes to start? It takes that long before the 2 minute maximum makes a difference.
> effectively we would like a fairly steady rate of polling.
So would everyone else with a client, except the servers really wish they wouldn't. The problem with giving you that option is that others will use it as well. Given what people have already requested, whatever we offer here will be abused to its fullest extent.
I mentioned in the past that adjusting the multiplier is less of a concern. I don't think the maximum delay is the important part. It is easy to think about, but if you're expecting the connection to be unavailable for minutes, it seems you should be looking at this as a percentage of the overall delay. gRPC contributing a 2 minute delay is quite different if the host has been unavailable for 10 seconds vs a week. So I'll probably be looking at the delay as a percentage of how long it takes a VM to start.
> In your case the average amount of time (1 minute) is probably less useful, because you are waiting until N VMs are started which means the tail will dominate.

The tail doesn't dominate because every VM that we deem ready immediately goes into service.

> "Slightly," like 1 minute? You're expecting the VMs to take longer than 6+ minutes to start? It takes that long before the 2 minute maximum makes a difference.
We are capping the wait at 5 minutes. A lower maximum backoff would help if it were low enough, but indeed it's not the only option, and modifying the multiplier also works.
The problem we are facing is that, with the current configuration, the delays caused by the backoff would be quite large as a percentage even in the range where the VM is most likely to become available (depending on the environment, we are looking at anywhere between 15s and 2-3 minutes).
I've compiled a comparison of wait times for the current values (based on the code, ignoring jitter for simplicity) against the values if the multiplier were lowered to e.g. 1.1:
| Round | Backoff (s, multiplier 1.6) | Cumulative wait (s) | Backoff as % of cumulative | Backoff (s, multiplier 1.1) | Cumulative wait (s) | Backoff as % of cumulative |
|---|---|---|---|---|---|---|
| 1 | 1 | 0 | - | 1 | 0 | - |
| 2 | 1.6 | 1 | 160.00% | 1.1 | 1 | 110.00% |
| 3 | 2.56 | 2.6 | 98.46% | 1.21 | 2.1 | 57.62% |
| 4 | 4.096 | 5.16 | 79.38% | 1.331 | 3.31 | 40.21% |
| 5 | 6.5536 | 9.256 | 70.80% | 1.4641 | 4.641 | 31.55% |
| 6 | 10.48576 | 15.8096 | 66.33% | 1.61051 | 6.1051 | 26.38% |
| 7 | 16.777216 | 26.29536 | 63.80% | 1.771561 | 7.71561 | 22.96% |
| 8 | 26.8435456 | 43.072576 | 62.32% | 1.9487171 | 9.487171 | 20.54% |
| 9 | 42.94967296 | 69.9161216 | 61.43% | 2.14358881 | 11.4358881 | 18.74% |
| 10 | 68.71947674 | 112.8657946 | 60.89% | 2.357947691 | 13.57947691 | 17.36% |
| 11 | 109.9511628 | 181.5852713 | 60.55% | 2.59374246 | 15.9374246 | 16.27% |
| 12 | 120 | 291.5364341 | 41.16% | 2.853116706 | 18.53116706 | 15.40% |
| 13 | 120 | 411.5364341 | 29.16% | 3.138428377 | 21.38428377 | 14.68% |
| 14 | 120 | 531.5364341 | 22.58% | 3.452271214 | 24.52271214 | 14.08% |
| 15 | 120 | 651.5364341 | 18.42% | 3.797498336 | 27.97498336 | 13.57% |
| 16 | 120 | 771.5364341 | 15.55% | 4.177248169 | 31.77248169 | 13.15% |
| 17 | 120 | 891.5364341 | 13.46% | 4.594972986 | 35.94972986 | 12.78% |
| 18 | 120 | 1011.536434 | 11.86% | 5.054470285 | 40.54470285 | 12.47% |
| 19 | 120 | 1131.536434 | 10.61% | 5.559917313 | 45.59917313 | 12.19% |
| 20 | 120 | 1251.536434 | 9.59% | 6.115909045 | 51.15909045 | 11.95% |
| 21 | 120 | 1371.536434 | 8.75% | 6.727499949 | 57.27499949 | 11.75% |
| 22 | 120 | 1491.536434 | 8.05% | 7.400249944 | 64.00249944 | 11.56% |
| 23 | 120 | 1611.536434 | 7.45% | 8.140274939 | 71.40274939 | 11.40% |
| 24 | 120 | 1731.536434 | 6.93% | 8.954302433 | 79.54302433 | 11.26% |
| 25 | 120 | 1851.536434 | 6.48% | 9.849732676 | 88.49732676 | 11.13% |
| 26 | 120 | 1971.536434 | 6.09% | 10.83470594 | 98.34705943 | 11.02% |
| 27 | 120 | 2091.536434 | 5.74% | 11.91817654 | 109.1817654 | 10.92% |
| 28 | 120 | 2211.536434 | 5.43% | 13.10999419 | 121.0999419 | 10.83% |
| 29 | 120 | 2331.536434 | 5.15% | 14.42099361 | 134.2099361 | 10.75% |
| 30 | 120 | 2451.536434 | 4.89% | 15.86309297 | 148.6309297 | 10.67% |
| 31 | 120 | 2571.536434 | 4.67% | 17.44940227 | 164.4940227 | 10.61% |
| 32 | 120 | 2691.536434 | 4.46% | 19.1943425 | 181.943425 | 10.55% |
| 33 | 120 | 2811.536434 | 4.27% | 21.11377675 | 201.1377675 | 10.50% |
| 34 | 120 | 2931.536434 | 4.09% | 23.22515442 | 222.2515442 | 10.45% |
| 35 | 120 | 3051.536434 | 3.93% | 25.54766986 | 245.4766986 | 10.41% |
| 36 | 120 | 3171.536434 | 3.78% | 28.10243685 | 271.0243685 | 10.37% |
| 37 | 120 | 3291.536434 | 3.65% | 30.91268053 | 299.1268053 | 10.33% |
The backoff as a percentage of wait time is really high in the 1-300s range for the 1.6 multiplier. We would probably like to be able to go even lower, e.g. to 1.05.
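For reference, the table above can be regenerated with a short sketch like this one (my own, using the 1s initial backoff and 120s cap defaults from the connection backoff spec, jitter ignored):

```java
// Regenerates the table above: per-round backoff, cumulative wait, and the
// backoff as a percentage of the cumulative wait, with jitter ignored.
final class BackoffTable {
  public static void main(String[] args) {
    final double initial = 1.0, cap = 120.0; // spec defaults: 1s initial, 120s max
    for (double multiplier : new double[] {1.6, 1.1}) {
      System.out.printf("multiplier %.2f%n", multiplier);
      double backoff = initial, cumulative = 0.0;
      for (int round = 1; round <= 37; round++) {
        double pct = cumulative > 0 ? 100.0 * backoff / cumulative : Double.NaN;
        System.out.printf("%2d  backoff=%10.4fs  cumulative=%11.4fs  %7.2f%%%n",
            round, backoff, cumulative, pct);
        cumulative += backoff;                       // wait incurred before the next attempt
        backoff = Math.min(backoff * multiplier, cap); // exponential growth, capped
      }
    }
  }
}
```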
I understand that having such a feature may lead to misuse. However, I suspect that without the ability to control the polling frequency people will just hack together their own polling sidecars, which will benefit no one.
I had played with the data myself earlier and was sort of thinking 1.1 or 1.05 were the likely lowest points. So the fact that you reached a similar conclusion seems good.
I'm tempted to go with 1.1, as it is close to 10% for most of the range before hitting the 120s cap. Earlier in the range, it is only ~5s before hitting ~10%, which seems quite fair.
I think 1.05 is the absolute bare minimum that I might consider. It takes over 7 minutes before the backoff exceeds the CONNECT_TIMEOUT=20s, which seems like a really long time.
One thing to understand is that connect() itself has an exponential backoff with a multiplier of 2 that you are likely to hit, since the VM can easily be a black hole during part of its booting. (TCP SYN retransmissions typically start around 1s and double each time, so 1+2+4 ≈ 7-8s of waiting fits within a 20s connect timeout.) So you'll probably be seeing ~8s of backoffs even in the first connection attempt, just within the kernel.