Add KeepAliveParameters to agent client
Tracking issue
https://github.com/flyteorg/flyte/issues/3936
Describe your changes
Agents use headless services to balance client load. However, if HPA is used, propeller needs to be restarted to pick up the new set of pod IP addresses.
We could add KeepAliveParameters to the gRPC client so that propeller refreshes its set of IPs every 10s by default (a sketch follows the link below).
- https://discuss.kubernetes.io/t/how-to-do-load-balance-with-grpc-connection-when-hpa-autoscaling-is-enabled/16612
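For illustration, a minimal Go sketch of what wiring keepalive into the client dial options could look like. The endpoint and the 10s/20s values here are placeholders, not the actual Flyte config; in propeller these would come from the agent service configuration:

```go
package main

import (
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	"google.golang.org/grpc/keepalive"
)

func main() {
	// Hypothetical endpoint; in Flyte this would come from the agent
	// service config rather than a literal.
	endpoint := "flyteagent.flyte.svc.cluster.local:8000"

	// Send an HTTP/2 PING every 10s even when there are no active RPCs,
	// and consider the connection dead if no ack arrives within 20s.
	kap := keepalive.ClientParameters{
		Time:                10 * time.Second,
		Timeout:             20 * time.Second,
		PermitWithoutStream: true,
	}

	conn, err := grpc.Dial(endpoint,
		grpc.WithTransportCredentials(insecure.NewCredentials()),
		grpc.WithKeepaliveParams(kap),
	)
	if err != nil {
		panic(err)
	}
	defer conn.Close()
}
```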
Check all the applicable boxes
- [ ] I updated the documentation accordingly.
- [x] All new and existing tests passed.
- [x] All commits are signed-off.
Screenshots
Note to reviewers
cc @honnix, mind taking a look?
Codecov Report
Attention: Patch coverage is 67.50000% with 52 lines in your changes missing coverage. Please review.
Project coverage is 59.32%. Comparing base (b35cc95) to head (6c43ad3). Report is 964 commits behind head on master.
:exclamation: Current head 6c43ad3 differs from pull request most recent head 2a55000
Please upload reports for the commit 2a55000 to get more accurate results.
Additional details and impacted files
@@ Coverage Diff @@
## master #4157 +/- ##
==========================================
+ Coverage 58.98% 59.32% +0.34%
==========================================
Files 618 550 -68
Lines 52708 39699 -13009
==========================================
- Hits 31088 23551 -7537
+ Misses 19140 13828 -5312
+ Partials 2480 2320 -160
| Flag | Coverage Δ | |
|---|---|---|
| unittests | ? | |
Nice finding! I did not know there are no reasonable defaults for these. I guess we need the same thing for https://github.com/flyteorg/flytepropeller/blob/0fcc1da879333af1282ee1142651e191eb1d6bb4/pkg/controller/nodes/catalog/config.go#L44, and https://github.com/flyteorg/flyteidl/blob/85bfc8ff1c36b2d37a4fcc68a2be1911a9fc3940/clients/go/admin/client.go#L173 ?
After reading the doc more in-depth, I'm wondering whether this keepalive is intended for IP refreshing. It doesn't seem so to me. I think the problem might be more related to the name resolver; the ping itself may not trigger name resolution refreshing.
More info:
- https://grpc.io/docs/guides/custom-name-resolution/
- https://learn.microsoft.com/en-us/aspnet/core/grpc/loadbalancing?view=aspnetcore-7.0#dns-address-caching (it's dotnet but should be general enough)
That being said, I think this configuration still makes sense; I'm just not sure it solves the problem. Internally we have a custom name resolver that goes together with client-side load balancing, so we have not seen this problem.
It did solve the issue after adding these configs. Do you use HPA internally? The problem is that I need to restart propeller if I scale up the agent; otherwise, propeller won't send requests to the new pods.
- https://github.com/grpc/grpc/issues/12295#issuecomment-893618782
- https://github.com/grpc/grpc-go/blob/9e1fc3e9c088387890f5a303b805d320f11cc6dd/examples/features/name_resolving/client/main.go#L92-L94

You're right, updating the resolver might be a better solution. Let me add one and test it.
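Based on the linked grpc-go example, a sketch of what that could look like: dialing through the `dns:///` resolver scheme with a `round_robin` service config, so the client spreads RPCs across all the pod IPs the headless service resolves to. The endpoint name is hypothetical:

```go
package main

import (
	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

func main() {
	// "dns:///" selects the DNS resolver; a headless service returns one
	// A record per pod, so the client sees every backend address.
	// Hypothetical endpoint, for illustration only.
	target := "dns:///flyteagent.flyte.svc.cluster.local:8000"

	conn, err := grpc.Dial(target,
		// round_robin opens a subchannel per resolved address instead of
		// pinning all traffic to the first one (pick_first, the default).
		grpc.WithDefaultServiceConfig(`{"loadBalancingConfig": [{"round_robin":{}}]}`),
		grpc.WithTransportCredentials(insecure.NewCredentials()),
	)
	if err != nil {
		panic(err)
	}
	defer conn.Close()
}
```

Note that the DNS resolver generally re-resolves when a connection breaks, so scale-downs get noticed; scale-ups may still need something to trigger re-resolution, such as a server-side max connection age.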
I actually can't explain how this regular ping would trigger name resolution, or why it would be any different from normal traffic when it comes to name resolution. Without this configuration, if there is no normal traffic there is no resolution, and when traffic comes back it should work the same as sending a ping, triggering resolution if the DNS cache has timed out. Did you find any doc stating that the ping explicitly forces resolution?
Yes, we use HPA.
Cleaning stale PRs. Please reopen if you want to discuss this further.