Add KeepAliveParameters to agent client

Open · pingsutw opened this pull request 2 years ago · 7 comments

Tracking issue

https://github.com/flyteorg/flyte/issues/3936

Describe your changes

Agents use a headless service to balance client load across replicas. However, if HPA scales the agents, propeller currently has to be restarted before it picks up the new set of pod IP addresses.

This PR adds KeepAliveParameters to the gRPC client so that propeller can pick up a new set of IPs, with a 10-second interval by default (a minimal sketch of the client-side change follows the reference below).

  • https://discuss.kubernetes.io/t/how-to-do-load-balance-with-grpc-connection-when-hpa-autoscaling-is-enabled/16612
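
A rough sketch of the client-side change, assuming grpc-go; the helper name, endpoint handling, and insecure credentials below are placeholders for illustration, not the actual flyteplugins wiring:

```go
package agentclient

import (
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	"google.golang.org/grpc/keepalive"
)

// dialAgent is a hypothetical helper that shows only the keepalive options.
func dialAgent(endpoint string) (*grpc.ClientConn, error) {
	return grpc.Dial(
		endpoint,
		grpc.WithTransportCredentials(insecure.NewCredentials()), // illustration only; the real client uses the configured credentials
		grpc.WithKeepaliveParams(keepalive.ClientParameters{
			Time:                10 * time.Second, // send a ping every 10s when the connection is idle
			Timeout:             5 * time.Second,  // drop the connection if a ping is not acknowledged within 5s
			PermitWithoutStream: true,             // keep pinging even when there are no active RPCs
		}),
	)
}
```

The Timeout and PermitWithoutStream values here are just examples; only the 10s Time interval comes from the description above.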

Check all the applicable boxes

  • [ ] I updated the documentation accordingly.
  • [x] All new and existing tests passed.
  • [x] All commits are signed-off.

Screenshots

Note to reviewers

pingsutw · Oct 03 '23 03:10

cc @honnix, mind taking a look?

pingsutw · Oct 03 '23 03:10

Codecov Report

Attention: Patch coverage is 67.50000% with 52 lines in your changes missing coverage. Please review.

Project coverage is 59.32%. Comparing base (b35cc95) to head (6c43ad3). Report is 964 commits behind head on master.

:exclamation: Current head 6c43ad3 differs from pull request most recent head 2a55000

Please upload reports for the commit 2a55000 to get more accurate results.

Files Patch % Lines
...er/pkg/controller/nodes/task/k8s/plugin_manager.go 59.15% 24 Missing and 5 partials :warning:
...ler/pkg/controller/nodes/task/k8s/event_watcher.go 77.77% 12 Missing :warning:
...propeller/pkg/controller/nodes/task/transformer.go 61.11% 6 Missing and 1 partial :warning:
flytepropeller/pkg/controller/controller.go 0.00% 4 Missing :warning:
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #4157      +/-   ##
==========================================
+ Coverage   58.98%   59.32%   +0.34%     
==========================================
  Files         618      550      -68     
  Lines       52708    39699   -13009     
==========================================
- Hits        31088    23551    -7537     
+ Misses      19140    13828    -5312     
+ Partials     2480     2320     -160     
Flag Coverage Δ
unittests ?

Flags with carried forward coverage won't be shown.

:umbrella: View full report in Codecov by Sentry.

codecov[bot] · Oct 03 '23 03:10

Nice finding! I did not know there were no reasonable defaults for these. I guess we need the same thing for https://github.com/flyteorg/flytepropeller/blob/0fcc1da879333af1282ee1142651e191eb1d6bb4/pkg/controller/nodes/catalog/config.go#L44 and https://github.com/flyteorg/flyteidl/blob/85bfc8ff1c36b2d37a4fcc68a2be1911a9fc3940/clients/go/admin/client.go#L173?

honnix · Oct 03 '23 06:10

After reading the doc more in depth, I'm wondering whether this keepalive is intended for IP refreshing; it doesn't seem so to me. I think the problem is more likely related to the name resolver: the ping itself may not trigger a name-resolution refresh.

There is more info here:

  • https://grpc.io/docs/guides/custom-name-resolution/
  • https://learn.microsoft.com/en-us/aspnet/core/grpc/loadbalancing?view=aspnetcore-7.0#dns-address-caching (it's dotnet but should be general enough)

That being said, I think this configuration still makes sense; I'm just not sure it solves the problem. Internally we have a custom name resolver that goes together with client-side load balancing, so we have not seen this problem.
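
To illustrate the kind of setup I mean, here is a minimal sketch of client-side load balancing on top of the stock gRPC DNS resolver (not our internal resolver; the target address and port are placeholders):

```go
package main

import (
	"log"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

func main() {
	// "dns:///" selects gRPC's built-in DNS resolver, which returns one address per
	// A record; against a headless service that means one address per agent pod.
	// round_robin then spreads RPCs across all of those addresses. New pods only
	// show up when the resolver re-resolves, e.g. after a connection is lost.
	conn, err := grpc.Dial(
		"dns:///flyteagent.flyte.svc.cluster.local:8000", // placeholder headless-service address
		grpc.WithDefaultServiceConfig(`{"loadBalancingConfig": [{"round_robin":{}}]}`),
		grpc.WithTransportCredentials(insecure.NewCredentials()), // illustration only
	)
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()
}
```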

honnix · Oct 03 '23 06:10

It did solve the issue after adding this config. Do you use HPA internally? The problem is that I need to restart propeller whenever I scale up the agent; otherwise propeller won't send requests to the new pods.

pingsutw · Oct 03 '23 07:10

  • https://github.com/grpc/grpc/issues/12295#issuecomment-893618782
  • https://github.com/grpc/grpc-go/blob/9e1fc3e9c088387890f5a303b805d320f11cc6dd/examples/features/name_resolving/client/main.go#L92-L94

You're right, updating the resolver might be a better solution. Let me add one and test it (rough sketch of the shape below).
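
For reference, the linked grpc-go example boils down to implementing resolver.Builder and resolver.Resolver and pushing a fresh address list to the ClientConn. A stripped-down sketch of that shape, assuming a recent grpc-go where resolver.Target.Endpoint() is a method; the "flyteagent" scheme and the names here are placeholders, not necessarily what this PR ends up with:

```go
package agentresolver

import (
	"fmt"
	"net"

	"google.golang.org/grpc/resolver"
)

// agentResolverBuilder registers a placeholder "flyteagent" scheme so targets like
// "flyteagent:///host:port" are handled by agentResolver below.
type agentResolverBuilder struct{}

func (agentResolverBuilder) Scheme() string { return "flyteagent" }

func (agentResolverBuilder) Build(target resolver.Target, cc resolver.ClientConn, _ resolver.BuildOptions) (resolver.Resolver, error) {
	host, port, err := net.SplitHostPort(target.Endpoint())
	if err != nil {
		return nil, fmt.Errorf("expected host:port endpoint: %w", err)
	}
	r := &agentResolver{host: host, port: port, cc: cc}
	r.ResolveNow(resolver.ResolveNowOptions{}) // initial resolution
	return r, nil
}

// agentResolver looks up the headless service and pushes the full address list to
// gRPC. gRPC calls ResolveNow when connections fail; a production version would
// likely also refresh on a timer so scale-ups are noticed without waiting for errors.
type agentResolver struct {
	host, port string
	cc         resolver.ClientConn
}

func (r *agentResolver) ResolveNow(resolver.ResolveNowOptions) {
	ips, err := net.LookupHost(r.host) // each pod behind a headless service has its own A record
	if err != nil {
		r.cc.ReportError(err)
		return
	}
	addrs := make([]resolver.Address, 0, len(ips))
	for _, ip := range ips {
		addrs = append(addrs, resolver.Address{Addr: net.JoinHostPort(ip, r.port)})
	}
	r.cc.UpdateState(resolver.State{Addresses: addrs})
}

func (r *agentResolver) Close() {}

func init() { resolver.Register(agentResolverBuilder{}) }
```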

pingsutw · Oct 03 '23 07:10

> It did solve the issue after adding this config. Do you use HPA internally? The problem is that I need to restart propeller whenever I scale up the agent; otherwise propeller won't send requests to the new pods.

I actually can't explain how this regular ping would trigger name resolution, or why it would be different from normal traffic when it comes to name resolution. Without this configuration, if there is no normal traffic there is no resolution, and when traffic comes back it should work the same as sending a ping, triggering resolution if the DNS cache has timed out. Did you find any doc stating that a ping explicitly forces resolution?

Yes, we use HPA.
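
For completeness, the other knob that often comes up for this problem is on the server rather than the client: if the agent server sets MaxConnectionAge, connections get recycled periodically, and each reconnect forces the client to re-resolve and discover new pods. A hedged sketch of that, assuming grpc-go on the agent side; the durations are arbitrary examples:

```go
package agentserver

import (
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/keepalive"
)

// newServer shows only the connection-age options; a real agent server would
// register its services and set its other options here as well.
func newServer() *grpc.Server {
	return grpc.NewServer(
		grpc.KeepaliveParams(keepalive.ServerParameters{
			MaxConnectionAge:      30 * time.Second, // recycle connections so clients reconnect and re-resolve DNS
			MaxConnectionAgeGrace: 10 * time.Second, // let in-flight RPCs finish before the connection is closed
		}),
	)
}
```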

honnix · Oct 03 '23 09:10

Cleaning up stale PRs. Please reopen if you want to discuss this further.

eapolinario · Mar 03 '25 16:03