
[Feature]: Allow configuring keep-alive settings in the agent as gRPC client

Open yurishkuro opened this issue 2 years ago • 5 comments

Requirement

From: https://cloud-native.slack.com/archives/CGG7NFUJ3/p1659627880482739

question on using the jaeger-agent in combination with a collector that is hosted/run and load-balanced in k8s: I found the various jaeger-agent configuration settings for load-balanced collector endpoints, but from what I read (and from experiments), the endpoints that the agent chooses are selected at start and do not change afterwards. In our scenario, we use the jaeger-agent as a k8s daemonset (so, 1 instance per k8s worker node) that sends data to a collector deployed as a regular k8s daemonset that is auto-scaled to handle the volume increasing/decreasing over time. In our specific scenario we are using the opentelemetry collector, but I'm not sure that changes conceptually what I'm asking here.

The issue that we see is that since the jaeger-agents are very long-lived services (the worker nodes rarely come or go), choosing the collector endpoint/IP via the collector service hostname means that if the collector scales up, the scaled-up instances will not be used in practice. This is because of the long-lived TCP connections between the jaeger-agent and the chosen collector, whose IP(s) are resolved on start and used essentially forever. Thus, only if the jaeger-agent were to restart would we ever see traffic to a collector instance started (scaled up) after the agent had started. As a result, even if the collector scales up and new endpoint IPs become available behind the k8s service, only the old collector IPs continue to receive traffic and the scale-up has no effect.

Side note: of course, if a collector instance is removed, it would cause a re-balance as the IP becomes unreachable and the jaeger-agent needs to re-select the upstream IPs to talk to. But on collector scale-up, the old IPs remain valid.

FYI, could be interesting for others to know: the idea of a purely load-balanced solution is not that easy in k8s, as in most situations the pod IPs of the collector will be abstracted away behind a k8s service. Thus, the jaeger-agent will only see one IP, and periodically forcing a DNS re-resolution doesn't yield a change (we'll simply look up the very same IP again). However, the gRPC server can help here: for example, golang grpc (grpc-go), which is used by most tools including jaeger-agent and opentelemetry-collector, allows enforcing a maximum connection age. This can be configured for idle connections but also for any connection:

keepalive:
  server_parameters:
    max_connection_age: "<max-age>s"
    max_connection_age_grace: "<a few>s"

It's not overly clean, as the client is told to disconnect/reconnect periodically even if there is no need (i.e., even if no new IPs have been added to the service). But it's an effective solution to the problem if periodic reconnects are acceptable. The newly established connection will use a new TCP source port, and the k8s service load balancer will choose a new pod IP (at random). Thus, over time, the clients balance out their connections across the available collector pods.
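The `server_parameters` snippet above corresponds to grpc-go's `keepalive.ServerParameters`. A minimal sketch of a server enforcing a maximum connection age with grpc-go (the port and the durations are illustrative assumptions, not values from this issue):

```go
package main

import (
	"log"
	"net"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/keepalive"
)

func main() {
	lis, err := net.Listen("tcp", ":14250")
	if err != nil {
		log.Fatal(err)
	}
	// Force clients to reconnect periodically so the k8s service can
	// spread new connections across scaled-up backend pods.
	srv := grpc.NewServer(grpc.KeepaliveParams(keepalive.ServerParameters{
		MaxConnectionAge:      5 * time.Minute,  // "<max-age>s"
		MaxConnectionAgeGrace: 10 * time.Second, // "<a few>s"
	}))
	log.Fatal(srv.Serve(lis))
}
```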

Problem

We already support configuring keepalive on the server (e.g. --collector.grpc-server.max-connection-age), but it would be good to support the same in the agent as the client. If the server closes the connection, it manifests as an error to the client.
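For reference, the existing collector-side settings look roughly like this (the max-connection-age flag is named in this issue; the grace flag and the values are my assumption, so verify against your Jaeger version's flag list):

```shell
jaeger-collector \
  --collector.grpc-server.max-connection-age=5m \
  --collector.grpc-server.max-connection-age-grace=10s
```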

Proposal

There doesn't seem to be a client-side gRPC option to control keep-alive connection age, but maybe it is possible to close the connections manually.

Open questions

No response

yurishkuro avatar Aug 11 '22 02:08 yurishkuro

Hi, can I take this?

ajitalyana avatar Sep 05 '22 21:09 ajitalyana

Yes

yurishkuro avatar Sep 05 '22 22:09 yurishkuro

@yurishkuro I have gone through the slack discussion, and on the client side there is no connection-timeout support. Could you elaborate on the comment "maybe it is possible to close the connections manually"?

ajitalyana avatar Sep 06 '22 02:09 ajitalyana

Could you elaborate on the comment "may be possible to close the connections manually".

We could have logic in the agent that would try to reconnect to the server(s) every N seconds (N being configurable, e.g. 30s by default). I don't know exactly how to do that with gRPC; maybe via a custom load balancer, or by forcing a redial.

yurishkuro avatar Sep 06 '22 02:09 yurishkuro

@yurishkuro I am interested in working on it, can you guide me on where to start? I have gone through the slack discussion and also through the code base.

lakshkeswani avatar Sep 17 '22 19:09 lakshkeswani

Hi! Is this being worked on? May I pick this up?

vishal-chdhry avatar Nov 16 '22 14:11 vishal-chdhry

Please assign me.

yanyanran avatar Mar 04 '23 03:03 yanyanran

Hi @yurishkuro, is this issue still relevant, since jaeger-agent is deprecated as of v1.43? If it is, I would like to work on this.

james-ryans avatar Apr 06 '23 13:04 james-ryans

You're right, we can close it, not worth the investment at this point.

yurishkuro avatar Apr 06 '23 23:04 yurishkuro