Provide a new label for metrics regarding destination type

Open howardjohn opened this issue 1 year ago • 14 comments

I propose adding a new label to all the traffic metrics in ztunnel: traffic_type: waypoint|direct.

The motivation here is to allow querying for traffic without duplication when dealing with waypoints.

When I have a waypoint, I will get 3 timeseries: client-->waypoint, client-->server, waypoint-->server.

Logically, I know the first and last metrics are "to/from waypoint" traffic. However, there is no query to actually express that information, so I cannot filter it out. With the traffic_type we could identify this.
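To make the duplication concrete, here is a minimal Python sketch. The three samples, their byte counts, and the simplified label names are hypothetical; `traffic_type` is the label proposed in this issue, not something ztunnel emits today. The sketch assumes the two waypoint hops would carry `traffic_type="waypoint"` and the client-->server series would not.

```python
# Hypothetical samples for one 100-byte exchange routed through a waypoint.
# Label names are simplified for illustration; traffic_type is the proposed label.
series = [
    {"source": "client", "destination": "waypoint", "traffic_type": "waypoint", "bytes": 100},
    {"source": "client", "destination": "server", "bytes": 100},
    {"source": "waypoint", "destination": "server", "traffic_type": "waypoint", "bytes": 100},
]


def total_bytes(samples, exclude_waypoint_hops=False):
    """Sum bytes, optionally dropping the to/from-waypoint hops."""
    total = 0
    for sample in samples:
        if exclude_waypoint_hops and sample.get("traffic_type") == "waypoint":
            continue
        total += sample["bytes"]
    return total


naive = total_bytes(series)  # counts the same traffic three times
deduplicated = total_bytes(series, exclude_waypoint_hops=True)
```

Without the label there is nothing to filter on, so a naive sum over all three series triple-counts the traffic.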

On outbound this is trivial. It's a bit less clear whether we can 100% identify waypoint traffic on the inbound path.

howardjohn avatar Jun 10 '24 17:06 howardjohn

For Sandwich, we use find_inbound_upstream: Checks if the connection address is a workload that is part of the HBONE address's Waypoints.

The tricky part for inbound to-workload is figuring out whether the traffic was originally Service or Workload addressed. Can we make a good enough guess here for telemetry purposes? For example, if the inbound workload has both a wl-wp and a svc-wp, we can look at the source address + identity, see if it matches either of them, and give one of them priority.

The other option would be for the waypoint to add a header with the original HBONE target that it saw.
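The address-and-identity heuristic could look roughly like the following Python sketch. The `Peer` and `InboundWorkload` types, the function name, and the choice to give the workload waypoint priority are all hypothetical; this is not how ztunnel (which is written in Rust) models workloads, and the result is only a guess for telemetry.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class Peer:
    """Hypothetical stand-in for a connection source: address plus identity."""
    address: str
    identity: str


@dataclass
class InboundWorkload:
    """Hypothetical view of the destination workload's waypoints."""
    workload_waypoint: Optional[Peer] = None
    service_waypoints: List[Peer] = field(default_factory=list)


def guess_inbound_source(src: Peer, dest: InboundWorkload) -> str:
    """Best-effort guess; the priority order here is an arbitrary assumption."""
    if dest.workload_waypoint is not None and src == dest.workload_waypoint:
        return "workload-waypoint"
    if src in dest.service_waypoints:
        return "service-waypoint"
    return "direct"
```

Both the address and the identity must match, so a peer that shares only one of them with a waypoint is still treated as direct.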

stevenctl avatar Jun 10 '24 17:06 stevenctl

Dumb question: wouldn't (Istio) waypoints provide some header with a value of envoy?

keithmattix avatar Jun 10 '24 18:06 keithmattix

There are some headers, but I don't know of any that are guaranteed unless we add one.

howardjohn avatar Jun 10 '24 18:06 howardjohn

> When I have a waypoint, I will get 3 timeseries: client-->waypoint, client-->server, waypoint-->server.

Just to make sure I'm clear... client-->waypoint and waypoint-->server are reported by ztunnel (app="ztunnel", reporter="destination"), and client-->server is reported by the waypoint itself (reporter="waypoint"). Right?

I think for ztunnel, traffic_type="direct" would apply for non-waypoint traffic?

I think it would be useful to add traffic_type: waypoint|direct, as long as the waypoint identification is reliable, both when it is the source and dest workload. If there isn't a way to identify the inbound situation then it wouldn't be worth the overhead.

jshaughn avatar Sep 12 '24 19:09 jshaughn

maybe too verbose, but just an idea: traffic_type: waypoint_to|waypoint_from|direct

jshaughn avatar Sep 13 '24 13:09 jshaughn

The 'from waypoint' is the tricky part. 'To waypoint' is easy, and direct = !towaypoint && !fromwaypoint, but there is no reliable way to know if it's from a waypoint. We can apply some heuristics -- perhaps even good enough ones -- but no guarantees for sure.
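As a rough Python sketch of that rule, using the three-valued label suggested earlier in the thread: the function name is hypothetical, and `from_waypoint` is assumed to come from a best-effort heuristic rather than a reliable signal. Checking `to_waypoint` first is an arbitrary tie-break.

```python
def classify_traffic(to_waypoint: bool, from_waypoint: bool) -> str:
    """Map the two flags onto waypoint_to | waypoint_from | direct.

    from_waypoint is only a heuristic guess, so "direct" may be wrong
    whenever that guess misses.
    """
    if to_waypoint:
        return "waypoint_to"
    if from_waypoint:
        return "waypoint_from"
    # direct = !towaypoint && !fromwaypoint
    return "direct"
```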

howardjohn avatar Sep 13 '24 15:09 howardjohn

The unfortunate effect is that currently, to try and identify these waypoint edges, a consumer like Kiali needs to combine waypoint config and telemetry. This is a slippery slope because config is current, but telem reflects the past. So it's much better if everything can be identified in the telemetry.

Just wanted to confirm my question above, with waypoints, the client-->server telem is only reported by waypoint, right? But waypoint reporting is only for request traffic, afaics. So, for a service using a waypoint that is handling app-level TCP traffic, there is no client-server TCP telem, the only telem is disconnected, client-->waypoint and waypoint-->server. Is that right?

jshaughn avatar Sep 13 '24 15:09 jshaughn

No, waypoint should report TCP as well if the service is TCP (same semantics as a sidecar)

howardjohn avatar Sep 13 '24 17:09 howardjohn

> No, waypoint should report TCP as well if the service is TCP (same semantics as a sidecar)

Hmmm, I'm not seeing it. I'll re-check...

jshaughn avatar Sep 13 '24 18:09 jshaughn

> Hmmm, I'm not seeing it. I'll re-check...

I think this is due to https://github.com/istio/istio/issues/53593

jshaughn avatar Nov 04 '24 20:11 jshaughn

This is still an observability pain-point.

jshaughn avatar Mar 27 '25 17:03 jshaughn

@jshaughn What's the specific pain point?

keithmattix avatar Mar 27 '25 18:03 keithmattix

Hi @keithmattix, the pain is still pretty much as described in the description and comments. From a Kiali perspective, we can't tell from the telemetry whether the traffic involves a waypoint. This is important because there is a bunch of waypoint-specific logic when visualizing ambient traffic. That means we have to actually pull the workloads from k8s, try to figure out which ones are waypoints, and then use that to try and determine whether the source or dest of a time-series is in fact a waypoint. This is hard and uses extra compute. But moreover, it's also weak, because the workload config is current k8s info, and telemetry can easily be in the past. So we do the comparison hoping the current config is valid for the past telem.

The problem seems to be that it's not straightforward to flag the from-waypoint scenario.
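The config-versus-telemetry mismatch described above can be illustrated with a small Python sketch. The workload names and the helper are hypothetical and this is not Kiali's actual code; it only assumes that waypoints are identified from the current cluster state while the series were recorded earlier.

```python
# Waypoints that the cluster reports right now (hypothetical names).
current_waypoints = {"waypoint-b"}

# Series recorded in the past; "waypoint-a" was a waypoint then but has since been deleted.
past_series = [
    {"source_workload": "waypoint-a", "destination_workload": "server"},
    {"source_workload": "waypoint-b", "destination_workload": "server"},
    {"source_workload": "client", "destination_workload": "server"},
]


def involves_waypoint(sample, waypoints):
    """True if either end of the series is a currently known waypoint."""
    return (
        sample["source_workload"] in waypoints
        or sample["destination_workload"] in waypoints
    )


flags = [involves_waypoint(s, current_waypoints) for s in past_series]
```

The first series is misclassified as non-waypoint traffic because the config lookup no longer knows about `waypoint-a`; a label recorded in the telemetry itself would not have this problem.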

jshaughn avatar Mar 27 '25 18:03 jshaughn

2 ideas come to mind:

  1. Add a header like x-istio-telemetry-hbone-source=ztunnel|waypoint|waypoint-sandwich|sidecar|gw|gw-sandwich|proxyless|whatever to the HBONE connection and report it for telemetry purposes.
  2. Repurpose some form of the old logic that was used to allow "from my waypoint" to bypass normal authorization policy enforcement, and use it to infer whether traffic came from a waypoint. Gotta be careful that our guess about "waypoint or not" doesn't escape back into other logic though.
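A minimal Python sketch of how a receiver might consume the header from the first idea. The header is only proposed in this thread and does not exist in Istio today; the function name, the `unknown` fallback, and the mapping onto `waypoint_from`/`direct` are assumptions for illustration.

```python
# Proposed in this thread; not an existing Istio header.
HBONE_SOURCE_HEADER = "x-istio-telemetry-hbone-source"


def traffic_type_from_headers(headers: dict) -> str:
    """Derive a telemetry-only label from the hypothetical HBONE source header."""
    # HTTP header names are case-insensitive, so normalize before the lookup.
    normalized = {name.lower(): value for name, value in headers.items()}
    source = normalized.get(HBONE_SOURCE_HEADER)
    if source is None:
        # Sender did not set the header (older proxy, non-Istio client, ...).
        return "unknown"
    if source.startswith("waypoint"):
        return "waypoint_from"
    return "direct"
```

The result is meant for metric labels only; keeping it out of authorization decisions follows the caution in the second idea.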

ilrudie avatar Mar 27 '25 19:03 ilrudie