CFP: Add ability to infer cross-failure domain traffic through cilium/hubble metrics

Open olemarkus opened this issue 3 years ago • 1 comments

Cilium Feature Proposal

Is your feature request related to a problem?

Kubernetes and Cilium have the ability to route traffic based on topology, but it can be tricky to observe the actual effect of those settings. For those on private clouds with cross-zone cost, having this data would also be very useful.

Describe the feature you'd like

Aggregated metrics on cross-failure domain traffic.

(Optional) Describe your proposed solution

Kubernetes nodes all come with a set of topology labels, where topology.kubernetes.io/zone is perhaps the most interesting for this use case. If endpoint -> endpoint traffic could be aggregated up to node -> node traffic and then zone -> zone, that would be very interesting.

But also being able to see which endpoints cause cross-zone traffic would be interesting to determine the effect of topology aware routing.
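To make the requested aggregation concrete, here is a minimal sketch in Python of rolling node -> node traffic up to zone -> zone. It is not how Cilium or Hubble work internally; the node names, zone names, and flow tuples are all made-up sample data, and the two helper functions are hypothetical.

```python
from collections import Counter

# Hypothetical node -> zone mapping. In a cluster this would come from the
# topology.kubernetes.io/zone label on each node.
NODE_ZONE = {
    "node-a": "zone-1",
    "node-b": "zone-1",
    "node-c": "zone-2",
}

# Hypothetical flow records: (source node, destination node, bytes).
FLOWS = [
    ("node-a", "node-b", 100),
    ("node-a", "node-c", 250),
    ("node-c", "node-a", 50),
    ("node-b", "node-b", 10),
]


def zone_to_zone_bytes(flows, node_zone):
    """Sum bytes per (source zone, destination zone) pair."""
    totals = Counter()
    for src_node, dst_node, nbytes in flows:
        totals[(node_zone[src_node], node_zone[dst_node])] += nbytes
    return totals


def cross_zone_bytes(totals):
    """Sum bytes for pairs whose source and destination zones differ."""
    return sum(n for (src, dst), n in totals.items() if src != dst)


totals = zone_to_zone_bytes(FLOWS, NODE_ZONE)
for (src, dst), nbytes in sorted(totals.items()):
    print(f"{src} -> {dst}: {nbytes}")
print("cross-zone bytes:", cross_zone_bytes(totals))
```

Keeping the per-endpoint records around before summing is what would let you see which endpoints are responsible for the cross-zone share.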

Please complete this section if you have ideas / suggestions on how to implement the feature. We strongly recommend discussing your approach with Cilium committers before spending lots of time implementing a change.

olemarkus avatar Aug 30 '22 05:08 olemarkus

I'll see if this requires any additional work inside the hubble metrics sub-system, but I suspect this may be possible using metrics relabeling or by joining against metrics with kube-state-metrics. If it is, I'll try to provide an example here of how that can be achieved.

chancez avatar Aug 30 '22 15:08 chancez

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

github-actions[bot] avatar Nov 21 '22 02:11 github-actions[bot]

Still interested in this. @yurrriq you may also be interested in following this one.

olemarkus avatar Nov 21 '22 10:11 olemarkus

Ok, sorry for the long wait, I let this slip by. Here's an example:

You'll need to add the node labels you want to your kube-state-metrics --metric-labels-allowlist. For example, --metric-labels-allowlist=nodes=[topology.kubernetes.io/zone].

You also need the node label on the hubble/cilium metrics. Unfortunately, relabelings on our built-in ServiceMonitor are not yet exposed. The node label was previously configured by default, before I (@chancez) removed it in https://github.com/cilium/cilium/pull/21051. I'll make a PR to fix that. Until then, you can copy the built-in ServiceMonitor and add the required relabelings. Here's an example of what that looks like:

$ kubectl get servicemonitors -n kube-system hubble -o yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  annotations:
    meta.helm.sh/release-name: cilium
    meta.helm.sh/release-namespace: kube-system
  creationTimestamp: "2022-11-17T21:53:10Z"
  generation: 2
  labels:
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/part-of: cilium
  name: hubble
  namespace: kube-system
  resourceVersion: "298446"
  uid: 51bdc925-559d-41d4-9fc1-268693872125
spec:
  endpoints:
  - honorLabels: true
    interval: 15s
    path: /metrics
    port: hubble-metrics
    relabelings:
    - action: replace
      replacement: ${1}
      sourceLabels:
      - __meta_kubernetes_pod_node_name
      targetLabel: node
  namespaceSelector:
    matchNames:
    - kube-system
  selector:
    matchLabels:
      k8s-app: hubble

Once you've done that, the kube_node_labels metric will carry the topology.kubernetes.io/zone node label as a metric label named label_topology_kubernetes_io_zone, which we can "JOIN" against the hubble/cilium metric. Let's use hubble_flows_processed_total as the example metric.

hubble_flows_processed_total * on(node) group_left(label_topology_kubernetes_io_zone) kube_node_labels

This gets us the hubble_flows_processed_total with a zone label. So we can do something like this:

sum(hubble_flows_processed_total * on(node) group_left(label_topology_kubernetes_io_zone) kube_node_labels) by (verdict, label_topology_kubernetes_io_zone)

This gets us the flow verdicts by zone (for example, you may expect a lot of drops from one zone if you had connectivity issues between two AZs).
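If the on(node) group_left(...) join is unfamiliar, the following Python sketch emulates what the two queries compute on a handful of made-up samples. It is only an illustration of the matching semantics, not Prometheus code; the sample values, node names, and the helpers join_on_node and sum_by are hypothetical.

```python
from collections import defaultdict

ZONE = "label_topology_kubernetes_io_zone"

# Hypothetical samples of hubble_flows_processed_total: (labels, value).
flows = [
    ({"node": "node-a", "verdict": "FORWARDED"}, 120.0),
    ({"node": "node-a", "verdict": "DROPPED"}, 3.0),
    ({"node": "node-b", "verdict": "FORWARDED"}, 80.0),
    ({"node": "node-c", "verdict": "DROPPED"}, 40.0),
    ({"node": "node-d", "verdict": "FORWARDED"}, 7.0),  # no kube_node_labels series
]

# Hypothetical samples of kube_node_labels; the value is always 1.
node_labels = [
    ({"node": "node-a", ZONE: "zone-1"}, 1.0),
    ({"node": "node-b", ZONE: "zone-1"}, 1.0),
    ({"node": "node-c", ZONE: "zone-2"}, 1.0),
]


def join_on_node(left, right, copy_label):
    """Emulate `left * on(node) group_left(copy_label) right`."""
    right_by_node = {labels["node"]: (labels, value) for labels, value in right}
    joined = []
    for labels, value in left:
        match = right_by_node.get(labels["node"])
        if match is None:
            continue  # series without a match on the right side are dropped
        right_labels, right_value = match
        out = dict(labels)
        out[copy_label] = right_labels[copy_label]  # copy the zone label over
        joined.append((out, value * right_value))  # multiplying by 1 keeps the value
    return joined


def sum_by(samples, keys):
    """Emulate `sum(...) by (keys)`."""
    totals = defaultdict(float)
    for labels, value in samples:
        totals[tuple(labels[k] for k in keys)] += value
    return dict(totals)


joined = join_on_node(flows, node_labels, ZONE)
by_verdict_and_zone = sum_by(joined, ("verdict", ZONE))
for key, value in sorted(by_verdict_and_zone.items()):
    print(key, value)
```

Note that the zone attached this way is the zone of the node reporting the metric. Also, since hubble_flows_processed_total is a counter, you would typically wrap it in rate() before joining when graphing it.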

Hope that helps. If this solves your use case, feel free to close the issue. I'll work on fixing our ServiceMonitors to expose the relabelings configuration again so you can more easily add the node as a label to the hubble/cilium metrics.

chancez avatar Nov 21 '22 20:11 chancez

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

github-actions[bot] avatar Jan 21 '23 01:01 github-actions[bot]

This issue has not seen any activity since it was marked stale. Closing.

github-actions[bot] avatar Feb 05 '23 02:02 github-actions[bot]