CFP: Add ability to infer cross-failure domain traffic through cilium/hubble metrics
Cilium Feature Proposal
Is your feature request related to a problem?
Kubernetes and Cilium have the ability to route traffic based on topology, but it can be tricky to observe the actual effect of those settings. For those on private clouds with cross-zone costs, having this data would also be very useful.
Describe the feature you'd like
Aggregated metrics on cross-failure domain traffic.
(Optional) Describe your proposed solution
Kubernetes nodes all come with a set of topology labels, of which topology.kubernetes.io/zone is perhaps the most interesting for this use case. If endpoint -> endpoint traffic could be aggregated up to node -> node traffic and then to zone -> zone, that would be very interesting.
Being able to see which endpoints cause cross-zone traffic would also be useful for determining the effect of topology aware routing.
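To make the ask concrete, this is roughly the kind of query such a feature would enable; the source_zone and destination_zone labels are hypothetical and do not exist on hubble metrics today:

```
# Hypothetical: hubble metrics do not expose zone labels today.
# Cross-zone flow rate, broken down by zone pair.
sum(rate(hubble_flows_processed_total[5m])) by (source_zone, destination_zone)
```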
Please complete this section if you have ideas / suggestions on how to implement the feature. We strongly recommend discussing your approach with Cilium committers before spending lots of time implementing a change.
I'll see if this requires any additional work inside the hubble metrics sub-system, but I suspect this may be possible using metrics relabeling or by joining against metrics from kube-state-metrics. If it is, I'll try to provide an example here of how that can be achieved.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.
Still interested in this. @yurrriq you may also be interested in following this one.
Ok, sorry for the long wait, I let this slip by. Here's an example:
You'll need to add the node labels you want to your kube-state-metrics `--metric-labels-allowlist`. For example: `--metric-labels-allowlist=nodes=[topology.kubernetes.io/zone]`.
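If you run kube-state-metrics from a plain Deployment, that flag goes in the container args. A minimal sketch (the image tag is illustrative):

```yaml
# Excerpt from a kube-state-metrics Deployment; only the relevant bits shown.
containers:
- name: kube-state-metrics
  image: registry.k8s.io/kube-state-metrics/kube-state-metrics:v2.8.0
  args:
  # Copy the zone label from Node objects onto the kube_node_labels metric.
  - --metric-labels-allowlist=nodes=[topology.kubernetes.io/zone]
```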
You also need the node label on the hubble/cilium metrics. Unfortunately, relabelings on our built-in ServiceMonitor are not yet exposed; previously the node label was configured by default, before I (@chancez) removed it in https://github.com/cilium/cilium/pull/21051. I'll make a PR to fix that. Until then, you can copy the built-in ServiceMonitor and add the required relabelings:
Here's an example of what that looks like:
```yaml
# kubectl get servicemonitors -n kube-system hubble -o yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  annotations:
    meta.helm.sh/release-name: cilium
    meta.helm.sh/release-namespace: kube-system
  creationTimestamp: "2022-11-17T21:53:10Z"
  generation: 2
  labels:
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/part-of: cilium
  name: hubble
  namespace: kube-system
  resourceVersion: "298446"
  uid: 51bdc925-559d-41d4-9fc1-268693872125
spec:
  endpoints:
  - honorLabels: true
    interval: 15s
    path: /metrics
    port: hubble-metrics
    relabelings:
    - action: replace
      replacement: ${1}
      sourceLabels:
      - __meta_kubernetes_pod_node_name
      targetLabel: node
  namespaceSelector:
    matchNames:
    - kube-system
  selector:
    matchLabels:
      k8s-app: hubble
```
Once you've done that, `kube_node_labels` will have the topology.kubernetes.io/zone node label as a metric label, `label_topology_kubernetes_io_zone`, that we can "JOIN" against the hubble/cilium metrics.
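For illustration (the node name and zone value here are made up), the resulting series looks roughly like:

```
kube_node_labels{node="worker-1", label_topology_kubernetes_io_zone="us-east-1a"} 1
```

Let's use `hubble_flows_processed_total` as the example metric: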
```
hubble_flows_processed_total * on(node) group_left(label_topology_kubernetes_io_zone) kube_node_labels
```
This gets us `hubble_flows_processed_total` with a zone label, so we can do something like this:
```
sum(hubble_flows_processed_total * on(node) group_left(label_topology_kubernetes_io_zone) kube_node_labels) by (verdict, label_topology_kubernetes_io_zone)
```
This gets us the flow verdicts by zone (for example, you may expect a lot of drops from one zone if you had connectivity issues between two AZs).
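If you want rates rather than raw counters, the same join composes with `rate()`; a sketch along the same lines (the 5m window is arbitrary):

```
sum(
  rate(hubble_flows_processed_total[5m])
  * on(node) group_left(label_topology_kubernetes_io_zone) kube_node_labels
) by (verdict, label_topology_kubernetes_io_zone)
```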
Hope that helps. If this solves your use case, feel free to close the issue. I'll work on fixing our ServiceMonitors to expose the relabelings configuration again so you can more easily add the node as a label on the hubble/cilium metrics.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.
This issue has not seen any activity since it was marked stale. Closing.