AKS
[BUG] Exposition of port 9965 on cilium pods and service label selectors missing
Describe the bug
After enabling ACNS according to the docs https://learn.microsoft.com/en-us/azure/aks/advanced-network-observability-cli?tabs=cilium#visualization-using-byo-grafana, the goal is to visualize Hubble metrics in Grafana. Enabling ACNS successfully installs Cilium and its pods. You can fetch metrics from inside a pod by running kubectl exec -it <cilium-pod> -- /bin/bash and then (after installing curl or wget in the container) curl -X GET localhost:9965/metrics.
However, the Hubble metrics server port 9965 is not exposed by the cilium pod. The only port the pod exposes is 9962, which serves Cilium metrics only:
ports:
- containerPort: 9962
  hostPort: 9962
  name: prometheus
  protocol: TCP
Additionally, the network-observability service in the kube-system namespace has no label selector, so it has no endpoints and does not select any pods. This makes it impossible to build a ServiceMonitor that adds a scrape config to Prometheus (as described in the docs above). The network-observability service should have a label selector on k8s-app: cilium; see the following YAML snippet:
# this is a customer generated service that selects the pods by the selector field
apiVersion: v1
kind: Service
metadata:
  annotations:
    meta.helm.sh/release-name: aks-managed-kappie
  labels:
    k8s-app: hubble-workaround
  name: network-observability-workaround
  namespace: kube-system
spec:
  ports:
  - name: hubble
    port: 9965
    protocol: TCP
    targetPort: 9965
  - name: cilium
    port: 9962
    protocol: TCP
    targetPort: 9962
  type: ClusterIP
  selector: # missing selector
    k8s-app: cilium
To Reproduce
For steps to reproduce the behavior, see above.
❯ kubectl port-forward -n kube-system svc/network-observability 9965:9965
error: cannot attach to *v1.Service: invalid service 'network-observability': Service is defined without a selector.
Environment (please complete the following information):
- Kubernetes version: Client Version: v1.31.2, Kustomize Version: v5.4.2, Server Version: v1.31.2
Action required from @aritraghosh, @julia-yin, @AllenWen-at-Azure
Issue needing attention of @Azure/aks-leads
Running into the same issue.
Trying to work around it... I got the Cilium metrics scraped with a PodMonitor instead (using the named port prometheus), but because 9965 is not exposed, the Hubble metrics are not getting scraped :(
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: cilium-network-observability
  namespace: monitoring
  labels:
    app.kubernetes.io/part-of: cilium
spec:
  podMetricsEndpoints:
  - port: prometheus
    path: /metrics
    interval: 30s
  - targetPort: 9965
    path: /metrics
    interval: 30s
  selector:
    matchLabels:
      k8s-app: cilium
  namespaceSelector:
    matchNames:
    - kube-system
@chasewilson, @paulgmiller, @wedaly, @quantumn-a5, @tamilmani1989 would you be able to assist?
Hi @lukibahr , thanks for raising the issue.
port 9965
Even though 9965 is not specified as a containerPort in the cilium daemonset, Cilium still exposes metrics on this port.
$ kns kube-system
$ kubectl get cm cilium-config -oyaml | grep 9965
hubble-metrics-server: :9965
$ kubectl port-forward cilium-ngx7p 9965:9965 &
$ curl -s localhost:9965/metrics | grep hubble
Handling connection for 9965
# HELP hubble_drop_total Number of drops
# TYPE hubble_drop_total counter
hubble_drop_total{destination="",protocol="ICMPv6",reason="UNSUPPORTED_L3_PROTOCOL",source=""} 578
...
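To get a quick overview of which Hubble metric families the endpoint is actually serving, the curl output above can be summarized with a small filter. This is only a rough sketch: the function name hubble_metric_counts is made up for illustration, and it assumes the endpoint returns the standard Prometheus text format with metric names prefixed by hubble_, as in the output shown above.

```shell
# Hypothetical helper (not part of Cilium or AKS): reads Prometheus text-format
# metrics on stdin and prints "<metric name> <number of series>" for every
# metric whose name starts with "hubble_".
hubble_metric_counts() {
  grep -E '^hubble_' |        # keep sample lines only; "# HELP"/"# TYPE" lines are dropped
    sed -E 's/[{ ].*$//' |    # strip labels and value, leaving the bare metric name
    sort | uniq -c |          # count series per metric name
    awk '{print $2, $1}'      # reorder to "name count"
}
```

With the port-forward from above still running, it could be used as curl -s localhost:9965/metrics | hubble_metric_counts. Empty output would suggest that no hubble_ metrics are being served on that port.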
network-observability service
This service is created by AKS for the managed Prometheus offering and is not recommended for querying metrics from agents. I like the PodMonitor approach suggested by @niekcandaele.
@niekcandaele I am curious why the PodMonitor spec uses both port and targetPort. Won't the following spec work?
podMetricsEndpoints:
- port: 9965
  path: /metrics
  interval: 30s
This issue will now be closed because it hasn't had any activity for 7 days after being marked stale. @lukibahr feel free to comment again in the next 7 days to reopen it, or open a new issue after that time if you still have a question, issue, or suggestion.