hubble (cli) observe does not show traffic for a service - cilium v1.8.2 & hubble v0.6.1
Running the command hubble observe --service namespace/service_name does not show any traffic pertaining to the service.
bottom-left: hubble observe not working when using --service, but works fine with --pod
top-right: pod curling the service in a loop
bottom-right: useful info regarding the service and associated pods in the given namespace
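For readers without the screenshots, a rough sketch of the reproduction being shown (the pod, service, and namespace names are taken from later in this thread; the exact curl loop is an assumption):
$ kubectl -n server-error exec -it curl-5858f4ff79-4nt8q -- sh -c 'while true; do curl -s liveness-http.server-error.svc.cluster.local:8080; sleep 1; done'
$ hubble observe --service server-error/liveness-http   # no flows shown
$ hubble observe --pod server-error/liveness-http       # flows shown as expected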
@seaunderwater is this reproducible without hubble-relay as well, by using the CLI directly on a particular node?
@glibsm I was able to reproduce without hubble-relay by kubectl exec'ing directly into a cilium pod and running hubble observe.
top-left: kubectl exec into cilium pod; hubble observe not working when using --service, but works fine with --pod
bottom-left: useful info regarding the service and associated pods in the given namespace
right: pod curling the service to generate data
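The same comparison, run against the node-local Hubble instance instead of hubble-relay (the cilium pod name is just an example taken from elsewhere in this thread):
$ kubectl -n kube-system exec -it cilium-fmh8d -- hubble observe --service server-error/liveness-http
$ kubectl -n kube-system exec -it cilium-fmh8d -- hubble observe --pod server-error/liveness-http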
Looks like the affected flows are of Type: L7. We currently don't support service annotations for these types of events (and thus also cannot use --service to filter on them).
The underlying problem is that L7 events (such as the displayed http-request/http-response) are emitted after service translation (NAT) has already happened.
One option (which we have discussed in the past) is to do a reverse lookup, i.e. given a backend (e.g. server-error/liveness-http-84bd547f66-8srln), check what frontend (the service name) might be associated with it. The problem with that approach is that it will yield false positives in cases where someone is directly accessing a pod (without going through the frontend), or where a pod is the target of multiple service selectors. In either case we would report service names which were not part of the request.
Not sure this is easily fixable: we would have to somehow forward the NAT information from the BPF datapath to the proxy (which is emitting the http-request/http-response), or we would have to cross-correlate between events (e.g. combine the datapath event containing the service translation information with the L7 event), which we currently avoid.
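To see this split in your own capture, the type filter can help (assuming the hubble CLI in use supports the -t/--type filter):
$ hubble observe --pod server-error/curl -t l7      # http-request/http-response events, emitted after NAT
$ hubble observe --pod server-error/curl -t trace   # L3/L4 datapath events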
@gandro thank you for taking the time to respond. While your explanation regarding L7 traffic makes sense, I also cannot see any L3 traffic related to the service. For example, I cannot see any traffic for either of the following variants:
- hubble observe --service namespace/service_name (which defaults to L3 traffic)
- hubble observe --service namespace/service_name --protocol tcp
Any thoughts on this?
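Another way to narrow this down, assuming the hubble CLI in use has the IP filter flags: filtering on the service's cluster IP directly shows whether any pre-translation flow ever carries that IP at all.
$ hubble observe --to-ip <service-cluster-ip> --protocol tcp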
There is a second (and only remaining) case where I know service translation information is not available to Hubble, which is when Cilium performs socket-based load balancing, called HostReachableServices internally.
In that case, the service translation happens early, at the socket layer, before the packets reach the trace points where Hubble can collect them. Here too, all events happen after the service translation has already been performed.
You can check if HostReachableServices is enabled on your installation by checking the cilium status output:
$ kubectl exec -it -n kube-system cilium-fmh8d -- cilium status | grep KubeProxyReplacement
KubeProxyReplacement: Strict [eth0 (DR), eth1] [..., HostReachableServices (TCP, UDP)]
That should conceptually be fixable (at least on newer kernels) in Cilium's datapath, e.g. by emitting additional trace events for the socket layer. We have already discussed doing this in the past, but haven't gotten around to it.
I will open an issue for that, in case anybody wants to work on it.
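Equivalently, the HostReachableServices setting should also be visible in the cilium-config ConfigMap; the key name below is the Cilium 1.8 agent option and may differ in other versions:
$ kubectl -n kube-system get configmap cilium-config -o yaml | grep host-reachable
  enable-host-reachable-services: "true"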
Issue on the Cilium side for the HostReachableServices case: cilium/cilium#12954
In my case, the output is the following:
$ k exec -it cilium-zzdg8 -n kube-system -- cilium status | grep KubeProxyReplacement
KubeProxyReplacement: Partial [eth0 (DR)] [NodePort (SNAT, 30000-32767, XDP: DISABLED), HostPort, ExternalIPs, SessionAffinity]
Okay, that sounds like a different issue then!
Is it possible for you to share the L3/L4 flows emitted when you do a curl from a pod to the service, e.g. hubble observe -o json --pod server-error/curl? In addition, the output of cilium service list would also help us troubleshoot the issue further.
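For anyone following along, a minimal way to capture those flows into a file for sharing (assuming this hubble version supports --last):
$ hubble observe -o json --pod server-error/curl --last 100 > flows.json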
@gandro here is the information you requested.
- The last element of cilium service list matches the service IP of server-error/liveness-http (see the quick cross-check after this list)
$ k exec -it -n kube-system cilium-hjftq -- cilium service list
ID Frontend Service Type Backend
1 10.96.0.10:53 ClusterIP 1 => 10.244.2.253:53
2 => 10.244.1.110:53
2 10.96.0.10:9153 ClusterIP 1 => 10.244.2.253:9153
2 => 10.244.1.110:9153
3 10.111.80.144:80 ClusterIP 1 => 10.244.0.230:4245
4 10.98.123.96:80 ClusterIP 1 => 10.244.1.162:12000
5 10.96.0.1:443 ClusterIP 1 => 172.18.0.4:6443
6 10.97.237.54:8080 ClusterIP 1 => 10.244.2.209:8080
$ kg svc -n server-error -o wide
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE SELECTOR
liveness-http ClusterIP 10.97.237.54 <none> 8080/TCP 27h app=liveness-http
- Attached is the output of hubble observe -o json --pod server-error/curl after a single request from pod to service: flows.json.txt
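A quick cross-check using only the output above: the service's cluster IP does appear as frontend ID 6 in the service map.
$ kubectl exec -it -n kube-system cilium-hjftq -- cilium service list | grep 10.97.237.54
6    10.97.237.54:8080    ClusterIP      1 => 10.244.2.209:8080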
@seaunderwater Thank you!
It looks like the client pod (curl-5858f4ff79-4nt8q) seems to connect directly to the pod IP (10.244.2.209) after the DNS lookup returns. I don't see the service cluster IP (10.97.237.54) used at all. While this explains why Hubble cannot associate the service name (it performs the association based on the service cluster IP), I still don't understand why your curl pod would access the backend (i.e. the pod IP) directly instead of going through the service cluster IP frontend first.
Off the top of my head, I don't know of any other scenario outside of HostReachableServices in which that would happen. What does the service DNS lookup on liveness-http.server-error.svc.cluster.local resolve to, i.e. what IPs does it return?
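One way to answer that from Hubble itself, assuming the protocol filter accepts dns: the L7 DNS flows should contain the IPs returned by the lookup.
$ hubble observe --pod server-error/curl --protocol dns -o json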
@glibsm This is why I am confused: as you can see in the flows, the L7 query is using the service DNS name, yet the "destination" resolves to the pod IP.
DNS lookup resolves to the service cluster IP:
root@curl-5858f4ff79-4nt8q:/ ]$ nslookup liveness-http.server-error.svc.cluster.local
Server: 10.96.0.10
Address 1: 10.96.0.10 kube-dns.kube-system.svc.cluster.local
Name: liveness-http.server-error.svc.cluster.local
Address 1: 10.97.237.54 liveness-http.server-error.svc.cluster.local
Going back to the following:
$ k exec -it cilium-zzdg8 -n kube-system -- cilium status | grep KubeProxyReplacement
KubeProxyReplacement: Partial [eth0 (DR)] [NodePort (SNAT, 30000-32767, XDP: DISABLED), HostPort, ExternalIPs, SessionAffinity]
Does XDP: DISABLED have any kind of implication on hubble observe --service?
Does XDP: DISABLED have any kind of implication on hubble observe --service?
If XDP acceleration is enabled, it might affect the visible service translation for NodePort/LoadBalancer services, but in your case it's a ClusterIP service.
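For the per-feature breakdown rather than the one-line summary, the agent's verbose status can help (section names may vary between versions):
$ kubectl -n kube-system exec -it cilium-zzdg8 -- cilium status --verbose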
Would you mind posting the full output of cilium status, to make sure we are not missing anything?
Seeing the same problem
cilium status -n default
/¯¯\
/¯¯\__/¯¯\ Cilium: OK
\__/¯¯\__/ Operator: OK
/¯¯\__/¯¯\ Hubble: OK
\__/¯¯\__/ ClusterMesh: disabled
\__/
DaemonSet cilium Desired: 2, Ready: 2/2, Available: 2/2
Deployment cilium-operator Desired: 2, Ready: 2/2, Available: 2/2
Deployment hubble-relay Desired: 1, Ready: 1/1, Available: 1/1
Deployment hubble-ui Desired: 1, Ready: 1/1, Available: 1/1
Containers: hubble-ui Running: 1
cilium Running: 2
cilium-operator Running: 2
hubble-relay Running: 1
Cluster Pods: 22/62 managed by Cilium
Image versions cilium quay.io/cilium/cilium:v1.12.0-rc1: 2
cilium-operator quay.io/cilium/operator-generic:v1.12.0-rc1: 2
hubble-relay quay.io/cilium/hubble-relay:v1.12.0-rc1: 1
hubble-ui quay.io/cilium/hubble-ui-backend:v0.8.5@sha256:2bce50cf6c32719d072706f7ceccad654bfa907b2745a496da99610776fe31ed: 1
hubble-ui docker.io/envoyproxy/envoy:v1.18.2@sha256:e8b37c1d75787dd1e712ff389b0d37337dc8a174a63bed9c34ba73359dc67da7: 1
hubble-ui quay.io/cilium/hubble-ui:v0.8.5@sha256:4eaca1ec1741043cfba6066a165b3bf251590cf4ac66371c4f63fbed2224ebb4: 1
Can't persuade GitHub to let me upload a sysdump so for now I have put it here
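If GitHub keeps rejecting the upload, the dump can usually be regenerated and compressed locally; the command below assumes the standalone cilium-cli, which ships a sysdump subcommand:
$ cilium sysdump --output-filename cilium-sysdump-$(date +%Y%m%d)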