
hubble (cli) observe does not show traffic for a service - cilium v1.8.2 & hubble v0.6.1

Open · seaunderwater opened this issue on Aug 13 '20 · 14 comments

Running the command: hubble observe --service namespace/service_name does not show any traffic pertaining to the service.

Screenshot (hubble-observe):
- bottom-left: hubble observe not working when using --service, but works fine with --pod
- top-right: pod curling the service in a loop
- bottom-right: useful info regarding the service and the associated pods in the given namespace
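For readers without access to the screenshots, the reproduction boils down to roughly the following (a sketch; the service name, namespace, and port are taken from details posted later in this thread):

# from inside a client pod, generate traffic against the service in a loop
while true; do curl -s http://liveness-http.server-error.svc.cluster.local:8080/ >/dev/null; sleep 1; done

# observing by service shows nothing
hubble observe --service server-error/liveness-http

# observing by pod works
hubble observe --pod server-error/curl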

seaunderwater · Aug 13 '20

@seaunderwater is this reproducible without hubble-relay as well, i.e. by using the CLI directly on a particular node?

glibsm · Aug 13 '20

@glibsm I was able to reproduce it without hubble-relay by kubectl exec-ing directly into a cilium pod and running hubble observe there.

Screenshot (hubble-observe-node):
- top-left: kubectl exec into the cilium pod; hubble observe not working when using --service, but works fine with --pod
- bottom-left: useful info regarding the service and the associated pods in the given namespace
- right: pod curling the service to generate data

seaunderwater · Aug 13 '20

Looks like the affected flows are of Type: L7. We currently don't support service annotations for these types of events (and thus --service cannot be used to filter on them).

The underlying problem is that L7 events (such as the displayed http-request/http-response) are emitted after service translation (NAT) has already happened.

One option (which we have discussed in the past) is to do a reverse-lookup, i.e. given a backend (e.g. server-error/liveness-http-84bd547f66-8srln) check what frontend (the service name) might be associated with it. The problem with that approach is that it will yield false positives in cases where someone is directly accessing a pod (without going through the frontend), or where a pod is the target of multiple service selectors. In either case we would report service names which were not part of the request.

I'm not sure this is easily fixable: we would either have to somehow forward the NAT information from the BPF datapath to the proxy (which emits the http-request/http-response events), or we would have to cross-correlate between events (e.g. combine the datapath event containing the service translation information with the L7 event), which we currently avoid.
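Until this is addressed, a rough workaround (only a sketch, not a replacement for proper --service support) is to filter on a concrete backend pod or narrow down to the L7 protocol instead of using the service name:

# filter on one of the backend pods selected by the service
hubble observe --pod server-error/liveness-http-84bd547f66-8srln

# or restrict the output to L7 (HTTP) events only
hubble observe --pod server-error/liveness-http-84bd547f66-8srln --protocol http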

gandro · Aug 17 '20

@gandro thank you for taking the time to respond. While your explanation regarding L7 traffic makes sense, I also cannot see any L3 traffic related to a service.

For example, I cannot see any traffic with either of the following variants:

hubble observe --service namespace/service_name (defaults to L3 traffic)
hubble observe --service namespace/service_name --protocol tcp

Any thoughts on this?
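Another variant that could help narrow this down is filtering on the service's ClusterIP directly. This is only a sketch: IP-based filters such as --ip / --to-ip may not be available in older hubble CLI versions (check hubble observe --help first), and <cluster-ip> is a placeholder for the ClusterIP shown by kubectl get svc.

hubble observe --ip <cluster-ip>
hubble observe --to-ip <cluster-ip> --protocol tcp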

seaunderwater · Aug 24 '20

There is a second (and only remaining) case where I know service translation information is not available to Hubble, which is when Cilium performs socket-based load balancing, called HostReachableServices internally.

In that case, the service translation happens early, at the socket layer, before the packets reach the trace points where Hubble can collect them. Here too, all events are emitted after the service translation has already been performed.

You can check if HostReachableServices is enabled on your installation by checking the cilium status output:

$ kubectl exec -it -n kube-system cilium-fmh8d -- cilium status | grep KubeProxyReplacement
KubeProxyReplacement:   Strict      [eth0 (DR), eth1]       [..., HostReachableServices (TCP, UDP)]
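If the status line is ambiguous, the agent ConfigMap can also be inspected. This is a hedged alternative check: the exact key names (e.g. enable-host-reachable-services, kube-proxy-replacement) are an assumption and vary between Cilium versions.

$ kubectl -n kube-system get configmap cilium-config -o yaml | grep -iE 'host-reachable|kube-proxy-replacement'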

That should conceptually be fixable (at least on newer kernels) in Cilium's datapath, e.g. by emitting additional trace events for the socket layer. We have already discussed doing this in the past, but haven't gotten around to it.

I will open an issue for that, in case anybody wants to work on it.

gandro · Aug 24 '20

Issue on the Cilium side for the HostReachableServices case: cilium/cilium#12954

gandro · Aug 24 '20

In my case, the output is the following:

$ k exec -it cilium-zzdg8 -n kube-system -- cilium status | grep KubeProxyReplacement
KubeProxyReplacement:   Partial   [eth0 (DR)]   [NodePort (SNAT, 30000-32767, XDP: DISABLED), HostPort, ExternalIPs, SessionAffinity]

seaunderwater · Aug 24 '20

Okay, then that sounds like a different issue!

Is it possible for you to share the L3/L4 flows emitted when you do a curl from a pod to the service? E.g. hubble observe -o json --pod server-error/curl? In addition, the output of cilium service list would also help us troubleshoot the issue further.
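Concretely, something along these lines (a sketch; <cilium-pod> is a placeholder and the pod name will differ in your cluster):

# capture the flows for a single request as JSON
$ hubble observe -o json --pod server-error/curl > flows.json

# dump the service table from the cilium agent on the node running the client pod
$ kubectl exec -it -n kube-system <cilium-pod> -- cilium service list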

gandro · Aug 25 '20

@gandro here is the information you requested.

  1. The last entry in cilium service list matches the service IP of server-error/liveness-http:
$ k exec -it -n kube-system cilium-hjftq -- cilium service list
ID   Frontend            Service Type   Backend                   
1    10.96.0.10:53       ClusterIP      1 => 10.244.2.253:53      
                                        2 => 10.244.1.110:53      
2    10.96.0.10:9153     ClusterIP      1 => 10.244.2.253:9153    
                                        2 => 10.244.1.110:9153    
3    10.111.80.144:80    ClusterIP      1 => 10.244.0.230:4245    
4    10.98.123.96:80     ClusterIP      1 => 10.244.1.162:12000   
5    10.96.0.1:443       ClusterIP      1 => 172.18.0.4:6443      
6    10.97.237.54:8080   ClusterIP      1 => 10.244.2.209:8080    

$ kg svc -n server-error -o wide
NAME            TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)    AGE   SELECTOR
liveness-http   ClusterIP   10.97.237.54   <none>        8080/TCP   27h   app=liveness-http

  2. Attached is the output of hubble observe -o json --pod server-error/curl after a single request from the pod to the service: flows.json.txt

seaunderwater · Aug 29 '20

@seaunderwater Thank you!

It looks like the client pod (curl-5858f4ff79-4nt8q) connects directly to the pod IP (10.244.2.209) after the DNS lookup returns; I don't see the service cluster IP (10.97.237.54) used at all. While this explains why Hubble cannot associate the service name (the association is based on the service cluster IP), I still don't understand why your curl pod would access the backend pod IP directly instead of going through the service cluster IP frontend first.

Off the top of my head, I don't know of any scenario other than HostReachableServices in which that would happen. What does the DNS lookup for liveness-http.server-error.svc.cluster.local resolve to, i.e. which IPs does it return?
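One quick way to double-check which destination IPs actually appear in the attached capture (a sketch; the exact JSON field path is an assumption and depends on whether each entry is wrapped in a "flow" object by hubble-relay):

# count the distinct destination IPs seen in the captured flows
$ jq -r '.IP.destination // .flow.IP.destination' flows.json.txt | sort | uniq -c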

gandro · Aug 31 '20

@glibsm This is why I am confused: as you can see in the flows, the L7 query uses the service DNS name, yet the "destination" resolves to the pod IP.

DNS lookup resolves to the service cluster IP:

[ root@curl-5858f4ff79-4nt8q:/ ]$ nslookup liveness-http.server-error.svc.cluster.local
Server:    10.96.0.10
Address 1: 10.96.0.10 kube-dns.kube-system.svc.cluster.local

Name:      liveness-http.server-error.svc.cluster.local
Address 1: 10.97.237.54 liveness-http.server-error.svc.cluster.local

Going back to the following:

$ k exec -it cilium-zzdg8 -n kube-system -- cilium status | grep KubeProxyReplacement
KubeProxyReplacement:   Partial   [eth0 (DR)]   [NodePort (SNAT, 30000-32767, XDP: DISABLED), HostPort, ExternalIPs, SessionAffinity]

Does XDP: DISABLED have any kind of implication on hubble observe --service ?

seaunderwater · Sep 03 '20

Does XDP: DISABLED have any kind of implication on hubble observe --service ?

If XDP acceleration is enabled, it might affect the visible service translation for NodePort/LoadBalancer services, but in your case it's a ClusterIP service.

Would you mind posting the full output of cilium status, to make sure we are not missing anything?

gandro · Sep 07 '20

Seeing the same problem

cilium status -n default
    /¯¯\
 /¯¯\__/¯¯\    Cilium:         OK
 \__/¯¯\__/    Operator:       OK
 /¯¯\__/¯¯\    Hubble:         OK
 \__/¯¯\__/    ClusterMesh:    disabled
    \__/

DaemonSet         cilium             Desired: 2, Ready: 2/2, Available: 2/2
Deployment        cilium-operator    Desired: 2, Ready: 2/2, Available: 2/2
Deployment        hubble-relay       Desired: 1, Ready: 1/1, Available: 1/1
Deployment        hubble-ui          Desired: 1, Ready: 1/1, Available: 1/1
Containers:       hubble-ui          Running: 1
                  cilium             Running: 2
                  cilium-operator    Running: 2
                  hubble-relay       Running: 1
Cluster Pods:     22/62 managed by Cilium
Image versions    cilium             quay.io/cilium/cilium:v1.12.0-rc1: 2
                  cilium-operator    quay.io/cilium/operator-generic:v1.12.0-rc1: 2
                  hubble-relay       quay.io/cilium/hubble-relay:v1.12.0-rc1: 1
                  hubble-ui          quay.io/cilium/hubble-ui-backend:v0.8.5@sha256:2bce50cf6c32719d072706f7ceccad654bfa907b2745a496da99610776fe31ed: 1
                  hubble-ui          docker.io/envoyproxy/envoy:v1.18.2@sha256:e8b37c1d75787dd1e712ff389b0d37337dc8a174a63bed9c34ba73359dc67da7: 1
                  hubble-ui          quay.io/cilium/hubble-ui:v0.8.5@sha256:4eaca1ec1741043cfba6066a165b3bf251590cf4ac66371c4f63fbed2224ebb4: 1

lizrice · Apr 25 '22

Can't persuade GitHub to let me upload a sysdump, so for now I have put it here

lizrice · Apr 25 '22