
retina-agent pod fails to start HTTP server when running controller manager

arorainms opened this issue 1 year ago • 6 comments

Describe the bug

After running make helm-install-advanced-remote-context, the retina-agent pods fail to start the HTTP server and panic (see logs below).

To Reproduce

Steps to reproduce the behavior:

  1. Create a default AKS cluster with network monitoring enabled, following the documentation.
  2. Clone the retina Git repo and run make helm-install-advanced-remote-context against the cluster.
  3. The retina-operator pod comes up fine.
  4. The retina-agent pods on all nodes fail with the error below:
ts=2024-03-22T06:49:25.754Z level=panic caller=controllermanager/controllermanager.go:118 msg="Error running controller manager" goversion=go1.21.8 os=linux arch=amd64 numcores=2 hostname=aks-nodepool###redacted###003 podname=retina-agent-m2tll version=v0.0.1 apiserver=https://myakscluster-###redacted###.azmk8s.io:443 plugins=dropreason,packetforward,linuxutil,dns,packetparser error="failed to start HTTP server: context canceled" errorVerbose="context canceled\nfailed to start HTTP server\ngithub.com/microsoft/retina/pkg/server.(*Server).Start\n\t/go/src/github.com/microsoft/retina/pkg/server/server.go:91\ngithub.com/microsoft/retina/pkg/managers/servermanager.(*HTTPServer).Start\n\t/go/src/github.com/microsoft/retina/pkg/managers/servermanager/servermanager.go:43\ngithub.com/microsoft/retina/pkg/managers/controllermanager.(*Controller).Start.func2\n\t/go/src/github.com/microsoft/retina/pkg/managers/controllermanager/controllermanager.go:111\ngolang.org/x/sync/errgroup.(*Group).Go.func1\n\t/go/pkg/mod/golang.org/x/[email protected]/errgroup/errgroup.go:78\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1650"
panic: Error running controller manager [recovered]
        panic: Error running controller manager

goroutine 87 [running]:
github.com/microsoft/retina/pkg/telemetry.TrackPanic()
        /go/src/github.com/microsoft/retina/pkg/telemetry/telemetry.go:112 +0x209
panic({0x242fc60?, 0xc000d2e6e0?})
        /usr/local/go/src/runtime/panic.go:914 +0x21f
go.uber.org/zap/zapcore.CheckWriteAction.OnWrite(0x1?, 0x0?, {0x0?, 0x0?, 0xc000775540?})
        /go/pkg/mod/go.uber.org/[email protected]/zapcore/entry.go:196 +0x54
go.uber.org/zap/zapcore.(*CheckedEntry).Write(0xc0007d3380, {0xc000d40440, 0x1, 0x1})
        /go/pkg/mod/go.uber.org/[email protected]/zapcore/entry.go:262 +0x3ec
go.uber.org/zap.(*Logger).Panic(0xc000bb91c0?, {0x2b48afa?, 0x0?}, {0xc000d40440, 0x1, 0x1})
        /go/pkg/mod/go.uber.org/[email protected]/logger.go:284 +0x51
github.com/microsoft/retina/pkg/managers/controllermanager.(*Controller).Start(0xc000a637c0, {0x2f057d0?, 0xc000447f90?})
        /go/src/github.com/microsoft/retina/pkg/managers/controllermanager/controllermanager.go:118 +0x28c
created by main.main in goroutine 1
        /go/src/github.com/microsoft/retina/controller/main.go:286 +0x2825

Expected behavior

  1. Retina agent pods should be up and running.
  2. adv_network_observability metrics should be available in Grafana.

Platform (please complete the following information):

  • OS: Ubuntu
  • Kubernetes Version: 1.27.9
  • Host: AKS
  • Retina Version: latest, per the docs on retina.sh

Additional context: N/A.

arorainms avatar Mar 22 '24 07:03 arorainms

Can you confirm that the node is arm64 and that you are seeing a fork/exec error in the logs of the crashing pod:

> kubectl logs pod/retina-agent-x24st  -n kube-system -p | grep fork/exec
Defaulted container "retina" out of: retina, init-retina (init)
ts=2024-03-27T20:40:29.538Z level=panic caller=controllermanager/controllermanager.go:118 msg="Error running controller manager" goversion=go1.21.8 os=linux arch=arm64 numcores=4 hostname=aks-arm64-21082772-vmss000000 podname=retina-agent-x24st version=v0.0.1 apiserver=https://dnsprefix1-m92bhaec.hcp.eastus.azmk8s.io:443 plugins=dropreason,packetforward,linuxutil,dns,packetparser error="failed to reconcile plugin packetforward: fork/exec /bin/clang: no such file or directory" errorVerbose="fork/exec /bin/clang: no such file or directory\nfailed to reconcile plugin packetforward\ngithub.com/microsoft/retina/pkg/managers/pluginmanager.(*PluginManager).Start\n\t/go/src/github.com/microsoft/retina/pkg/managers/pluginmanager/pluginmanager.go:169\ngithub.com/microsoft/retina/pkg/managers/controllermanager.(*Controller).Start.func1\n\t/go/src/github.com/microsoft/retina/pkg/managers/controllermanager/controllermanager.go:108\ngolang.org/x/sync/errgroup.(*Group).Go.func1\n\t/go/pkg/mod/golang.org/x/[email protected]/errgroup/errgroup.go:78\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_arm64.s:1197"
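
If helpful, a quick way to list each node's architecture (a minimal sketch using standard kubectl fields):

# Show each node with its reported CPU architecture (amd64 vs arm64)
kubectl get nodes -o custom-columns=NAME:.metadata.name,ARCH:.status.nodeInfo.architecture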

jimassa avatar Mar 27 '24 20:03 jimassa

I have a similar problem. I set enablePodLevel to true; my config.yaml:

apiServer:
  host: 0.0.0.0
  port: 10093
logLevel: debug
enabledPlugin: [dropreason,packetforward,linuxutil,dns,packetparser]
metricsInterval: 10
enableTelemetry: false
enablePodLevel: true
remoteContext: false
enableAnnotations: false

The retina-agent pod is in Error status:

# k -n kube-system get pods retina-agent-tl7kb -owide
NAME                 READY   STATUS   RESTARTS      AGE   IP            NODE          NOMINATED NODE   READINESS GATES
retina-agent-tl7kb   0/1     Error    1 (19s ago)   56s   10.165.6.24   10.165.6.24   <none>           <none>

# k -n kube-system logs -p retina-agent-tl7kb retina
......
ts=2024-03-28T01:56:24.074Z level=debug caller=dns/dns_linux.go:103 msg="Event received" goversion=go1.21.8 os=linux arch=amd64 numcores=48 hostname=node4 podname=retina-agent-tl7kb version=v0.0.2 apiserver=https://10.68.0.1:443 plugins=dropreason,packetforward,linuxutil,dns,packetparser event= eventError="unsupported value type"
ts=2024-03-28T01:56:24.074Z level=debug caller=enricher/enricher.go:133 msg="enriched flow" goversion=go1.21.8 os=linux arch=amd64 numcores=48 hostname=node4 podname=retina-agent-tl7kb version=v0.0.2 apiserver=https://10.68.0.1:443 plugins=dropreason,packetforward,linuxutil,dns,packetparser flow="time:{seconds:1711590984 nanos:74268601} verdict:16 IP:{source:\"10.165.115.72\" destination:\"10.165.119.214\" ipVersion:IPv4} source:{namespace:\"kube-system\" labels:\"k8s-app=kube-dns\" labels:\"pod-template-hash=865bc4c777\" pod_name:\"coredns-865bc4c777-6s8gb\" workloads:{name:\"coredns-865bc4c777\" kind:\"ReplicaSet\"}} Type:L7 l7:{type:RESPONSE dns:{query:\"metcd-2.metcd-headless.cluster.local.\" rcode:3 qtypes:\"A\"}} event_type:{type:129} traffic_direction:INGRESS trace_observation_point:FROM_NETWORK is_reply:{value:true} extensions:{[type.googleapis.com/utils.RetinaMetadata]:{dns_type:RESPONSE}}"
ts=2024-03-28T01:56:24.074Z level=info caller=server/server.go:71 msg="HTTP server stopped with err: http: Server closed" goversion=go1.21.8 os=linux arch=amd64 numcores=48 hostname=node4 podname=retina-agent-tl7kb version=v0.0.2 apiserver=https://10.68.0.1:443 plugins=dropreason,packetforward,linuxutil,dns,packetparser
ts=2024-03-28T01:56:24.074Z level=debug caller=dns/dns_linux.go:103 msg="Event received" goversion=go1.21.8 os=linux arch=amd64 numcores=48 hostname=node4 podname=retina-agent-tl7kb version=v0.0.2 apiserver=https://10.68.0.1:443 plugins=dropreason,packetforward,linuxutil,dns,packetparser event= eventError="unsupported value type"
ts=2024-03-28T01:56:24.074Z level=debug caller=dns/dns_linux.go:103 msg="Event received" goversion=go1.21.8 os=linux arch=amd64 numcores=48 hostname=node4 podname=retina-agent-tl7kb version=v0.0.2 apiserver=https://10.68.0.1:443 plugins=dropreason,packetforward,linuxutil,dns,packetparser event= eventError="unsupported value type"
ts=2024-03-28T01:56:24.074Z level=debug caller=enricher/enricher.go:88 msg="Enriching flow" goversion=go1.21.8 os=linux arch=amd64 numcores=48 hostname=node4 podname=retina-agent-tl7kb version=v0.0.2 apiserver=https://10.68.0.1:443 plugins=dropreason,packetforward,linuxutil,dns,packetparser flow="time:{seconds:1711590984 nanos:74283377} verdict:16 IP:{source:\"10.165.115.72\" destination:\"10.165.119.214\" ipVersion:IPv4} Type:L7 l7:{type:RESPONSE dns:{query:\"metcd-2.metcd-headless.cluster.local.\" rcode:3 qtypes:\"A\"}} event_type:{type:129} traffic_direction:EGRESS trace_observation_point:TO_NETWORK is_reply:{value:true} extensions:{[type.googleapis.com/utils.RetinaMetadata]:{dns_type:RESPONSE}}"
ts=2024-03-28T01:56:24.074Z level=debug caller=cache/cache.go:120 msg="pod found for IP" goversion=go1.21.8 os=linux arch=amd64 numcores=48 hostname=node4 podname=retina-agent-tl7kb version=v0.0.2 apiserver=https://10.68.0.1:443 plugins=dropreason,packetforward,linuxutil,dns,packetparser ip=10.165.115.72 pod=kube-system/coredns-865bc4c777-6s8gb
ts=2024-03-28T01:56:24.074Z level=debug caller=cache/cache.go:155 msg="pod found for IP" goversion=go1.21.8 os=linux arch=amd64 numcores=48 hostname=node4 podname=retina-agent-tl7kb version=v0.0.2 apiserver=https://10.68.0.1:443 plugins=dropreason,packetforward,linuxutil,dns,packetparser ip=10.165.115.72 pod Name=kube-system/coredns-865bc4c777-6s8gb
ts=2024-03-28T01:56:24.074Z level=panic caller=controllermanager/controllermanager.go:119 msg="Error running controller manager" goversion=go1.21.8 os=linux arch=amd64 numcores=48 hostname=node4 podname=retina-agent-tl7kb version=v0.0.2 apiserver=https://10.68.0.1:443 plugins=dropreason,packetforward,linuxutil,dns,packetparser error="failed to start plugin manager, plugin exited: failed to start plugin packetparser: interface eth0 of type device not found" errorVerbose="interface eth0 of type device not found\nfailed to start plugin packetparser\ngithub.com/microsoft/retina/pkg/managers/pluginmanager.(*PluginManager).Start.func1\n\t/go/src/github.com/microsoft/retina/pkg/managers/pluginmanager/pluginmanager.go:174\ngolang.org/x/sync/errgroup.(*Group).Go.func1\n\t/go/pkg/mod/golang.org/x/[email protected]/errgroup/errgroup.go:78\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1650\nfailed to start plugin manager, plugin exited\ngithub.com/microsoft/retina/pkg/managers/pluginmanager.(*PluginManager).Start\n\t/go/src/github.com/microsoft/retina/pkg/managers/pluginmanager/pluginmanager.go:186\ngithub.com/microsoft/retina/pkg/managers/controllermanager.(*Controller).Start.func1\n\t/go/src/github.com/microsoft/retina/pkg/managers/controllermanager/controllermanager.go:109\ngolang.org/x/sync/errgroup.(*Group).Go.func1\n\t/go/pkg/mod/golang.org/x/[email protected]/errgroup/errgroup.go:78\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1650"
ts=2024-03-28T01:56:24.074Z level=debug caller=cache/cache.go:114 msg="pod not found for IP" goversion=go1.21.8 os=linux arch=amd64 numcores=48 hostname=node4 podname=retina-agent-tl7kb version=v0.0.2 apiserver=https://10.68.0.1:443 plugins=dropreason,packetforward,linuxutil,dns,packetparser ip=10.165.119.214
panic: Error running controller manager

goroutine 681 [running]:
go.uber.org/zap/zapcore.CheckWriteAction.OnWrite(0x1?, 0x90?, {0x0?, 0x0?, 0xc005fe8940?})
	/go/pkg/mod/go.uber.org/[email protected]/zapcore/entry.go:196 +0x54
go.uber.org/zap/zapcore.(*CheckedEntry).Write(0xc00299f040, {0xc0066235c0, 0x1, 0x1})
	/go/pkg/mod/go.uber.org/[email protected]/zapcore/entry.go:262 +0x3ec
go.uber.org/zap.(*Logger).Panic(0xc000c13d00?, {0x2b4d495?, 0x8fb940?}, {0xc0066235c0, 0x1, 0x1})
	/go/pkg/mod/go.uber.org/[email protected]/logger.go:284 +0x51
github.com/microsoft/retina/pkg/managers/controllermanager.(*Controller).Start(0xc0006cf9a0, {0x2f0a790?, 0xc0008962d0?})
	/go/src/github.com/microsoft/retina/pkg/managers/controllermanager/controllermanager.go:119 +0x28c
created by main.main in goroutine 1
	/go/src/github.com/microsoft/retina/controller/main.go:290 +0x28d0

Installation mode: make helm-install-with-operator. The retina-agent DaemonSet YAML:

kind: DaemonSet
apiVersion: apps/v1
metadata:
  name: retina-agent
  namespace: kube-system
  labels:
    app.kubernetes.io/managed-by: Helm
    k8s-app: retina
  annotations:
    deprecated.daemonset.template.generation: '20'
    field.cattle.io/publicEndpoints: 'null'
    meta.helm.sh/release-name: retina
    meta.helm.sh/release-namespace: kube-system
spec:
  selector:
    matchLabels:
      app: retina
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: retina
        k8s-app: retina
      annotations:
        checksum/config: 48f843d88ced90f531a61ed0ee4f1e0f9bf256a47ac281655788542bf0f520fb
        kubesphere.io/restartedAt: '2024-03-27T09:42:56.531Z'
        prometheus.io/port: '10093'
        prometheus.io/scrape: 'true'
    spec:
      volumes:
        - name: bpf
          hostPath:
            path: /sys/fs/bpf
            type: ''
        - name: cgroup
          hostPath:
            path: /sys/fs/cgroup
            type: ''
        - name: config
          configMap:
            name: retina-config
            defaultMode: 420
        - name: debug
          hostPath:
            path: /sys/kernel/debug
            type: ''
        - name: tmp
          emptyDir: {}
        - name: trace
          hostPath:
            path: /sys/kernel/tracing
            type: ''
      initContainers:
        - name: init-retina
          image: '*****/retina-init:v0.0.2'
          resources: {}
          volumeMounts:
            - name: bpf
              mountPath: /sys/fs/bpf
              mountPropagation: Bidirectional
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: FallbackToLogsOnError
          imagePullPolicy: Always
          securityContext:
            privileged: true
      containers:
        - name: retina
          image: '***/retina-agent:v0.0.2'
          command:
            - /retina/controller
          args:
            - '--health-probe-bind-address=:18081'
            - '--metrics-bind-address=:18080'
            - '--config'
            - /retina/config/config.yaml
          ports:
            - hostPort: 10093
              containerPort: 10093
              protocol: TCP
          env:
            - name: POD_NAME
              valueFrom:
                fieldRef:
                  apiVersion: v1
                  fieldPath: metadata.name
            - name: NODE_NAME
              valueFrom:
                fieldRef:
                  apiVersion: v1
                  fieldPath: spec.nodeName
            - name: NODE_IP
              valueFrom:
                fieldRef:
                  apiVersion: v1
                  fieldPath: status.hostIP
          resources:
            limits:
              cpu: 500m
              memory: 300Mi
          volumeMounts:
            - name: bpf
              mountPath: /sys/fs/bpf
            - name: cgroup
              mountPath: /sys/fs/cgroup
            - name: config
              mountPath: /retina/config
            - name: debug
              mountPath: /sys/kernel/debug
            - name: tmp
              mountPath: /tmp
            - name: trace
              mountPath: /sys/kernel/tracing
          livenessProbe:
            httpGet:
              path: /metrics
              port: 10093
              scheme: HTTP
            initialDelaySeconds: 30
            timeoutSeconds: 1
            periodSeconds: 30
            successThreshold: 1
            failureThreshold: 3
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
          imagePullPolicy: Always
          securityContext:
            capabilities:
              add:
                - SYS_ADMIN
                - SYS_RESOURCE
                - NET_ADMIN
                - IPC_LOCK
            privileged: false
      restartPolicy: Always
      terminationGracePeriodSeconds: 90
      dnsPolicy: ClusterFirst
      nodeSelector:
        kubernetes.io/os: linux
      serviceAccountName: retina-agent
      serviceAccount: retina-agent
      hostNetwork: true
      securityContext: {}

Platform (please complete the following information):

  • OS: CentOS Linux 7 (Core)
  • Kernel Version: 5.4.207-1.el7.elrepo.x86_64
  • Kubernetes Version: 1.22.2
  • Host: self-hosted Kubernetes
  • Retina Version: v0.0.2

wenhuwang avatar Mar 28 '24 02:03 wenhuwang

A more precise error log follows:

ts=2024-04-02T07:37:40.580Z level=info caller=server/server.go:71 msg="HTTP server stopped with err: http: Server closed" goversion=go1.21.8 os=linux arch=amd64 numcores=48 hostname=node4 podname=retina-agent-hvqtd version=v0.0.2 apiserver=https://10.68.0.1:443 plugins=dropreason,packetforward,linuxutil,dns,packetparser
ts=2024-04-02T07:37:40.580Z level=panic caller=controllermanager/controllermanager.go:119 msg="Error running controller manager" goversion=go1.21.8 os=linux arch=amd64 numcores=48 hostname=node4 podname=retina-agent-hvqtd version=v0.0.2 apiserver=https://10.68.0.1:443 plugins=dropreason,packetforward,linuxutil,dns,packetparser error="failed to start plugin manager, plugin exited: failed to start plugin packetparser: interface eth0 of type device not found" errorVerbose="interface eth0 of type device not found\nfailed to start plugin packetparser\ngithub.com/microsoft/retina/pkg/managers/pluginmanager.(*PluginManager).Start.func1\n\t/go/src/github.com/microsoft/retina/pkg/managers/pluginmanager/pluginmanager.go:174\ngolang.org/x/sync/errgroup.(*Group).Go.func1\n\t/go/pkg/mod/golang.org/x/[email protected]/errgroup/errgroup.go:78\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1650\nfailed to start plugin manager, plugin exited\ngithub.com/microsoft/retina/pkg/managers/pluginmanager.(*PluginManager).Start\n\t/go/src/github.com/microsoft/retina/pkg/managers/pluginmanager/pluginmanager.go:186\ngithub.com/microsoft/retina/pkg/managers/controllermanager.(*Controller).Start.func1\n\t/go/src/github.com/microsoft/retina/pkg/managers/controllermanager/controllermanager.go:109\ngolang.org/x/sync/errgroup.(*Group).Go.func1\n\t/go/pkg/mod/golang.org/x/[email protected]/errgroup/errgroup.go:78\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1650"
panic: Error running controller manager

goroutine 679 [running]:
go.uber.org/zap/zapcore.CheckWriteAction.OnWrite(0x1?, 0x90?, {0x0?, 0x0?, 0xc004b90dc0?})
	/go/pkg/mod/go.uber.org/[email protected]/zapcore/entry.go:196 +0x54
go.uber.org/zap/zapcore.(*CheckedEntry).Write(0xc004b880d0, {0xc004fbc840, 0x1, 0x1})
	/go/pkg/mod/go.uber.org/[email protected]/zapcore/entry.go:262 +0x3ec
go.uber.org/zap.(*Logger).Panic(0xc0010a7d00?, {0x2b4d495?, 0x8fb940?}, {0xc004fbc840, 0x1, 0x1})
	/go/pkg/mod/go.uber.org/[email protected]/logger.go:284 +0x51
github.com/microsoft/retina/pkg/managers/controllermanager.(*Controller).Start(0xc000742cd0, {0x2f0a790?, 0xc000f16d70?})
	/go/src/github.com/microsoft/retina/pkg/managers/controllermanager/controllermanager.go:119 +0x28c
created by main.main in goroutine 1
	/go/src/github.com/microsoft/retina/controller/main.go:290 +0x28d0

The error originates at https://github.com/microsoft/retina/blob/main/pkg/plugin/packetparser/packetparser_linux.go#L212: retina-agent panics if the node does not have an eth0 device. I think we need to rethink the implementation here.
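
As a quick check on an affected node (a sketch; interface names vary by distro and CNI):

# Show all links in brief form; on many hosts the primary NIC is ens*/enp*/bond0 rather than eth0
ip -br link show
# Show which interface carries the default route; resolving this instead of hardcoding eth0 would avoid the panic
ip route show default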

wenhuwang avatar Apr 02 '24 07:04 wenhuwang

Related: https://github.com/microsoft/retina/issues/111

anubhabMajumdar avatar Apr 03 '24 13:04 anubhabMajumdar

More context on the issue, from my side:

  1. Host port 10093 in the retina-agent deployment spec appears to be a hard dependency for the controller manager to start.
  2. When I changed it to another port (10093 was already taken by another deployment, kappie), the controller manager panicked and would not start.
  3. After deleting the kappie deployment and switching retina-agent back to port 10093, retina-agent worked fine. This should now be good to close.

Long-term fix: remove the hard dependency on port 10093 in the deployment spec so the agent can start on any port. @jimassa, thoughts?
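
For anyone hitting this, a rough way to find which pod already declares host port 10093 (assumes jq is installed; this only catches declared hostPorts, not arbitrary hostNetwork listeners):

# List namespace/name of every pod whose containers declare hostPort 10093
kubectl get pods -A -o json | jq -r '.items[] | select(any(.spec.containers[].ports[]?; .hostPort == 10093)) | .metadata.namespace + "/" + .metadata.name'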

arorainms avatar Apr 04 '24 16:04 arorainms

> After deleting the kappie deployment and switching retina-agent back to port 10093, retina-agent worked fine. This should now be good to close.

@arorainms after reading your last comment, I have just noticed this in your original report:

> Create a default AKS cluster with network monitoring enabled

I think the issue here is actually a documentation problem. AKS Network Observability is Retina (our first-party, managed version, currently in preview under the internal codename "kappie"), so the port conflict you encountered came from accidentally running Retina twice.

If you're using AKS, you can simply enable the managed NetObs service. And if you want to deploy from OSS, you shouldn't enable AKS NetObs.
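
For example, before installing the OSS chart you can check whether the managed agent is already running (a rough filter; the managed pods' names may differ):

# Look for an existing managed Retina/kappie agent in kube-system
kubectl get pods -n kube-system -o wide | grep -iE 'kappie|retina'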

rbtr avatar Apr 05 '24 03:04 rbtr