retina-agent pod fails to start HTTP server when running controller manager
Describe the bug
Running make helm-install-advanced-remote-context leaves the retina-agent pods unable to start: the controller manager panics while starting its HTTP server.
To Reproduce
Steps to reproduce the behavior:
- Create a default AKS cluster with network monitoring enabled using the documentation
- Download the retina Git repo and run make helm-install-advanced-remote-context on the cluster.
- The Retina operator pod comes up fine.
- The Retina agent pods on all nodes fail with the error below:
ts=2024-03-22T06:49:25.754Z level=panic caller=controllermanager/controllermanager.go:118 msg="Error running controller manager" goversion=go1.21.8 os=linux arch=amd64 numcores=2 hostname=aks-nodepool###redacted###003 podname=retina-agent-m2tll version=v0.0.1 apiserver=https://myakscluster-###redacted###.azmk8s.io:443 plugins=dropreason,packetforward,linuxutil,dns,packetparser error="failed to start HTTP server: context canceled" errorVerbose="context canceled\nfailed to start HTTP server\ngithub.com/microsoft/retina/pkg/server.(*Server).Start\n\t/go/src/github.com/microsoft/retina/pkg/server/server.go:91\ngithub.com/microsoft/retina/pkg/managers/servermanager.(*HTTPServer).Start\n\t/go/src/github.com/microsoft/retina/pkg/managers/servermanager/servermanager.go:43\ngithub.com/microsoft/retina/pkg/managers/controllermanager.(*Controller).Start.func2\n\t/go/src/github.com/microsoft/retina/pkg/managers/controllermanager/controllermanager.go:111\ngolang.org/x/sync/errgroup.(*Group).Go.func1\n\t/go/pkg/mod/golang.org/x/[email protected]/errgroup/errgroup.go:78\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1650"
panic: Error running controller manager [recovered]
panic: Error running controller manager
goroutine 87 [running]:
github.com/microsoft/retina/pkg/telemetry.TrackPanic()
/go/src/github.com/microsoft/retina/pkg/telemetry/telemetry.go:112 +0x209
panic({0x242fc60?, 0xc000d2e6e0?})
/usr/local/go/src/runtime/panic.go:914 +0x21f
go.uber.org/zap/zapcore.CheckWriteAction.OnWrite(0x1?, 0x0?, {0x0?, 0x0?, 0xc000775540?})
/go/pkg/mod/go.uber.org/[email protected]/zapcore/entry.go:196 +0x54
go.uber.org/zap/zapcore.(*CheckedEntry).Write(0xc0007d3380, {0xc000d40440, 0x1, 0x1})
/go/pkg/mod/go.uber.org/[email protected]/zapcore/entry.go:262 +0x3ec
go.uber.org/zap.(*Logger).Panic(0xc000bb91c0?, {0x2b48afa?, 0x0?}, {0xc000d40440, 0x1, 0x1})
/go/pkg/mod/go.uber.org/[email protected]/logger.go:284 +0x51
github.com/microsoft/retina/pkg/managers/controllermanager.(*Controller).Start(0xc000a637c0, {0x2f057d0?, 0xc000447f90?})
/go/src/github.com/microsoft/retina/pkg/managers/controllermanager/controllermanager.go:118 +0x28c
created by main.main in goroutine 1
/go/src/github.com/microsoft/retina/controller/main.go:286 +0x2825
Expected behavior
- Retina agent pods should be up and running.
- adv_network_observability metrics should be available in Grafana
Platform (please complete the following information):
- OS: Ubuntu
- Kubernetes Version: 1.27.9
- Host: AKS
- Retina Version: latest, as per the docs on retina.sh
Additional context
N/A
Can you confirm that the issue is on arm64, and that the logs of the crashing pod contain a fork/exec error:
> kubectl logs pod/retina-agent-x24st -n kube-system -p | grep fork/exec
Defaulted container "retina" out of: retina, init-retina (init)
ts=2024-03-27T20:40:29.538Z level=panic caller=controllermanager/controllermanager.go:118 msg="Error running controller manager" goversion=go1.21.8 os=linux arch=arm64 numcores=4 hostname=aks-arm64-21082772-vmss000000 podname=retina-agent-x24st version=v0.0.1 apiserver=https://dnsprefix1-m92bhaec.hcp.eastus.azmk8s.io:443 plugins=dropreason,packetforward,linuxutil,dns,packetparser error="failed to reconcile plugin packetforward: fork/exec /bin/clang: no such file or directory" errorVerbose="fork/exec /bin/clang: no such file or directory\nfailed to reconcile plugin packetforward\ngithub.com/microsoft/retina/pkg/managers/pluginmanager.(*PluginManager).Start\n\t/go/src/github.com/microsoft/retina/pkg/managers/pluginmanager/pluginmanager.go:169\ngithub.com/microsoft/retina/pkg/managers/controllermanager.(*Controller).Start.func1\n\t/go/src/github.com/microsoft/retina/pkg/managers/controllermanager/controllermanager.go:108\ngolang.org/x/sync/errgroup.(*Group).Go.func1\n\t/go/pkg/mod/golang.org/x/[email protected]/errgroup/errgroup.go:78\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_arm64.s:1197"
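For reference, a quick way to list node architectures (a hedged sketch using the standard kubernetes.io/arch node label):
> kubectl get nodes -L kubernetes.io/arch
With mixed amd64/arm64 node pools, the fork/exec failure may show up only on the subset of retina-agent pods scheduled onto arm64 nodes.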
I have a similar problem. I set enablePodLevel to true; my config.yaml:
apiServer:
host: 0.0.0.0
port: 10093
logLevel: debug
enabledPlugin: [dropreason,packetforward,linuxutil,dns,packetparser]
metricsInterval: 10
enableTelemetry: false
enablePodLevel: true
remoteContext: false
enableAnnotations: false
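As a sanity check (hedged: assuming the chart renders this file into the retina-config ConfigMap that the DaemonSet below mounts), you can inspect the config the agent actually reads with:
> kubectl -n kube-system get configmap retina-config -o yaml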
The retina-agent pod status is Error:
# k -n kube-system get pods retina-agent-tl7kb -owide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
retina-agent-tl7kb 0/1 Error 1 (19s ago) 56s 10.165.6.24 10.165.6.24 <none> <none>
# k -n kube-system logs -p retina-agent-tl7kb retina
......
ts=2024-03-28T01:56:24.074Z level=debug caller=dns/dns_linux.go:103 msg="Event received" goversion=go1.21.8 os=linux arch=amd64 numcores=48 hostname=node4 podname=retina-agent-tl7kb version=v0.0.2 apiserver=https://10.68.0.1:443 plugins=dropreason,packetforward,linuxutil,dns,packetparser event= eventError="unsupported value type"
ts=2024-03-28T01:56:24.074Z level=debug caller=enricher/enricher.go:133 msg="enriched flow" goversion=go1.21.8 os=linux arch=amd64 numcores=48 hostname=node4 podname=retina-agent-tl7kb version=v0.0.2 apiserver=https://10.68.0.1:443 plugins=dropreason,packetforward,linuxutil,dns,packetparser flow="time:{seconds:1711590984 nanos:74268601} verdict:16 IP:{source:\"10.165.115.72\" destination:\"10.165.119.214\" ipVersion:IPv4} source:{namespace:\"kube-system\" labels:\"k8s-app=kube-dns\" labels:\"pod-template-hash=865bc4c777\" pod_name:\"coredns-865bc4c777-6s8gb\" workloads:{name:\"coredns-865bc4c777\" kind:\"ReplicaSet\"}} Type:L7 l7:{type:RESPONSE dns:{query:\"metcd-2.metcd-headless.cluster.local.\" rcode:3 qtypes:\"A\"}} event_type:{type:129} traffic_direction:INGRESS trace_observation_point:FROM_NETWORK is_reply:{value:true} extensions:{[type.googleapis.com/utils.RetinaMetadata]:{dns_type:RESPONSE}}"
ts=2024-03-28T01:56:24.074Z level=info caller=server/server.go:71 msg="HTTP server stopped with err: http: Server closed" goversion=go1.21.8 os=linux arch=amd64 numcores=48 hostname=node4 podname=retina-agent-tl7kb version=v0.0.2 apiserver=https://10.68.0.1:443 plugins=dropreason,packetforward,linuxutil,dns,packetparser
ts=2024-03-28T01:56:24.074Z level=debug caller=dns/dns_linux.go:103 msg="Event received" goversion=go1.21.8 os=linux arch=amd64 numcores=48 hostname=node4 podname=retina-agent-tl7kb version=v0.0.2 apiserver=https://10.68.0.1:443 plugins=dropreason,packetforward,linuxutil,dns,packetparser event= eventError="unsupported value type"
ts=2024-03-28T01:56:24.074Z level=debug caller=dns/dns_linux.go:103 msg="Event received" goversion=go1.21.8 os=linux arch=amd64 numcores=48 hostname=node4 podname=retina-agent-tl7kb version=v0.0.2 apiserver=https://10.68.0.1:443 plugins=dropreason,packetforward,linuxutil,dns,packetparser event= eventError="unsupported value type"
ts=2024-03-28T01:56:24.074Z level=debug caller=enricher/enricher.go:88 msg="Enriching flow" goversion=go1.21.8 os=linux arch=amd64 numcores=48 hostname=node4 podname=retina-agent-tl7kb version=v0.0.2 apiserver=https://10.68.0.1:443 plugins=dropreason,packetforward,linuxutil,dns,packetparser flow="time:{seconds:1711590984 nanos:74283377} verdict:16 IP:{source:\"10.165.115.72\" destination:\"10.165.119.214\" ipVersion:IPv4} Type:L7 l7:{type:RESPONSE dns:{query:\"metcd-2.metcd-headless.cluster.local.\" rcode:3 qtypes:\"A\"}} event_type:{type:129} traffic_direction:EGRESS trace_observation_point:TO_NETWORK is_reply:{value:true} extensions:{[type.googleapis.com/utils.RetinaMetadata]:{dns_type:RESPONSE}}"
ts=2024-03-28T01:56:24.074Z level=debug caller=cache/cache.go:120 msg="pod found for IP" goversion=go1.21.8 os=linux arch=amd64 numcores=48 hostname=node4 podname=retina-agent-tl7kb version=v0.0.2 apiserver=https://10.68.0.1:443 plugins=dropreason,packetforward,linuxutil,dns,packetparser ip=10.165.115.72 pod=kube-system/coredns-865bc4c777-6s8gb
ts=2024-03-28T01:56:24.074Z level=debug caller=cache/cache.go:155 msg="pod found for IP" goversion=go1.21.8 os=linux arch=amd64 numcores=48 hostname=node4 podname=retina-agent-tl7kb version=v0.0.2 apiserver=https://10.68.0.1:443 plugins=dropreason,packetforward,linuxutil,dns,packetparser ip=10.165.115.72 pod Name=kube-system/coredns-865bc4c777-6s8gb
ts=2024-03-28T01:56:24.074Z level=panic caller=controllermanager/controllermanager.go:119 msg="Error running controller manager" goversion=go1.21.8 os=linux arch=amd64 numcores=48 hostname=node4 podname=retina-agent-tl7kb version=v0.0.2 apiserver=https://10.68.0.1:443 plugins=dropreason,packetforward,linuxutil,dns,packetparser error="failed to start plugin manager, plugin exited: failed to start plugin packetparser: interface eth0 of type device not found" errorVerbose="interface eth0 of type device not found\nfailed to start plugin packetparser\ngithub.com/microsoft/retina/pkg/managers/pluginmanager.(*PluginManager).Start.func1\n\t/go/src/github.com/microsoft/retina/pkg/managers/pluginmanager/pluginmanager.go:174\ngolang.org/x/sync/errgroup.(*Group).Go.func1\n\t/go/pkg/mod/golang.org/x/[email protected]/errgroup/errgroup.go:78\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1650\nfailed to start plugin manager, plugin exited\ngithub.com/microsoft/retina/pkg/managers/pluginmanager.(*PluginManager).Start\n\t/go/src/github.com/microsoft/retina/pkg/managers/pluginmanager/pluginmanager.go:186\ngithub.com/microsoft/retina/pkg/managers/controllermanager.(*Controller).Start.func1\n\t/go/src/github.com/microsoft/retina/pkg/managers/controllermanager/controllermanager.go:109\ngolang.org/x/sync/errgroup.(*Group).Go.func1\n\t/go/pkg/mod/golang.org/x/[email protected]/errgroup/errgroup.go:78\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1650"
ts=2024-03-28T01:56:24.074Z level=debug caller=cache/cache.go:114 msg="pod not found for IP" goversion=go1.21.8 os=linux arch=amd64 numcores=48 hostname=node4 podname=retina-agent-tl7kb version=v0.0.2 apiserver=https://10.68.0.1:443 plugins=dropreason,packetforward,linuxutil,dns,packetparser ip=10.165.119.214
panic: Error running controller manager
goroutine 681 [running]:
go.uber.org/zap/zapcore.CheckWriteAction.OnWrite(0x1?, 0x90?, {0x0?, 0x0?, 0xc005fe8940?})
/go/pkg/mod/go.uber.org/[email protected]/zapcore/entry.go:196 +0x54
go.uber.org/zap/zapcore.(*CheckedEntry).Write(0xc00299f040, {0xc0066235c0, 0x1, 0x1})
/go/pkg/mod/go.uber.org/[email protected]/zapcore/entry.go:262 +0x3ec
go.uber.org/zap.(*Logger).Panic(0xc000c13d00?, {0x2b4d495?, 0x8fb940?}, {0xc0066235c0, 0x1, 0x1})
/go/pkg/mod/go.uber.org/[email protected]/logger.go:284 +0x51
github.com/microsoft/retina/pkg/managers/controllermanager.(*Controller).Start(0xc0006cf9a0, {0x2f0a790?, 0xc0008962d0?})
/go/src/github.com/microsoft/retina/pkg/managers/controllermanager/controllermanager.go:119 +0x28c
created by main.main in goroutine 1
/go/src/github.com/microsoft/retina/controller/main.go:290 +0x28d0
Installation mode: make helm-install-with-operator
retina-agent DaemonSet YAML:
kind: DaemonSet
apiVersion: apps/v1
metadata:
name: retina-agent
namespace: kube-system
labels:
app.kubernetes.io/managed-by: Helm
k8s-app: retina
annotations:
deprecated.daemonset.template.generation: '20'
field.cattle.io/publicEndpoints: 'null'
meta.helm.sh/release-name: retina
meta.helm.sh/release-namespace: kube-system
spec:
selector:
matchLabels:
app: retina
template:
metadata:
creationTimestamp: null
labels:
app: retina
k8s-app: retina
annotations:
checksum/config: 48f843d88ced90f531a61ed0ee4f1e0f9bf256a47ac281655788542bf0f520fb
kubesphere.io/restartedAt: '2024-03-27T09:42:56.531Z'
prometheus.io/port: '10093'
prometheus.io/scrape: 'true'
spec:
volumes:
- name: bpf
hostPath:
path: /sys/fs/bpf
type: ''
- name: cgroup
hostPath:
path: /sys/fs/cgroup
type: ''
- name: config
configMap:
name: retina-config
defaultMode: 420
- name: debug
hostPath:
path: /sys/kernel/debug
type: ''
- name: tmp
emptyDir: {}
- name: trace
hostPath:
path: /sys/kernel/tracing
type: ''
initContainers:
- name: init-retina
image: '*****/retina-init:v0.0.2'
resources: {}
volumeMounts:
- name: bpf
mountPath: /sys/fs/bpf
mountPropagation: Bidirectional
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: FallbackToLogsOnError
imagePullPolicy: Always
securityContext:
privileged: true
containers:
- name: retina
image: '***/retina-agent:v0.0.2'
command:
- /retina/controller
args:
- '--health-probe-bind-address=:18081'
- '--metrics-bind-address=:18080'
- '--config'
- /retina/config/config.yaml
ports:
- hostPort: 10093
containerPort: 10093
protocol: TCP
env:
- name: POD_NAME
valueFrom:
fieldRef:
apiVersion: v1
fieldPath: metadata.name
- name: NODE_NAME
valueFrom:
fieldRef:
apiVersion: v1
fieldPath: spec.nodeName
- name: NODE_IP
valueFrom:
fieldRef:
apiVersion: v1
fieldPath: status.hostIP
resources:
limits:
cpu: 500m
memory: 300Mi
volumeMounts:
- name: bpf
mountPath: /sys/fs/bpf
- name: cgroup
mountPath: /sys/fs/cgroup
- name: config
mountPath: /retina/config
- name: debug
mountPath: /sys/kernel/debug
- name: tmp
mountPath: /tmp
- name: trace
mountPath: /sys/kernel/tracing
livenessProbe:
httpGet:
path: /metrics
port: 10093
scheme: HTTP
initialDelaySeconds: 30
timeoutSeconds: 1
periodSeconds: 30
successThreshold: 1
failureThreshold: 3
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
imagePullPolicy: Always
securityContext:
capabilities:
add:
- SYS_ADMIN
- SYS_RESOURCE
- NET_ADMIN
- IPC_LOCK
privileged: false
restartPolicy: Always
terminationGracePeriodSeconds: 90
dnsPolicy: ClusterFirst
nodeSelector:
kubernetes.io/os: linux
serviceAccountName: retina-agent
serviceAccount: retina-agent
hostNetwork: true
securityContext: {}
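Note the liveness probe above: it hits /metrics on port 10093, the same hostPort the agent binds. A hedged way to verify the endpoint from a node (or any pod on the host network) is:
> curl -s http://localhost:10093/metrics | head
If something else already owns hostPort 10093 on the node, both the bind and the probe will fail.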
Platform (please complete the following information):
- OS: CentOS Linux 7 (Core)
- Kernel Version: 5.4.207-1.el7.elrepo.x86_64
- Kubernetes Version: 1.22.2
- Host: self-hosted Kubernetes
- Retina Version: v0.0.2
A more precise error log is below:
ts=2024-04-02T07:37:40.580Z level=info caller=server/server.go:71 msg="HTTP server stopped with err: http: Server closed" goversion=go1.21.8 os=linux arch=amd64 numcores=48 hostname=node4 podname=retina-agent-hvqtd version=v0.0.2 apiserver=https://10.68.0.1:443 plugins=dropreason,packetforward,linuxutil,dns,packetparser
ts=2024-04-02T07:37:40.580Z level=panic caller=controllermanager/controllermanager.go:119 msg="Error running controller manager" goversion=go1.21.8 os=linux arch=amd64 numcores=48 hostname=node4 podname=retina-agent-hvqtd version=v0.0.2 apiserver=https://10.68.0.1:443 plugins=dropreason,packetforward,linuxutil,dns,packetparser error="failed to start plugin manager, plugin exited: failed to start plugin packetparser: interface eth0 of type device not found" errorVerbose="interface eth0 of type device not found\nfailed to start plugin packetparser\ngithub.com/microsoft/retina/pkg/managers/pluginmanager.(*PluginManager).Start.func1\n\t/go/src/github.com/microsoft/retina/pkg/managers/pluginmanager/pluginmanager.go:174\ngolang.org/x/sync/errgroup.(*Group).Go.func1\n\t/go/pkg/mod/golang.org/x/[email protected]/errgroup/errgroup.go:78\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1650\nfailed to start plugin manager, plugin exited\ngithub.com/microsoft/retina/pkg/managers/pluginmanager.(*PluginManager).Start\n\t/go/src/github.com/microsoft/retina/pkg/managers/pluginmanager/pluginmanager.go:186\ngithub.com/microsoft/retina/pkg/managers/controllermanager.(*Controller).Start.func1\n\t/go/src/github.com/microsoft/retina/pkg/managers/controllermanager/controllermanager.go:109\ngolang.org/x/sync/errgroup.(*Group).Go.func1\n\t/go/pkg/mod/golang.org/x/[email protected]/errgroup/errgroup.go:78\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1650"
panic: Error running controller manager
goroutine 679 [running]:
go.uber.org/zap/zapcore.CheckWriteAction.OnWrite(0x1?, 0x90?, {0x0?, 0x0?, 0xc004b90dc0?})
/go/pkg/mod/go.uber.org/[email protected]/zapcore/entry.go:196 +0x54
go.uber.org/zap/zapcore.(*CheckedEntry).Write(0xc004b880d0, {0xc004fbc840, 0x1, 0x1})
/go/pkg/mod/go.uber.org/[email protected]/zapcore/entry.go:262 +0x3ec
go.uber.org/zap.(*Logger).Panic(0xc0010a7d00?, {0x2b4d495?, 0x8fb940?}, {0xc004fbc840, 0x1, 0x1})
/go/pkg/mod/go.uber.org/[email protected]/logger.go:284 +0x51
github.com/microsoft/retina/pkg/managers/controllermanager.(*Controller).Start(0xc000742cd0, {0x2f0a790?, 0xc000f16d70?})
/go/src/github.com/microsoft/retina/pkg/managers/controllermanager/controllermanager.go:119 +0x28c
created by main.main in goroutine 1
/go/src/github.com/microsoft/retina/controller/main.go:290 +0x28d0
The error comes from https://github.com/microsoft/retina/blob/main/pkg/plugin/packetparser/packetparser_linux.go#L212: retina-agent panics if the node it lands on has no eth0 device. I think we need to rethink how this is implemented.
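To confirm this on an affected node, a hedged sketch (interface names like ens192 or bond0 are common on non-cloud CentOS hosts, so eth0 is not guaranteed to exist):
> ip link show eth0
> ip route show default
The second command prints the interface that actually carries the default route, which is what a more robust implementation could discover instead of hardcoding eth0.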
Related: https://github.com/microsoft/retina/issues/111
More context for the issue, from my side:
- hostPort 10093 in the retina-agent deployment spec appears to be a hard dependency for the controller manager to start.
- When I changed it to a different port (since 10093 was taken by another deployment, kappie), the controller manager panicked and failed to start.
- After deleting the kappie deployment and switching retina-agent back to port 10093, retina-agent worked fine. This should now be good to close.
Long-term fix: remove the hard dependency on port 10093 from the deployment spec so the agent can start on any port; see the sketch below for spotting the conflict. @jimassa, thoughts?
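As a stopgap, a hedged way to find out which pod already claims hostPort 10093 before installing (assumes jq is available):
> kubectl get pods -A -o json | jq -r '.items[] | select(any(.spec.containers[]; any(.ports[]?; .hostPort == 10093))) | .metadata.namespace + "/" + .metadata.name'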
> After deleting the kappie deployment and switching retina-agent back to port 10093, retina-agent worked fine. This should now be good to close.
@arorainms after reading your last comment, I have just noticed this in your original report:
> Create a default AKS cluster with network monitoring enabled
I think the issue here is actually a documentation problem. AKS Network Observability is Retina (our first-party, managed version, currently in preview under the internal codename "kappie"). So I think the port conflict you encountered came from accidentally running Retina twice.
If you're using AKS, you can simply enable the managed NetObs service. And if you want to deploy from OSS, you shouldn't enable AKS NetObs.
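A hedged way to check for the managed agent before installing OSS Retina (the "kappie" name comes from this thread; the managed pods may be named differently in newer previews):
> kubectl get pods -n kube-system -o name | grep -i kappie
If this returns anything, the managed NetObs agent is already running, and an OSS install will collide with it on hostPort 10093.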