Wireguard pod-to-pod-encryption tests are being run despite the cluster reporting it as disabled, causing a segfault

MTRNord opened this issue 1 year ago · 2 comments

Bug report

The tests fail with the following segfault:

[=] Test [pod-to-pod-encryption] [38/63]
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x40 pc=0x2812648]

goroutine 29259 [running]:
github.com/cilium/cilium-cli/connectivity/tests.getFilter({0x4366ea0, 0xc000aaa080}, 0xc0024fe3c0, 0xc001ab1040, 0xc001ab10c0, 0xc001ab1000, 0xc001ab1080, 0x1, 0x0, 0x0)
	/cilium/connectivity/tests/encryption.go:171 +0x8e8
github.com/cilium/cilium-cli/connectivity/tests.testNoTrafficLeak({0x4366ea0?, 0xc000aaa080}, 0xc0024fe3c0, {0x434a4c8, 0xc0024df440}, 0xc00060fb70?, 0xc001ab1040, 0xc000665b90?, 0x22?, 0x0, ...)
	/cilium/connectivity/tests/encryption.go:381 +0x1dd
github.com/cilium/cilium-cli/connectivity/tests.(*podToPodEncryption).Run.func1(0x2ef1b00?)
	/cilium/connectivity/tests/encryption.go:263 +0x65
github.com/cilium/cilium-cli/connectivity/check.(*Test).ForEachIPFamily(0xc0024fe3c0, 0xc012ad3ce0)
	/cilium/connectivity/check/test.go:808 +0x28e
github.com/cilium/cilium-cli/connectivity/tests.(*podToPodEncryption).Run(0xc0024df440, {0x4366ea0?, 0xc000aaa080}, 0xc0024fe3c0)
	/cilium/connectivity/tests/encryption.go:262 +0x5da
github.com/cilium/cilium-cli/connectivity/check.(*Test).Run(0xc0024fe3c0, {0x4366ea0, 0xc000aaa080}, 0x1b6c225?)
	/cilium/connectivity/check/test.go:329 +0x5fb
github.com/cilium/cilium-cli/connectivity/check.(*ConnectivityTest).Run.func1()
	/cilium/connectivity/check/context.go:405 +0x8c
created by github.com/cilium/cilium-cli/connectivity/check.(*ConnectivityTest).Run in goroutine 52
	/cilium/connectivity/check/context.go:402 +0x266

This is "expected" as the precondition needed (aka the pod running which it tries to access) is not met. However it feels even then weird that this is a segfault rather than an error. Though it shouldnt have went into this in the first place.

General Information

  • Cilium CLI version (run cilium version)
cilium-cli: v0.15.20 compiled with go1.21.6 on linux/amd64
cilium image (default): v1.14.5
cilium image (stable): v1.14.6
cilium image (running): 1.14.6
  • Orchestration system version in use (e.g. kubectl version, ...)
Client Version: v1.28.5
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.28.2
  • Platform / infrastructure information (e.g. AWS / Azure / GCP, image / kernel versions)

Bare-metal kubeadm cluster. The control-plane node runs Gentoo with a 6.1.60 kernel; the two workers run NixOS on 6.7.1.

The control plane is x86 and the two workers are arm64. All three nodes are allowed to run pods.

  • Link to relevant artifacts (policies, deployments scripts, ...)

A lot of info is over at https://cilium.slack.com/archives/C1MATJ5U5/p1706192594540579

  • Generate and upload a system zip: cilium sysdump

(Hosted via Matrix since it's 2 MB larger than what GitHub allows here :( ) https://matrix.org/_matrix/media/v3/download/midnightthoughts.space/64ef2c6b31d3c8edab052443335f220439e64fb51750678141078077440

How to reproduce the issue

This is rather unclear. However, here are some known hints:

The Helm values deployed are:

---
bpf:
  hostLegacyRouting: false
  masquerade: true
cluster:
  # -- Name of the cluster. Only required for Cluster Mesh and mutual authentication with SPIRE.
  name: <redacted>
  # -- (int) Unique ID of the cluster. Must be unique across all connected
  # clusters and in the range of 1 to 255. Only required for Cluster Mesh,
  # may be 0 if Cluster Mesh is not used.
  id: 0
cni:
  customConf: false
  uninstall: false
ipam:
  operator:
    clusterPoolIPv4PodCIDRList:
      - 10.245.0.0/16
    clusterPoolIPv6PodCIDRList:
      - fd00::/104
operator:
  unmanagedPodWatcher:
    restart: true
  prometheus:
    enabled: true
    serviceMonitor:
      enabled: true
  dashboards:
    enabled: true

policyEnforcementMode: default

kubeProxyReplacement: "true"

routingMode: tunnel
tunnelProtocol: vxlan
#tunnelProtocol: geneve
tunnel: vxlan
tunnelPort: 8473
sessionAffinity: true
prometheus:
  enabled: true
  serviceMonitor:
    enabled: true
dashboards:
  enabled: true
hubble:
  relay:
    enabled: true
    prometheus:
      enabled: true
  ui:
    enabled: true
    metrics:
      enabled:
        - dns
        - tcp
        - httpV2
  metrics:
    enableOpenMetrics: true
    enabled:
      - dns:query;ignoreAAAA
      - drop
      - flow
      - flows-to-world
      - httpV2:exemplars=true;labelsContext=source_ip
#      - source_namespace
#      - source_workload
#      - destination_ip
#      - destination_namespace
#      - destination_workload
#      - traffic_direction
      - icmp
      - port-distribution
      - tcp
endpointStatus:
  enabled: true
  status: "policy"

nodePort:
  enabled: false

# Turn on after migration
l2announcements:
  enabled: true
k8sClientRateLimit:
  qps: 50
  burst: 100

k8sServiceHost: <redacted>
k8sServicePort: 6443

ipv6:
  enabled: true
rollOutCiliumPods: true

# Possibly broken
#enableIPv6Masquerade: false

#nat46x64Gateway:
#  enabled: true

The cluster at one point had WireGuard encryption between nodes enabled via Cilium, which didn't work and was therefore rolled back on the control plane. Since the worker nodes were locked out, I removed them the normal kubeadm way and then re-added them under the same node names.

The Slack thread led me to look at https://github.com/cilium/cilium-cli/blob/v0.15.20/connectivity/check/features.go#L185, which presumably is the precondition that needs to be met for the tests to run. All 3 nodes, however, return:

  "encryption": {
    "mode": "Disabled"
  },

when cilium status -o json is run in the respective pods.
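For illustration, here is a minimal Go sketch of the kind of precondition check that linked line presumably performs (this is not the actual features.go code; the type and function names are made up), deciding from the encryption mode reported by cilium status -o json whether the encryption tests should run:

package main

import (
	"encoding/json"
	"fmt"
)

// ciliumStatus mirrors only the "encryption" field shown above; the real
// status output contains many more fields.
type ciliumStatus struct {
	Encryption struct {
		Mode string `json:"mode"`
	} `json:"encryption"`
}

// encryptionEnabled reports whether the agent claims any encryption mode is active.
func encryptionEnabled(statusJSON []byte) (bool, error) {
	var s ciliumStatus
	if err := json.Unmarshal(statusJSON, &s); err != nil {
		return false, err
	}
	return s.Encryption.Mode != "" && s.Encryption.Mode != "Disabled", nil
}

func main() {
	// The output reported by all three nodes in this cluster.
	raw := []byte(`{"encryption": {"mode": "Disabled"}}`)
	enabled, err := encryptionEnabled(raw)
	fmt.Println(enabled, err) // false <nil> -> the pod-to-pod-encryption test should be skipped
}

With that reading, the test should have been skipped on this cluster, which is why it is surprising that it ran at all.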

This is the state I am at.

MTRNord · Jan 26, 2024

I believe I've figured out what's going on. The worker nodes have an ARM taint which the daemonset does not tolerate, so only one of the three pods is started. This leads to "serverHost" being an empty variable, which then presumably segfaults.

MTRNord · Jan 26, 2024

OK, I confirmed it: the segfault is caused by the daemonset not working nicely with the taints. I will leave this open, though, as I believe it should be a test failure rather than a segfault :)
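To make that concrete, here is a minimal Go sketch of the guard I would expect (not the actual cilium-cli code; the pod type and helper are hypothetical stand-ins): if the pod on the target node was never scheduled, return an error the test can report instead of dereferencing a nil pointer.

package main

import (
	"errors"
	"fmt"
)

// pod is a stand-in for the connectivity check's Pod type; a nil pointer
// models the case where the test daemonset never got a pod onto the target
// node, e.g. because of an untolerated taint.
type pod struct {
	address string
}

// serverAddress is a hypothetical helper: rather than dereferencing a possibly
// missing pod (the apparent cause of the SIGSEGV in getFilter), it returns an
// actionable error.
func serverAddress(server *pod) (string, error) {
	if server == nil {
		return "", errors.New("server pod not found; check that the test pods are scheduled on every node (taints/tolerations)")
	}
	return server.address, nil
}

func main() {
	if _, err := serverAddress(nil); err != nil {
		fmt.Println("test failure instead of a panic:", err)
	}
}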

MTRNord · Jan 26, 2024

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

github-actions[bot] · Sep 28, 2024

This issue has not seen any activity since it was marked stale. Closing.

github-actions[bot] · Oct 13, 2024

It looks like the issue still exists:

[=] [cilium-test-1] Test [node-to-node-encryption] [59/123]
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x2 addr=0x3c8 pc=0x1072e8a6c]

goroutine 35492 [running]:
github.com/cilium/cilium/cilium-cli/connectivity/check.Pod.Address({0x0, 0x0, {0x0, 0x0}, {0x0, 0x0}, 0x0, 0x0}, 0x1)
	github.com/cilium/[email protected]/cilium-cli/connectivity/check/peer.go:109 +0x3c
github.com/cilium/cilium/cilium-cli/connectivity/tests.getFilter({0x10a514278, 0x14000d8b840}, 0x1400132f8c0, 0x14001515d00, 0x14001515e80, 0x14001515ec0, 0x14001515ec0, 0x1, 0x1, 0x0)
	github.com/cilium/[email protected]/cilium-cli/connectivity/tests/encryption.go:171 +0x75c
github.com/cilium/cilium/cilium-cli/connectivity/tests.testNoTrafficLeak({0x10a514278, 0x14000d8b840}, 0x1400132f8c0, {0x10a4fea50, 0x14001369260}, 0x14001515d00, 0x14001515ec0, 0x14001515e80, 0x14001515ec0, 0x1, ...)
	github.com/cilium/[email protected]/cilium-cli/connectivity/tests/encryption.go:303 +0xb0
github.com/cilium/cilium/cilium-cli/connectivity/tests.(*nodeToNodeEncryption).Run.func1(0x1)
	github.com/cilium/[email protected]/cilium-cli/connectivity/tests/encryption.go:481 +0x8c
github.com/cilium/cilium/cilium-cli/connectivity/check.(*ConnectivityTest).ForEachIPFamily(0x14000a1ca88, 0x94?, 0x14000a2dc30)
	github.com/cilium/[email protected]/cilium-cli/connectivity/check/context.go:1326 +0x260
github.com/cilium/cilium/cilium-cli/connectivity/check.(*Test).ForEachIPFamily(0x1400132f8c0, 0x14000a2dc30)
	github.com/cilium/[email protected]/cilium-cli/connectivity/check/test.go:916 +0xac
github.com/cilium/cilium/cilium-cli/connectivity/tests.(*nodeToNodeEncryption).Run(0x14001369260, {0x10a514278, 0x14000d8b840}, 0x1400132f8c0)
	github.com/cilium/[email protected]/cilium-cli/connectivity/tests/encryption.go:472 +0x6fc
github.com/cilium/cilium/cilium-cli/connectivity/check.(*Test).Run(0x1400132f8c0, {0x10a514278, 0x14000d8b840}, 0x3b)
	github.com/cilium/[email protected]/cilium-cli/connectivity/check/test.go:397 +0x45c
github.com/cilium/cilium/cilium-cli/connectivity/check.(*ConnectivityTest).Run.func1()
	github.com/cilium/[email protected]/cilium-cli/connectivity/check/context.go:455 +0x68
created by github.com/cilium/cilium/cilium-cli/connectivity/check.(*ConnectivityTest).Run in goroutine 608
	github.com/cilium/[email protected]/cilium-cli/connectivity/check/context.go:449 +0x90

Versions:

cilium version
cilium-cli: v0.18.7 compiled with go1.25.0 on darwin/arm64
cilium image (default): v1.18.1
cilium image (stable): v1.18.2
cilium image (running): 1.18.2

Config:

ipam:
  mode: eni
routingMode: native
kubeProxyReplacement: true
egressMasqueradeInterfaces: eth+
eni:
  enabled: true
  updateEC2AdapterLimitViaAPI: true
cni:
  chainingMode: aws-cni
  exclusive: false
removeExternalCNI: true
encryption:
  enabled: true
  type: wireguard
  nodeEncryption: true
securityContext:
  capabilities:
    ciliumAgent:
      - CHOWN
      - KILL
      - NET_ADMIN
      - NET_RAW
      - IPC_LOCK
      - SYS_ADMIN
      - SYS_RESOURCE
      - DAC_OVERRIDE
      - FOWNER
      - SETGID
      - SETUID
    cleanCiliumState:
      - NET_ADMIN
      - SYS_ADMIN
      - SYS_RESOURCE

idyakonov-dev · Sep 26, 2025