
v1.10 backports 2022-08-09


  • #20624 -- docs: update etcd kvstore migration instructions (@hhoover)
  • #20564 -- contrib: Add CRD generation to release process (@joestringer)
  • #20649 -- daemon: Improve dnsproxy error when EP not found (@joestringer)
  • #20643 -- helm: Guard apply sysctl init container (@sayboras)
  • #20673 -- command: fix parsing of string map strings with multiple separators (@tklauser)
  • #20697 -- clustermesh: Add EndpointSlice support for API server (@YutaroHayakawa)
  • #20685 -- ci: fix code changes detection on push events (@nbusseneau)
  • #20680 -- iptables: handle case where kernel IPv6 support is disabled (@jibi)
  • #20449 -- fix subnet_id label value is empty (@wu0407)
  • #20741 -- Fix ineffective post-start hook in ENI mode (@bmcustodio)
  • #20757 -- pkg/k8s: set the right IP addresses in log messages (@aanm)
  • #20750 -- Consider $GO environment variable in make precheck checks (@tklauser)

Once this PR is merged, you can update the PR labels via:

$ for pr in 20624 20564 20649 20643 20673 20697 20685 20680 20449 20741 20757 20750; do contrib/backporting/set-labels.py $pr done 1.10; done
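
Note that the first "done" above is an argument to set-labels.py (it marks the backport as done), not the shell keyword; only the second one closes the loop. To spot-check the result afterwards, something like this should work (a sketch, assuming the GitHub CLI gh is installed and authenticated; the exact label names are from memory):

# Print each PR's labels; expect something like backport-done/1.10
# in place of backport-pending/1.10 once set-labels.py has run.
$ for pr in 20624 20564 20649 20643 20673 20697 20685 20680 20449 20741 20757 20750; do
    gh pr view $pr --repo cilium/cilium --json labels --jq '.labels[].name'
  done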

nbusseneau avatar Aug 09 '22 12:08 nbusseneau

@sayboras @tklauser @YutaroHayakawa @bmcustodio Please pay extra attention to the backporter's notes regarding conflict handling in your commits.

nbusseneau avatar Aug 09 '22 12:08 nbusseneau

/test-backport-1.10

Job 'Cilium-PR-K8s-GKE' failed:

Test Name: K8sLRPTests Checks local redirect policy LRP connectivity

Failure Output: FAIL: Cilium operator was not able to get into ready state

If it is a flake and a GitHub issue doesn't already exist to track it, comment /mlh new-flake Cilium-PR-K8s-GKE so I can create one.

nbusseneau avatar Aug 09 '22 13:08 nbusseneau

GKE failed with a likely flake: https://jenkins.cilium.io/job/Cilium-PR-K8s-GKE/9017/

Cilium operator was not able to get into ready state
Expected
    <*errors.errorString | 0xc00164c120>: {
        s: "timed out waiting for pods with filter -l name=cilium-operator to be ready: 10m0s timeout expired",
    }
to be nil
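
(Side note: the check the harness timed out on can be approximated by hand. A rough sketch, assuming kubectl access to the GKE test cluster and that the operator is deployed in kube-system, which may differ per install:)

# List operator pods using the same label filter as the test.
$ kubectl -n kube-system get pods -l name=cilium-operator -o wide
# Wait for readiness with the same 10-minute budget the harness uses.
$ kubectl -n kube-system wait --for=condition=Ready pod -l name=cilium-operator --timeout=10m
# On timeout, the events in the describe output usually say why.
$ kubectl -n kube-system describe pods -l name=cilium-operator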

Re-running to check.

nbusseneau avatar Aug 09 '22 16:08 nbusseneau

Multicluster testing https://github.com/cilium/cilium/actions/runs/2825795501 failed with the "regular DNS issue":

❌ command "curl -w %{local_ip}:%{local_port} -> %{remote_ip}:%{remote_port} = %{response_code} --silent --fail --show-error --connect-timeout 5 --output /dev/null http://10.0.0.112:8080/" failed: command terminated with exit code 28
  ℹ️  curl output:
  curl: (28) Connection timeout after 5000 ms
:0 -> :0 = 000
  ❌ allow-all-except-world/pod-to-service/curl-0: cilium-test/client-6488dcf5d4-gc7ch (10.52.1.188) -> cilium-test/echo-other-node (echo-other-node:8080)
  ❌ allow-all-except-world/pod-to-service/curl-2: cilium-test/client2-5998d566b4-4kgs5 (10.52.1.74) -> cilium-test/echo-other-node (echo-other-node:8080)
  ❌ allow-all-except-world/pod-to-pod/curl-1: cilium-test/client-6488dcf5d4-gc7ch (10.52.1.188) -> cilium-test/echo-other-node-f4d46f75b-t8vqm (10.0.0.112:8080)
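
For reference, the failing check can be replayed by hand along these lines (a sketch; the pod name and target IP are copied from this run's output and will differ between runs):

# curl exit code 28 is a connect timeout; replaying the request from
# inside the client pod confirms whether the path is still broken.
$ kubectl -n cilium-test exec client-6488dcf5d4-gc7ch -- \
    curl --silent --fail --show-error --connect-timeout 5 \
    --output /dev/null http://10.0.0.112:8080/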

Which issue should we link to here? I can't find the appropriate one this time, only ones specific to certain situations (e.g. with encryption enabled). Maybe @joestringer remembers?

nbusseneau avatar Aug 09 '22 16:08 nbusseneau

How do you know it's a DNS issue? The command looks like it's connecting directly to http://10.0.0.112:8080/.

I don't have a good feel for the cilium-cli connectivity test failures, other than recalling that the encryption one specifically shows up after one successful non-encrypted run, which you can tell by looking at the steps that ran previously on the GitHub Actions page. It's not that one, since this is multicluster and it happened on first cluster bootstrap.

joestringer avatar Aug 09 '22 17:08 joestringer

At a glance it looks similar to https://github.com/cilium/cilium/issues/20186, but I don't think there's enough detail to tell if there are significant differences in the symptoms. That one was reported against v1.11 though.

joestringer avatar Aug 09 '22 17:08 joestringer

Oh wow. I said DNS because earlier in the logs there are failures like these:

   ❌ command "curl -w %{local_ip}:%{local_port} -> %{remote_ip}:%{remote_port} = %{response_code} --silent --fail --show-error --connect-timeout 5 --output /dev/null http://echo-other-node:8080" failed: command terminated with exit code 28
  ℹ️  curl output:
  curl: (28) Connection timeout after 5001 ms
:0 -> :0 = 000

I had missed the direct-IP failures, including the one I copied (I thought they were all the same kind). The issue you linked is similar and shows both kinds of failures.
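
For next time, the two failure modes can be told apart from inside the client pod (a sketch; the pod name is from this run, and curl is the only tool assumed to be present in the client image):

# curl by service name exercises DNS plus the datapath...
$ kubectl -n cilium-test exec client-6488dcf5d4-gc7ch -- \
    curl --connect-timeout 5 --output /dev/null http://echo-other-node:8080
# ...while curl by pod IP exercises the datapath alone. Both timing
# out, as here, points away from DNS.
$ kubectl -n cilium-test exec client-6488dcf5d4-gc7ch -- \
    curl --connect-timeout 5 --output /dev/null http://10.0.0.112:8080/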

nbusseneau avatar Aug 10 '22 10:08 nbusseneau

L4LB tests https://github.com/cilium/cilium/actions/runs/2825795538 failed with a known issue that's addressed by #20682 and #20834.

nbusseneau avatar Aug 10 '22 14:08 nbusseneau

> GKE failed with a likely flake: https://jenkins.cilium.io/job/Cilium-PR-K8s-GKE/9017/
>
> Cilium operator was not able to get into ready state
> Expected
>     <*errors.errorString | 0xc00164c120>: {
>         s: "timed out waiting for pods with filter -l name=cilium-operator to be ready: 10m0s timeout expired",
>     }
> to be nil
>
> Re-running to check.

Looks like it flaked on a different test (K8sCLI CLI Identity CLI testing Test cilium bpf metrics list) on re-run: https://jenkins.cilium.io/job/Cilium-PR-K8s-GKE/9023/

Cannot get cilium pod on k8s2
Expected
    <*errors.errorString | 0xc000367740>: {
        s: "Unable to get nodes with label 'k8s2': no matching node to read name with label 'k8s2'",
    }
to be nil
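
In case someone wants to dig into this one, a quick sketch for checking which node labels the cluster actually has (the cilium.io/ci-node key is an assumption about how the Ginkgo harness tags nodes; adjust to whatever the framework actually uses):

# Show all node labels and grep for the expected k8s2 marker.
$ kubectl get nodes --show-labels | grep -i k8s2
# Or select directly on the assumed CI label key.
$ kubectl get nodes -l cilium.io/ci-node=k8s2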

Couldn't find another issue with the same symptoms, so I've created #20866

tklauser avatar Aug 11 '22 08:08 tklauser

/test-gke

tklauser avatar Aug 11 '22 08:08 tklauser

Approvals are in, notably for the backports that needed special attention due to conflicts. All remaining test failures are either known flakes (multicluster) or will be followed up on separately (L4LB). Merged.

tklauser avatar Aug 11 '22 09:08 tklauser