v1.10 backports 2022-08-09
- #20624 -- docs: update etcd kvstore migration instructions (@hhoover)
- #20564 -- contrib: Add CRD generation to release process (@joestringer)
- #20649 -- daemon: Improve dnsproxy error when EP not found (@joestringer)
- #20643 -- helm: Guard apply sysctl init container (@sayboras)
- #20673 -- command: fix parsing of string map strings with multiple separators (@tklauser)
- #20697 -- clustermesh: Add EndpointSlice support for API server (@YutaroHayakawa)
- #20685 -- ci: fix code changes detection on push events (@nbusseneau)
- #20680 -- iptables: handle case where kernel IPv6 support is disabled (@jibi)
- #20449 -- fix subnet_id label value is empty (@wu0407)
- #20741 -- Fix ineffective post-start hook in ENI mode (@bmcustodio)
- #20757 -- pkg/k8s: set the right IP addresses in log messages (@aanm)
- #20750 -- Consider $GO environment variable for make precheck checks (@tklauser)
Once this PR is merged, you can update the PR labels via:
$ for pr in 20624 20564 20649 20643 20673 20697 20685 20680 20449 20741 20757 20750; do contrib/backporting/set-labels.py $pr done 1.10; done
@sayboras @tklauser @YutaroHayakawa @bmcustodio Please pay extra attention to the backporter's notes regarding conflict handling in your commits.
/test-backport-1.10
Job 'Cilium-PR-K8s-GKE' failed:
Test Name: K8sLRPTests Checks local redirect policy LRP connectivity
Failure Output: FAIL: Cilium operator was not able to get into ready state
If it is a flake and a GitHub issue doesn't already exist to track it, comment /mlh new-flake Cilium-PR-K8s-GKE so I can create one.
GKE failed with likely flake: https://jenkins.cilium.io/job/Cilium-PR-K8s-GKE/9017/
Cilium operator was not able to get into ready state
Expected
<*errors.errorString | 0xc00164c120>: {
s: "timed out waiting for pods with filter -l name=cilium-operator to be ready: 10m0s timeout expired",
}
to be nil
Re-running to check.
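For reference, the condition the harness times out on can be checked by hand; a minimal sketch, assuming Cilium is installed in the kube-system namespace:
$ kubectl -n kube-system get pods -l name=cilium-operator
$ kubectl -n kube-system wait --for=condition=Ready pod -l name=cilium-operator --timeout=10m
$ kubectl -n kube-system describe pod -l name=cilium-operator   # events usually show why it never became Ready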
Multicluster testing https://github.com/cilium/cilium/actions/runs/2825795501 failed with the "regular DNS issue":
❌ command "curl -w %{local_ip}:%{local_port} -> %{remote_ip}:%{remote_port} = %{response_code} --silent --fail --show-error --connect-timeout 5 --output /dev/null http://10.0.0.112:8080/" failed: command terminated with exit code 28
ℹ️ curl output:
curl: (28) Connection timeout after 5000 ms
:0 -> :0 = 000
❌ allow-all-except-world/pod-to-service/curl-0: cilium-test/client-6488dcf5d4-gc7ch (10.52.1.188) -> cilium-test/echo-other-node (echo-other-node:8080)
❌ allow-all-except-world/pod-to-service/curl-2: cilium-test/client2-5998d566b4-4kgs5 (10.52.1.74) -> cilium-test/echo-other-node (echo-other-node:8080)
❌ allow-all-except-world/pod-to-pod/curl-1: cilium-test/client-6488dcf5d4-gc7ch (10.52.1.188) -> cilium-test/echo-other-node-f4d46f75b-t8vqm (10.0.0.112:8080)
Which issue should we link to here? I can't seem to find the appropriate one again, only ones specific to particular situations (e.g. encryption enabled). Maybe @joestringer remembers?
How do you know it's a DNS issue? The command looks like it's connecting directly to http://10.0.0.112:8080/.
I don't have a good feel for the cilium-cli connectivity test failures, other than recalling that the encryption one specifically shows up after one successful non-encrypted run, which you can tell by looking at the steps that ran previously on the GitHub Actions page. It's not that one, since this is multicluster and it happened on the first cluster bootstrap.
At a glance it looks similar to https://github.com/cilium/cilium/issues/20186, but I don't think there's enough detail to tell if there are significant differences in the symptoms. That one was reported against v1.11 though.
Oh wow. I said DNS because earlier in the logs, there are these:
❌ command "curl -w %{local_ip}:%{local_port} -> %{remote_ip}:%{remote_port} = %{response_code} --silent --fail --show-error --connect-timeout 5 --output /dev/null http://echo-other-node:8080" failed: command terminated with exit code 28
ℹ️ curl output:
curl: (28) Connection timeout after 5001 ms
:0 -> :0 = 000
I had missed the IP-based failures, including the one I copied (I thought they were all the same). The issue you linked is similar and shows both kinds of failures.
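If this shows up again, a rough way to separate DNS flakes from datapath flakes is to run the two halves from the client pod directly; a sketch only, with the namespace and deployment names taken from the output above and nslookup being present in the client image assumed:
$ kubectl -n cilium-test exec deploy/client -- nslookup echo-other-node   # name resolution only
$ kubectl -n cilium-test exec deploy/client -- curl -sS --connect-timeout 5 http://10.0.0.112:8080/   # pod-to-pod path only, no DNS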
L4LB tests https://github.com/cilium/cilium/actions/runs/2825795538 failed with a known issue that's addressed by #20682 and #20834.
The GKE job flaked on a different test (K8sCLI CLI Identity CLI testing Test cilium bpf metrics list) on the re-run: https://jenkins.cilium.io/job/Cilium-PR-K8s-GKE/9023/
Cannot get cilium pod on k8s2
Expected
<*errors.errorString | 0xc000367740>: {
s: "Unable to get nodes with label 'k8s2': no matching node to read name with label 'k8s2'",
}
to be nil
I couldn't find another issue with the same symptoms, so I've created #20866.
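For anyone hitting #20866: the test helper looks nodes up by a 'k8s2' label, so a quick way to confirm the symptom is to list node labels and check whether anything carries it (the exact label key is set by the CI framework, so the grep below is only a loose filter):
$ kubectl get nodes --show-labels | grep k8s2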
/test-gke
Approvals are in, including for the backports that needed special attention due to conflicts. All remaining test failures are known flakes (multicluster) or will be followed up on separately (L4LB). Merged.