fedora-coreos-pipeline
network infra flakes for quay.io cdn DNS
We occasionally see a DNS flake when using our aarch64 multi-arch builder:
[2023-04-11T21:46:41.208Z] + cosa remote-session create --image quay.io/coreos-assembler/coreos-assembler:main --expiration 4h --workdir /home/jenkins/agent/workspace/kola-upgrade
[2023-04-11T21:46:41.208Z] notice: failed to look up uid in /etc/passwd; enabling workaround
[2023-04-11T21:46:41.463Z] Trying to pull quay.io/coreos-assembler/coreos-assembler:main...
[2023-04-11T21:46:41.720Z] Error: copying system image from manifest list: parsing image configuration: Get "https://cdn03.quay.io/sha256/ff/ff59ae06a00f4d7543304a98dc73e8673786327b2dec2e853547b98c762c354b?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAI5LUAQGPZRPNKSJA%2F20230411%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20230411T214641Z&X-Amz-Expires=600&X-Amz-SignedHeaders=host&X-Amz-Signature=b6996d8ba1615daa726fd54fa1c0b3bf07f1b53c7413f1bc2c8c19be7b9e86ba&cf_sign=c14ihmYA50IPx0pEFD5QKHb9lWxFjkCHqeHsSHAboHM3edzLcFyxdLso5XVbxvk9QQlU3k1%2B03axO8emqmmh6sdm7gfaO4LbyYPUg0S7lKiaNEp5E6QhxUO2gCot3m0qHUtIgEz3KNX6wWwPFIHIsUbMjR5VUuJdHFR%2B36RYJo5J4w3g1BvDIcwRjiBml6GIKlfWCvImELxRZtS1%2FISds3stNENUJCTv%2FFgiygbuJrLKumDONeTFlAFgYnlNqM1uSuB2qt%2FJgJaYkoSuBlcPMQpU37bMe9TEYwJUnKjh4Fdqy9ywBQ8tiyJ51VtsJPfalWoboG8hNJ%2FnFv2INWwYeQ%3D%3D&cf_expiry=1681250201&region=us-east-1": dial tcp: lookup cdn03.quay.io: no such host
[2023-04-11T21:46:41.720Z] Error: exit status 125
[2023-04-11T21:46:41.720Z] Usage:
[2023-04-11T21:46:41.720Z]   remote-session create [flags]
[2023-04-11T21:46:41.720Z]
[2023-04-11T21:46:41.720Z] Flags:
[2023-04-11T21:46:41.720Z]       --expiration string   The amount of time before the remote-session auto-exits (default "infinity")
[2023-04-11T21:46:41.720Z]   -h, --help                help for create
[2023-04-11T21:46:41.720Z]       --image string        The COSA container image to use on the remote (default "quay.io/coreos-assembler/coreos-assembler:main")
[2023-04-11T21:46:41.720Z]       --workdir string      The COSA working directory to use inside the container (default "/srv")
[2023-04-11T21:46:41.720Z]
[2023-04-11T21:46:41.720Z] error: exit status 125
We only seem to see this on our aarch64 builder, which runs in AWS, which IIUC is also where Quay's infrastructure is hosted.
Should we add a retry knob to cosa remote-session create?
Probably. I'm not sure how intermittent the network problem is: it might resolve itself in a second, or it might last tens of seconds, so we'd have to experiment.
We could also experiment with using DNS from outside AWS on that builder and see if that helps.
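For what it's worth, assuming the builder is running systemd-resolved, a drop-in along these lines would be one way to try that (the resolver addresses below are just example public resolvers, not a recommendation):

```sh
# Sketch: point systemd-resolved at resolvers outside AWS.
# 1.1.1.1 and 8.8.8.8 are example public resolvers only.
sudo mkdir -p /etc/systemd/resolved.conf.d
sudo tee /etc/systemd/resolved.conf.d/external-dns.conf <<'EOF'
[Resolve]
DNS=1.1.1.1 8.8.8.8
EOF
sudo systemctl restart systemd-resolved
```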
At least `podman build` has the ability to retry when pulling from the registry; I see no such option for `podman run`.
We discussed this out-of-band. There's no retry for `podman pull` either, but we could wrap it in a retry loop ourselves, e.g. three attempts, as sketched below.
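Something like this shell-level wrapper is roughly what we discussed (the attempt count and sleep are placeholder values we'd have to tune):

```sh
# Sketch: retry the image pull up to 3 times before giving up.
pulled=""
for attempt in 1 2 3; do
    if podman pull quay.io/coreos-assembler/coreos-assembler:main; then
        pulled=1
        break
    fi
    echo "pull attempt ${attempt} failed; sleeping 5s before retrying" >&2
    sleep 5
done
[ -n "$pulled" ] || exit 125
```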
xref:
- https://github.com/containers/podman/issues/16973
@dustymabe disabling systemd-resolved fixed everything for us. We went from dozens of flakes per day to zero in a month. The one exception: we're still seeing the flake in Fedora gating tests, a different setup than Cirrus, where I have not yet disabled systemd-resolved (it has been on my TODO list for two weeks). And no, this is not an AWS-only issue: anywhere systemd-resolved is used, it will flake.
In the same boat here: we disabled systemd-resolved on Testing Farm workers back in 2021 or so, and saw no more weird DNS issues afterwards :(
I will follow up on this tomorrow; it seems it is time to find the root cause of this problem.
Until then, we will most probably just disable it as a workaround in Fedora CI, CentOS Stream CI, and Packit.
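For anyone else hitting this, the workaround amounts to something like the following (a sketch only; the exact steps depend on how the host manages /etc/resolv.conf):

```sh
# Sketch: stop routing DNS through the systemd-resolved stub resolver.
sudo systemctl disable --now systemd-resolved
# /etc/resolv.conf is often a symlink to the stub; replace it with a
# static file. The nameserver below is an example, not a recommendation.
sudo rm -f /etc/resolv.conf
echo 'nameserver 1.1.1.1' | sudo tee /etc/resolv.conf
```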
I chimed in over in https://github.com/containers/podman/issues/19770#issuecomment-1942376610
We should be able to switch to running a `podman pull` with `--retry` once https://github.com/containers/podman/commit/80b1e957000aec4b86f55691b8ceb0dd37308d36 lands in an FCOS release.
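i.e. something like the following (the retry count and delay are example values; the flag names are from the linked commit):

```sh
podman pull --retry 3 --retry-delay 5s quay.io/coreos-assembler/coreos-assembler:main
```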