
network infra flakes for quay.io cdn DNS

Open dustymabe opened this issue 2 years ago • 10 comments

We occasionally see a DNS flake when using our aarch64 multi-arch builder.

```
[2023-04-11T21:46:41.208Z] + cosa remote-session create --image quay.io/coreos-assembler/coreos-assembler:main --expiration 4h --workdir /home/jenkins/agent/workspace/kola-upgrade
[2023-04-11T21:46:41.208Z] notice: failed to look up uid in /etc/passwd; enabling workaround
[2023-04-11T21:46:41.463Z] Trying to pull quay.io/coreos-assembler/coreos-assembler:main...
[2023-04-11T21:46:41.720Z] Error: copying system image from manifest list: parsing image configuration: Get "https://cdn03.quay.io/sha256/ff/ff59ae06a00f4d7543304a98dc73e8673786327b2dec2e853547b98c762c354b?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAI5LUAQGPZRPNKSJA%2F20230411%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20230411T214641Z&X-Amz-Expires=600&X-Amz-SignedHeaders=host&X-Amz-Signature=b6996d8ba1615daa726fd54fa1c0b3bf07f1b53c7413f1bc2c8c19be7b9e86ba&cf_sign=c14ihmYA50IPx0pEFD5QKHb9lWxFjkCHqeHsSHAboHM3edzLcFyxdLso5XVbxvk9QQlU3k1%2B03axO8emqmmh6sdm7gfaO4LbyYPUg0S7lKiaNEp5E6QhxUO2gCot3m0qHUtIgEz3KNX6wWwPFIHIsUbMjR5VUuJdHFR%2B36RYJo5J4w3g1BvDIcwRjiBml6GIKlfWCvImELxRZtS1%2FISds3stNENUJCTv%2FFgiygbuJrLKumDONeTFlAFgYnlNqM1uSuB2qt%2FJgJaYkoSuBlcPMQpU37bMe9TEYwJUnKjh4Fdqy9ywBQ8tiyJ51VtsJPfalWoboG8hNJ%2FnFv2INWwYeQ%3D%3D&cf_expiry=1681250201&region=us-east-1": dial tcp: lookup cdn03.quay.io: no such host
[2023-04-11T21:46:41.720Z] Error: exit status 125
[2023-04-11T21:46:41.720Z] Usage:
[2023-04-11T21:46:41.720Z]   remote-session create [flags]
[2023-04-11T21:46:41.720Z] 
[2023-04-11T21:46:41.720Z] Flags:
[2023-04-11T21:46:41.720Z]       --expiration string   The amount of time before the remote-session auto-exits (default "infinity")
[2023-04-11T21:46:41.720Z]   -h, --help                help for create
[2023-04-11T21:46:41.720Z]       --image string        The COSA container image to use on the remote (default "quay.io/coreos-assembler/coreos-assembler:main")
[2023-04-11T21:46:41.720Z]       --workdir string      The COSA working directory to use inside the container (default "/srv")
[2023-04-11T21:46:41.720Z] 
[2023-04-11T21:46:41.720Z] error: exit status 125
```

We only seem to see this on our aarch64 builder, which is located in AWS; IIUC that's also where Quay's infrastructure is hosted.

dustymabe avatar Apr 11 '23 23:04 dustymabe

Should we add a retry knob to `cosa remote-session create`?

jlebon avatar Apr 13 '23 16:04 jlebon

Probably. I'm not sure how transient the network problem is: it might resolve itself in a second, or it might last tens of seconds. So we'd have to experiment.

dustymabe avatar Apr 13 '23 19:04 dustymabe

We could also experiment with using DNS servers outside AWS on that builder and see if that helps.
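Something like this NetworkManager drop-in would be the experiment (sketch only; assumes NetworkManager manages DNS on the builder, and the resolver addresses here are just examples):

```bash
# Hypothetical override: send all DNS queries to public resolvers
# instead of the AWS VPC resolver.
sudo tee /etc/NetworkManager/conf.d/90-dns-override.conf <<'EOF'
[global-dns-domain-*]
servers=1.1.1.1,8.8.8.8
EOF
sudo systemctl reload NetworkManager
```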

dustymabe avatar Apr 13 '23 19:04 dustymabe

At least `podman build` has the ability to retry when pulling from the registry. I see no such option for `podman run`.
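For reference, the build-side knob looks roughly like this (flag names per the podman-build docs; the image tag and retry values are placeholders):

```bash
# Retry failed registry pulls up to 3 times while building.
podman build --retry 3 --retry-delay 10s -t example/test:latest .
```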

dustymabe avatar Apr 13 '23 19:04 dustymabe

We discussed this out-of-band. There's no built-in retry for `podman pull` either, but we could wrap it ourselves and retry it, e.g., 3 times.
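i.e. a wrapper along these lines (sketch only; the 10s backoff is a guess we'd tune):

```bash
# Retry `podman pull` a few times to paper over transient DNS failures.
pull_with_retries() {
    local image=$1 attempt
    for attempt in 1 2 3; do
        podman pull "$image" && return 0
        echo "pull attempt ${attempt} failed, retrying..." >&2
        sleep 10
    done
    echo "giving up on ${image}" >&2
    return 1
}

pull_with_retries quay.io/coreos-assembler/coreos-assembler:main
```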

jlebon avatar Apr 28 '23 13:04 jlebon

xref:

  • https://github.com/containers/podman/issues/16973

dustymabe avatar May 05 '23 20:05 dustymabe

@dustymabe disabling systemd-resolved fixed everything for us. We went from dozens of flakes per day to zero in a month. The one exception: we're still seeing the flake in Fedora gating tests, a different setup than Cirrus, where I have not yet disabled systemd-resolved (it has been on my TODO list for two weeks). And no, this is not an AWS-only issue: anywhere systemd-resolved is used, it will flake.

edsantiago avatar May 08 '23 10:05 edsantiago

Same boat here: we disabled systemd-resolved on Testing Farm workers back in 2021 or so, and had no more weird DNS issues afterwards :(

I will follow up on this tomorrow; it seems it is time to find the root cause of this problem.

Until then, we will most probably just disable it as a workaround in Fedora CI, CentOS Stream CI, and Packit.
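For anyone following along, the workaround amounts to something like this (sketch; assumes NetworkManager is available to take over /etc/resolv.conf):

```bash
# Take systemd-resolved out of the resolution path.
sudo systemctl disable --now systemd-resolved
# Drop the stub-resolver symlink so /etc/resolv.conf can list real nameservers.
sudo rm -f /etc/resolv.conf
# Have NetworkManager write /etc/resolv.conf directly.
printf '[main]\ndns=default\n' | sudo tee /etc/NetworkManager/conf.d/no-systemd-resolved.conf
sudo systemctl restart NetworkManager
```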

thrix avatar Jul 24 '23 17:07 thrix

I chimed in over in https://github.com/containers/podman/issues/19770#issuecomment-1942376610

dustymabe avatar Feb 13 '24 20:02 dustymabe

We should be able to switch to running `podman pull` with `--retry` once https://github.com/containers/podman/commit/80b1e957000aec4b86f55691b8ceb0dd37308d36 lands in an FCOS release.
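At that point the fix on our side should be a one-liner along these lines (illustrative; the retry count and delay are arbitrary):

```bash
podman pull --retry 3 --retry-delay 10s quay.io/coreos-assembler/coreos-assembler:main
```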

dustymabe avatar Feb 19 '24 14:02 dustymabe