
build.py: podman imagebuilder out of sync

Open efahl opened this issue 1 year ago • 33 comments

cc: @lorand-horvath

ASU build has been generating these errors on x86/64 snapshot for a day or so, several similar errors being reported on the forum.

Received incorrect version r26608-b2a84c86e3 (requested r26620-6b03050192)

Since the "janitor: drop" commit the metadata updates rapidly, and both https://sysupgrade.openwrt.org/json/v1/snapshots/targets/x86/64/generic.json and https://downloads.openwrt.org/snapshots/targets/x86/64/profiles.json contain the correct 26620 version_code, so it seems that the previous source of this issue has been resolved.

This time it appears to me that the podman image is out of sync, lagging behind the current downloads imagebuilder. I couldn't find anything in update.py that would invalidate or update the podman.images collection.

But in build.py we know when something is wrong. Just a wild, untested idea: when this specific error is detected, maybe just delete the podman image, so the next try does a fresh podman.images.pull? (At least I'm assuming that's how pull works.) In build.py:

    log.info(f"Pulling {image}...")
    pm_image = podman.images.pull(image)        <<< Save the returned image...
    log.info(f"Pulling {image}... done")

...

    if "version_code" in req:
        if version_code != req.get("version_code"):
            pm_image.remove(force=True)          <<< Remove it before reporting error, or maybe 'reload'???
            report_error(

Refs: https://podman-py.readthedocs.io/en/stable/podman.domain.images_manager.html#podman.domain.images_manager.ImagesManager.pull https://podman-py.readthedocs.io/en/stable/podman.domain.images.html#podman.domain.images.Image.remove
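The idea above can be sketched as a small helper; the function name and call shape here are illustrative, not taken from asu/build.py, and this is untested against a real podman daemon:

```python
def check_image_version(pm_image, version_code, requested_code):
    """Return True if the imagebuilder matches the requested version_code.

    On a mismatch, force-remove the cached podman image so that the next
    podman.images.pull() fetches a fresh copy instead of reusing the
    stale one. pm_image is the Image object returned by pull().
    """
    if requested_code is None or version_code == requested_code:
        return True
    # Drop the stale image; per podman-py, Image.remove(force=True) also
    # removes containers based on the image.
    pm_image.remove(force=True)
    return False
```

Whether this actually helps depends on pull() behaving as assumed, i.e. re-downloading once the local image is gone.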

efahl avatar Jun 14 '24 15:06 efahl

@aparcar Does my analysis above make any sense? I just tried again, and the metadata is now updated to 26629, but ASU's imagebuilder remains back on 26608 as above...

Error: Received incorrect version r26608-b2a84c86e3 (requested r26629-7a9b1f7f25)

efahl avatar Jun 14 '24 22:06 efahl

The strangest thing just occurred: I requested a custom build with ASU (luci-app-attendedsysupgrade); it said it had a newer build available, namely 26620. I requested it, but it came back with the old version, stuck at 26593 (as I expected, due to this bug). I proceeded to upgrade to this so-called 26593 version, at least that's what ASU said it was.

After it flashed and rebooted, I checked the version in LuCI and, lo and behold, it had actually flashed the latest version 26620. What the heck is going on?

Note: ath79/generic unit WR1043ND v3

lorand-horvath avatar Jun 15 '24 09:06 lorand-horvath

luci-app-attendedsysupgrade

There's your culprit in this case. Everything else (auc, owut, firmware selector) specifies the version_code in the build blob, which the ASU builder uses to verify that the request and the imagebuilder match (https://github.com/openwrt/asu/blob/main/asu/build.py#L72). If you leave it out, then you can sometimes get a build to go through even when the metadata differs from the actual imagebuilder.

I experimented with an option in owut to turn off the version code spec, but decided it was more robust to keep it and to try to fix up the builds, because half (or more) of the time it didn't work and just wasted build cycles (got a different error later, like "package version mismatch" or some such thing).

efahl avatar Jun 15 '24 12:06 efahl

Right now (2024-06-15 18:40Z) the ASU server is ahead of the downloads site:

snapshots x86/64 generic
  r26634-7d9b9762c9  ASU Server
  r26633-22f92cce22  Build Server

But the ASU builder is still stuck at 26608

Error: Received incorrect version r26608-b2a84c86e3 (requested r26633-22f92cce22)

My interpretation is that the webhooks are not flushing the old imagebuilders (and are firing off at the start of the build, not at its completion???).

efahl avatar Jun 15 '24 18:06 efahl

WNDRMAC v2 (ath79/generic) - I tried to build a new custom firmware with the firmware selector; it said 26620 was available, but after the build it still came back with the old 26593 from 5 days ago.

Then I checked with ASU (luci-app-attendedsysupgrade) and it said 26620 was available. I requested it, it came back with 26593, and I even downloaded it from the ASU interface to my PC and can confirm it's indeed 26593. Then I proceeded to ask ASU to actually flash it and, as you might have guessed, it's now on 26620. Very strange.

This is a mess.

Also, since ath79 was bumped from kernel 6.1 to 6.6 yesterday, it doesn't even build at all anymore.

lorand-horvath avatar Jun 16 '24 07:06 lorand-horvath

As far as I understand, it is a similar problem and is "under construction": https://forum.openwrt.org/t/auc-sysupgrade-does-not-work-for-me/196782/39

pko76 avatar Jun 16 '24 10:06 pko76

There's your culprit in this case. Everything else (auc, owut, firmware selector) specifies the version_code in the build blob, which the ASU builder uses to verify that the request and the imagebuilder match (https://github.com/openwrt/asu/blob/main/asu/build.py#L72). If you leave it out, then you can sometimes get a build to go through even when the metadata differs from the actual imagebuilder.

In the meantime before this problem get fixed, I've created https://github.com/openwrt/luci/pull/7163 to add the check to luci-app-attendedsysupgrade so it behaves in the same way as auc.

dannil avatar Jun 16 '24 16:06 dannil

My guess is that this issue has multiple layers. First, the Docker containers download from downloads.openwrt.org, which is behind a CDN. Even though the caches are cleared on every fresh target upload, there may be a timing issue where Docker containers start downloading ImageBuilders while the caches are not yet fully cleared (a matter of a few seconds, I think).

Second is that ASU caches successful builds for 7 days. To identify the "same" build request and allow this kind of caching, a bunch of parameters are taken into account, including version_code. Since LuCI lacked that parameter, builds from requests up to 7 days old would be returned. Thanks to @dannil this issue should be fixed for LuCI.
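The caching behavior described here can be illustrated with a toy cache key; this is not ASU's actual hashing code, just a sketch of why omitting version_code makes two otherwise different requests collide on the same cached build:

```python
import hashlib
import json

def request_hash(req):
    """Reduce a build request to a stable cache key.

    Requests that serialize identically share one cached build, so a
    request without version_code can match a build made against an
    older imagebuilder.
    """
    canonical = json.dumps(req, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()

with_code = {
    "target": "x86/64",
    "profile": "generic",
    "packages": ["luci"],
    "version_code": "r26620-6b03050192",
}
# The same request as LuCI used to send it: no version_code at all.
without_code = {k: v for k, v in with_code.items() if k != "version_code"}
```

This also explains the 'add "#" to uci-defaults' trick mentioned later in the thread: any change to the request changes the key and bypasses the stale cached build.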

Right now this isn't happening for the firmware selector, so please feel free to implement it there, too.

aparcar avatar Jun 17 '24 07:06 aparcar

        pm_image.remove(force=True)          <<< Remove it before reporting error, or maybe 'reload'???

Using podman.images.pull() updates the image whenever something new is available; there's no need to remove it locally.

aparcar avatar Jun 17 '24 07:06 aparcar

As they say in German, all good things come in threes: the GitHub token ran out over the weekend, causing the webhook to never run any container updates.

aparcar avatar Jun 17 '24 09:06 aparcar

The GitHub token ran out over the weekend causing the webhook to never run any container updates.

GAH! That explains a lot.

efahl avatar Jun 17 '24 14:06 efahl

I'm still seeing the same issue. Ran two builds on different x86 machines, about a half hour apart.

Received incorrect version r26657-531b3f667c (requested r26670-5f9fb964c3)

ASU and downloads profiles both match at the 26670 version, and the image has been updated from before, but it's now stuck at 26657.

Builds at https://sysupgrade.openwrt.org/api/v1/build/eaf666a98b390bda3f8896a8b4000ee0 https://sysupgrade.openwrt.org/api/v1/build/eecd821426ce6f223f079b925029be2c if that helps...

efahl avatar Jun 17 '24 16:06 efahl

Ok, just tried again with the old 'add "#" to uci-defaults' trick to force a new hash, and it worked (i.e., got a valid image and installed it), so the pull is working as expected.

If I leave off the uci-defaults, I get the same 'incorrect version' as before, from the cached build that failed... It still seems like that remove might alleviate these "in-between" errors somewhat, at least for that specific error. (This assumes pm_image.remove works the way I think it does, which is completely untested.)

efahl avatar Jun 17 '24 20:06 efahl

Explain this: Error: Received incorrect version r26674-29b98a0192 (requested r26620-6b03050192). I actually requested 26674, not 26620 - that's the currently installed version.

lorand-horvath avatar Jun 18 '24 05:06 lorand-horvath

requested 26674, not 26620

What upgrade client did you use that gave this?

This typically happens when the sysupgrade metadata lags behind the build, which is the issue that Paul's webhook implementation addressed.

efahl avatar Jun 18 '24 15:06 efahl

requested 26674, not 26620

What upgrade client did you use that gave this?

Latest luci-app-attendedsysupgrade 24.168.60561~8860ca0

lorand-horvath avatar Jun 18 '24 18:06 lorand-horvath

requested 26674, not 26620

What upgrade client did you use that gave this?

Latest luci-app-attendedsysupgrade 24.168.60561~8860ca0

We may have borked the check in LuCI ASU with my change in https://github.com/openwrt/luci/pull/7163, see https://github.com/openwrt/luci/pull/7163#issuecomment-2176678529. For now use auc or the firmware selector to upgrade, I have a fix coming as long as I get confirmation of my suspicions.

dannil avatar Jun 18 '24 18:06 dannil

Just for your info. I have just upgraded from 23.05-SNAPSHOT r23901 to 23.05-SNAPSHOT r23902-50148a40d2 using luci-i18n-attendedsysupgrade-pl (git-24.167.26541-c74a06c), or according to luci-app-attendedsysupgrade (git-23.339.51123-138595a), and to my surprise, it worked fine. The detected "current" and "new offered" versions were correct.

pko76 avatar Jun 18 '24 20:06 pko76

according luci-app-attendedsysupgrade (git-23.339.51123-138595a)

That would be as expected, as that version doesn't contain the version_code fix, so it works most of the time. If no version_code is supplied, then neither LuCI ASU nor the ASU build server does any checks, and you get whatever the build server happens to have on hand.

efahl avatar Jun 18 '24 20:06 efahl

So, you mean I was just lucky this time? It wasn't working well for previous tries, yesterday and before. Is it a random process?

pko76 avatar Jun 18 '24 20:06 pko76

It's not so much luck, but rather LuCI ASU did the right thing like 99% of the time.

It all depends on the state of the ASU server's cached builds, and how they correspond to the profile metadata in those two files I linked in the opening comment (actually the ones for your specific platform).

If the version_code is not specified, as was the case before @dannil's fix to LuCI ASU, then you get whatever happens to be the latest build if there's nothing matching in the ASU server's cache, or the cached build if there is one.

This works almost all the time, because the number of upgrades people do for a specific platform isn't very big, but if someone did an upgrade in that grey period between when the imagebuilder is published and the metadata is updated to correspond, then things would be out of whack for anyone else using the same platform, up to the time the next build came out and the builders/metadata got updated.

The window used to be "too long" and sometimes would be days, causing lots of issues, and that's where aparcar's "webhooks" patch came from: it reduces the window to seconds instead of minutes, hours or sometimes "forever".

The missing version_code in LuCI ASU is also the reason that some people would try auc, get an error, then try LuCI ASU and it "would work". Well, it worked in that you got an upgrade, but it wasn't the latest, it was often the build before the latest for these specific cases.

efahl avatar Jun 18 '24 21:06 efahl

Thank you efahl. That explains the cases where I had to remove or add packages to succeed with the LuCI upgrade. I will monitor the situation and let you know.

pko76 avatar Jun 18 '24 22:06 pko76

We may have borked the check in LuCI ASU with my change in openwrt/luci#7163, see openwrt/luci#7163 (comment). For now use auc or the firmware selector to upgrade, I have a fix coming as long as I get confirmation of my suspicions.

@dannil LuCI ASU is definitely broken now. I've just tried to use it and it errors out with:

Error building the firmware image
Server response: Error: Received incorrect version r26693-368441254e (requested r26679-70088a7e4c)

even though the requested version was 26693 (the new version), not 26679 (currently installed).

lorand-horvath avatar Jun 19 '24 08:06 lorand-horvath

We may have borked the check in LuCI ASU with my change in openwrt/luci#7163, see openwrt/luci#7163 (comment). For now use auc or the firmware selector to upgrade, I have a fix coming as long as I get confirmation of my suspicions.

@dannil LuCI ASU is definitely broken now. I've just tried to use it and it errors out with:

Error building the firmware image
Server response: Error: Received incorrect version r26693-368441254e (requested r26679-70088a7e4c)

even though the requested version was 26693 (the new version), not 26679 (currently installed).

https://github.com/openwrt/luci/pull/7168. I will still try to figure out a way to solve the issue with LuCI ASU as well so the version_code parameter is sent to the API. However, the solution became quite a bit more involved, since LuCI ASU also allows you to sysupgrade to the currently installed version if you check the box for advanced mode, and specifying version_code as an older revision doesn't work on snapshots (the server errors out); for auc and the firmware selector this isn't a use case, so we didn't hit it there.

dannil avatar Jun 19 '24 17:06 dannil

specifying version_code to an older revision doesn't work on snapshots

Older snapshots aren't a use case for any upgrade tool. The only way to get a non-current snapshot is to build from source, since the snapshot imagebuilders and packages aren't archived, just deleted.

All of the following give me the proper value for upgrading to current for that version, just stick that in the request blob.revision_code and it should work:

https://sysupgrade.openwrt.org/api/v1/revision/22.03.6/x86/64 https://sysupgrade.openwrt.org/api/v1/revision/23.05.3/x86/64 https://sysupgrade.openwrt.org/api/v1/revision/23.05-SNAPSHOT/x86/64 https://sysupgrade.openwrt.org/api/v1/revision/SNAPSHOT/x86/64

(There are still some versions that have not updated properly since the webhooks commit; I've been watching 22.03-SNAPSHOT, which is stuck at 20311 on the ASU server but should be 20315 according to downloads, so avoid that in your testing.)
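The suggestion above can be sketched roughly as follows. The revision endpoint URLs are the ones listed; the JSON field name ("revision") and the helper names are assumptions for illustration, not verified against the ASU API:

```python
import json
from urllib.request import urlopen

def current_revision(version, target):
    """Ask the ASU revision endpoint for the current revision of a
    release/target pair, e.g. current_revision("SNAPSHOT", "x86/64").
    NOTE: the "revision" field name is assumed, not verified.
    """
    url = f"https://sysupgrade.openwrt.org/api/v1/revision/{version}/{target}"
    with urlopen(url) as resp:
        return json.load(resp)["revision"]

def pin_request(req, revision):
    """Return a copy of a build request pinned to a specific revision,
    so the ASU builder can reject a stale imagebuilder."""
    pinned = dict(req)
    pinned["version_code"] = revision
    return pinned
```

The network call is only needed once per build; the pinned request then either matches the imagebuilder or fails fast with the "incorrect version" error instead of silently returning an older image.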

efahl avatar Jun 20 '24 13:06 efahl

@aparcar I've been tracking the server updates for about two months now, and currently the most prevalent problem is that it looks like the CDN is not flushing caches. Here are three of the target/profile/version combinations I'm watching, showing both downloads and mirror-03 in sync, but the sysupgrade site behind.

I download the four (well, three as ASU API is of course a redirect to ASU static), extract the version_code and check them against one another.

The goofy one here is x86, where downloads is also behind the ASU mirror...

snapshots tegra/generic compulab_trimslice
  r27032-274b18d105  ASU API    https://sysupgrade.openwrt.org/api/v1/revision/SNAPSHOT/tegra/generic
  r27032-274b18d105  ASU Server https://sysupgrade.openwrt.org/json/v1/snapshots/targets/tegra/generic/compulab_trimslice.json
  r27043-fe9c99806f  DL  Server https://downloads.openwrt.org/snapshots/targets/tegra/generic/profiles.json
  r27043-fe9c99806f  Mirror3    https://mirror-03.infra.openwrt.org/snapshots/targets/tegra/generic/profiles.json
snapshots mediatek/mt7622 linksys_e8450-ubi
  r27032-274b18d105  ASU API    https://sysupgrade.openwrt.org/api/v1/revision/SNAPSHOT/mediatek/mt7622
  r27032-274b18d105  ASU Server https://sysupgrade.openwrt.org/json/v1/snapshots/targets/mediatek/mt7622/linksys_e8450-ubi.json
  r27044-249b949a26  DL  Server https://downloads.openwrt.org/snapshots/targets/mediatek/mt7622/profiles.json
  r27044-249b949a26  Mirror3    https://mirror-03.infra.openwrt.org/snapshots/targets/mediatek/mt7622/profiles.json
snapshots x86/64 generic
  r27041-7686ce4a91  ASU API    https://sysupgrade.openwrt.org/api/v1/revision/SNAPSHOT/x86/64
  r27041-7686ce4a91  ASU Server https://sysupgrade.openwrt.org/json/v1/snapshots/targets/x86/64/generic.json
  r27041-7686ce4a91  DL  Server https://downloads.openwrt.org/snapshots/targets/x86/64/profiles.json
  r27046-940f83cc8d  Mirror3    https://mirror-03.infra.openwrt.org/snapshots/targets/x86/64/profiles.json
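The check described ("extract the version_code and compare them") might look something like this; the URLs come from the log above, the version_code field is the one both json files are known to contain, and everything else (names, structure, error handling) is illustrative:

```python
import json
from urllib.request import urlopen

# Sources for one target, as listed in the log above.
SOURCES = {
    "ASU Server": "https://sysupgrade.openwrt.org/json/v1/snapshots/targets/x86/64/generic.json",
    "DL Server": "https://downloads.openwrt.org/snapshots/targets/x86/64/profiles.json",
    "Mirror3": "https://mirror-03.infra.openwrt.org/snapshots/targets/x86/64/profiles.json",
}

def version_codes(fetched):
    """Map each source name to the version_code in its JSON blob."""
    return {name: blob["version_code"] for name, blob in fetched.items()}

def in_sync(codes):
    """True when every source reports the same version_code."""
    return len(set(codes.values())) == 1

if __name__ == "__main__":
    fetched = {name: json.load(urlopen(url)) for name, url in SOURCES.items()}
    codes = version_codes(fetched)
    for name, code in codes.items():
        print(f"  {code}  {name}")
    print("in sync" if in_sync(codes) else "OUT OF SYNC")
```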

efahl avatar Jul 30 '24 17:07 efahl

I just checked, and Mirror3 doesn't set the Cache-Control: header, but it does set the Last-Modified: header as well as ETag:, so all the classic HTTP cache machinery should just work without needing anything more than that...

dangowrt avatar Jul 30 '24 17:07 dangowrt

Yeah, and both sysupgrade and asu-1.infra set both etag and last-modified, too. So it looks like the notification telling sysupgrade to update is missing something.

I sometimes see half-hour differences, but many of them stick for much longer. Here is a pair of representative log entries (nearing two days for this one):

ds=downloads, as=sysupgrade, is=mirror-03

2024-07-29T23:45Z ds=r27041-7686ce4a91 as=r27002-a9f58dddb7 is=r27041-7686ce4a91 d=ath79/generic-snapshots
2024-07-30T18:00Z ds=r27041-7686ce4a91 as=r27002-a9f58dddb7 is=r27041-7686ce4a91 d=ath79/generic-snapshots

My path to the sysupgrade server looks like it crosses the Atlantic at hops 8-9.

$ nslookup sysupgrade.openwrt.org
Server:         127.0.0.1
Address:        127.0.0.1:53

Non-authoritative answer:
sysupgrade.openwrt.org  canonical name = asu-01.infra.openwrt.org
Name:   asu-01.infra.openwrt.org
Address: 2a03:b0c0:3:d0::1574:1

Non-authoritative answer:
sysupgrade.openwrt.org  canonical name = asu-01.infra.openwrt.org
Name:   asu-01.infra.openwrt.org
Address: 206.81.21.114

$ tracepath 2a03:b0c0:3:d0::1574:1
 1?: [LOCALHOST]                        0.015ms pmtu 1500
 1:  rtr00-br-lan.brain.lan                                0.884ms
 1:  rtr00-br-lan.brain.lan                                0.408ms
 2:  2600:8802:48ff:ffff::1111                            11.814ms
 3:  2001:578:400:7:f:0:2:8026                            13.811ms
 4:  2001:578:400:4:5000::34                              23.728ms
 5:  2001:578:1:0:172:17:249:32                           20.159ms
 6:  no reply
 7:  rest-bb1-v6.ip.twelve99.net                          69.803ms
 8:  nyk-bb2-v6.ip.twelve99.net                           72.144ms
 9:  ldn-bb1-v6.ip.twelve99.net                          151.990ms
10:  slou-b2-v6.ip.twelve99.net                          148.377ms
11:  digitalocean-ic-378006.ip.twelve99-cust.net         153.614ms
12:  2a03:b0c0:fffe::92                                  153.379ms
13:  no reply
14:  no reply
15:  no reply
16:  no reply
17:  asu-01.infra.openwrt.org                            165.440ms reached
     Resume: pmtu 1500 hops 17 back 17

$ tracepath 206.81.21.114
 1?: [LOCALHOST]                      pmtu 1500
 1:  10.1.1.1                                              0.722ms
 1:  10.1.1.1                                              0.756ms
 2:  10.72.48.1                                           10.190ms
 3:  100.120.104.120                                      12.770ms
 4:  100.120.104.16                                       10.877ms asymm  3
 5:  no reply
 6:  lax-b23-link.ip.twelve99.net                         12.591ms asymm  5
 7:  rest-bb1-link.ip.twelve99.net                        66.648ms asymm  8
 8:  nyk-bb1-link.ip.twelve99.net                         80.504ms asymm  9
 9:  ldn-bb2-link.ip.twelve99.net                        144.356ms asymm 10
10:  slou-b2-link.ip.twelve99.net                        145.945ms asymm  9
11:  digitalocean-ic-378007.ip.twelve99-cust.net         148.335ms asymm 10
12:  143.244.224.146                                     151.305ms asymm 11
13:  no reply
14:  no reply
15:  no reply
16:  no reply
17:  asu-01.infra.openwrt.org                            163.032ms reached
     Resume: pmtu 1500 hops 17 back 16

efahl avatar Jul 30 '24 18:07 efahl

My assumption is that the CI that actually creates new Docker containers is overloaded, since it's competing with openwrt.git CI jobs for available runners. I've moved the creation of Docker containers to a self-hosted runner, hoping this works better.

https://github.com/openwrt/docker/commit/d629f7ea1a1007125f89a23b22342f04130bfd2c

Let's wait a day or two and see where things are.

aparcar avatar Jul 30 '24 22:07 aparcar

Also, I've just reworked owut substantially in this regard:

  1. It's now grabbing version_code from upstream (I finally implemented what you told me in https://github.com/efahl/owut/issues/2#issuecomment-2165021615).
  2. Added a new --rev-code option so if things fail with the dreaded incorrect version, users can paste in the "expected" version and re-run, which should fix things.
  3. If that fails and they are desperate, there's a --rev-code none feature that just turns off the version_code checking and they get whatever they get.

I've been testing it all afternoon, and it seems to behave better on my synthetic and real builds, so fingers crossed...

efahl avatar Jul 30 '24 23:07 efahl