
`bootc install to-existing-root` failure tracker

Open · henrywang opened this issue 11 months ago • 12 comments

Booting an OpenStack VM in package mode and running `podman run --rm --tls-verify=false --privileged --pid=host quay.io/redhat_emp1/bootc-workflow-test:bhpq bootc install to-existing-root` fails.

Error:

fatal: [guest]: FAILED! => changed=true 
  cmd:
  - podman
  - run
  - --rm
  - --tls-verify=false
  - --privileged
  - --pid=host
  - quay.io/redhat_emp1/bootc-workflow-test:bhpq
  - bootc
  - install
  - to-existing-root
  delta: '0:01:07.093168'
  end: '2025-01-23 03:00:04.342900'
  msg: non-zero return code
  rc: 1
  start: '2025-01-23 02:58:57.249732'
  stderr: |-
    ----------------------------
    WARNING: This operation will OVERWRITE THE BOOTED HOST ROOT FILESYSTEM and is NOT REVERSIBLE.
    Waiting 20s to continue; interrupt (Control-C) to cancel.
    ----------------------------
    ERROR Installing to filesystem: Creating ostree deployment: Pulling: Importing: Parsing layer blob sha256:017dc5c1ff3b66e4764e3e88f212c903ed7ef26a19454c358ac8717b077b63df: error: ostree-tar: Processing deferred hardlink var/cache/dnf/rhel-appstream-c101d4db5fbc3a4f/repodata/527fa9e5c8c45b22a7bbc2821c96540817984e837c219acb5141a462a08d45f6-primary.xml.gz: Failed to find object: No such file or directory: 527fa9e5c8c45b22a7bbc2821c96540817984e837c219acb5141a462a08d45f6-primary.xml.gz: Processing tar: Failed to commit tar: ExitStatus(unix_wait_status(256))
  stderr_lines: <omitted>
  stdout: |-
    Installing image: docker://quay.io/redhat_emp1/bootc-workflow-test:bhpq
    Digest: sha256:8fb3136d5706463daaeed7557614eb46cb860877d94f76fbe900a8dcafd333eb
    Initializing ostree layout
    layers already present: 0; layers needed: 73 (755.2 MB)
  stdout_lines: <omitted>

The same test passed on AWS EC2 instances (both x86_64 and aarch64).

henrywang · Jan 23 '25

Can you link to a log file for this job with more information? Like, what are the versions of the host system, bootc, what's in the base image etc.?

var/cache/dnf/rhel-appstream-c101d4db5fbc3a4f

Looks like this image is missing a dnf clean all?

But still, this should obviously work on our side...and I don't think it could really be platform-specific; it must have something to do with how the container image is built.

Is this reproducible? Can you push the quay.io/redhat_emp1/bootc-workflow-test:bhpq image somewhere persistent?
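
For what it's worth, a quick way to check whether the built image still ships the dnf metadata — just a sketch, reusing the image tag from the report above; the img.tar/img names are made up:

# Does the final image still contain the dnf cache at all?
podman run --rm quay.io/redhat_emp1/bootc-workflow-test:bhpq ls /var/cache/dnf

# And did any intermediate layer ship it? Export the image and scan the layer blobs:
podman save --format oci-archive -o img.tar quay.io/redhat_emp1/bootc-workflow-test:bhpq
mkdir img && tar -xf img.tar -C img
for blob in img/blobs/sha256/*; do
    tar -tf "$blob" 2>/dev/null | grep -m1 'var/cache/dnf' && echo "-> found in layer $blob"
done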

cgwalters · Jan 23 '25

Yeah, I was working on this issue yesterday and tried different platforms to see what differs between them. I think I need to collect more information for debugging.

All of those tests run on the same machine (a Fedora 41 VM), and the test comes from the https://gitlab.com/fedora/bootc/tests/bootc-workflow-test/-/blob/main/os-replace.sh?ref_type=heads script.

The base image is registry.stage.redhat.io/rhel10/rhel-bootc:10.0 with bootc 1.1.4. registry.stage.redhat.io/rhel9/rhel-bootc:9.6, also with bootc 1.1.4, has the same issue.

NOTE: quay.io/fedora/fedora-bootc:42 with bootc 1.1.4 does not have this issue on Azure.

The test workflow is: deploy a RHEL 10 (package mode) VM -> run bootc install.

  1. AWS: Passed
FROM registry.stage.redhat.io/rhel10/rhel-bootc:10.0
COPY rhel.repo /etc/yum.repos.d/rhel.repo
RUN dnf install -y rhc

RUN dnf -y install cloud-init && \
    ln -s ../cloud-init.target /usr/lib/systemd/system/default.target.wants && \
    rm -rf /var/{cache,log} /var/lib/{dnf,rhsm}
COPY usr/ /usr/

RUN dnf -y clean all
COPY auth.json /etc/ostree/auth.json
RUN sed -i "s/name: cloud-user/name: ec2-user/g" /etc/cloud/cloud.cfg
    Filesystem     Type      Size  Used Avail Use% Mounted on
    /dev/xvda3     xfs        20G  1.8G   19G   9% /
    devtmpfs       devtmpfs  4.0M     0  4.0M   0% /dev
    tmpfs          tmpfs     1.8G     0  1.8G   0% /dev/shm
    tmpfs          tmpfs     731M  8.6M  722M   2% /run
    tmpfs          tmpfs     1.0M     0  1.0M   0% /run/credentials/systemd-journald.service
    /dev/xvda2     vfat      200M  8.4M  192M   5% /boot/efi
    tmpfs          tmpfs     366M  4.0K  366M   1% /run/user/1000
    tmpfs          tmpfs     1.0M     0  1.0M   0% /run/credentials/[email protected]
    tmpfs          tmpfs     1.0M     0  1.0M   0% /run/credentials/[email protected]
changed: [guest] => changed=true 
  cmd:
  - podman
  - run
  - --rm
  - --tls-verify=false
  - --privileged
  - --pid=host
  - quay.io/redhat_emp1/bootc-workflow-test:k71w
  - bootc
  - install
  - to-existing-root
  delta: '0:01:30.825373'
  end: '2025-01-24 03:15:41.846088'
  msg: ''
  rc: 0
  start: '2025-01-24 03:14:11.020715'
  stderr: |-
    ----------------------------
    WARNING: This operation will OVERWRITE THE BOOTED HOST ROOT FILESYSTEM and is NOT REVERSIBLE.
    Waiting 20s to continue; interrupt (Control-C) to cancel.
    ----------------------------
  stderr_lines: <omitted>
  stdout: |-
    Installing image: docker://quay.io/redhat_emp1/bootc-workflow-test:k71w
    Digest: sha256:3a1132c05390a5a334c04b7353ec0f8135ca3a0824e320e3ab157de027871c32
    Initializing ostree layout
    layers already present: 0; layers needed: 74 (772.0 MB)
    Deploying container image...done (14 seconds)
    Running bootupctl to install bootloader
    > bootupctl backend install --write-uuid --update-firmware --auto --device /dev/xvda /target
    Installed: grub.cfg
    Installation complete!
  2. Azure: Failed
FROM registry.stage.redhat.io/rhel10/rhel-bootc:10.0
COPY rhel.repo /etc/yum.repos.d/rhel.repo
RUN dnf install -y rhc
COPY etc/ /etc/

# install required packages and enable services
RUN dnf -y install \
        WALinuxAgent \
        cloud-init \
        cloud-utils-growpart \
        hyperv-daemons && \
    dnf clean all && \
    systemctl enable NetworkManager.service && \
    systemctl enable waagent.service && \
    systemctl enable cloud-init.service && \
    echo 'ClientAliveInterval 180' >> /etc/ssh/sshd_config

# configure waagent for cloud-init to handle provisioning
RUN sed -i 's/Provisioning.Agent=auto/Provisioning.Agent=cloud-init/g' /etc/waagent.conf && \
    sed -i 's/ResourceDisk.Format=y/ResourceDisk.Format=n/g' /etc/waagent.conf && \
    sed -i 's/ResourceDisk.EnableSwap=y/ResourceDisk.EnableSwap=n/g' /etc/waagent.conf
RUN dnf -y clean all
COPY auth.json /etc/ostree/auth.json
    Filesystem     Type      Size  Used Avail Use% Mounted on
    /dev/sda3      xfs        20G  2.1G   18G  11% /
    devtmpfs       devtmpfs  4.0M     0  4.0M   0% /dev
    tmpfs          tmpfs     3.8G     0  3.8G   0% /dev/shm
    efivarfs       efivarfs  128M  9.9K  128M   1% /sys/firmware/efi/efivars
    tmpfs          tmpfs     1.5G   17M  1.5G   2% /run
    tmpfs          tmpfs     1.0M     0  1.0M   0% /run/credentials/systemd-journald.service
    /dev/sda2      vfat      200M  8.4M  192M   5% /boot/efi
    /dev/sdb1      ext4       74G   24K   70G   1% /mnt
    tmpfs          tmpfs     768M  4.0K  768M   1% /run/user/1000
    tmpfs          tmpfs     1.0M     0  1.0M   0% /run/credentials/[email protected]
    tmpfs          tmpfs     1.0M     0  1.0M   0% /run/credentials/[email protected]
fatal: [guest]: FAILED! => changed=true 
  cmd:
  - podman
  - run
  - --rm
  - --tls-verify=false
  - --privileged
  - --pid=host
  - quay.io/redhat_emp1/bootc-workflow-test:j75a
  - bootc
  - install
  - to-existing-root
  delta: '0:00:56.992913'
  end: '2025-01-24 03:07:39.014482'
  msg: non-zero return code
  rc: 1
  start: '2025-01-24 03:06:42.021569'
  stderr: |-
    ----------------------------
    WARNING: This operation will OVERWRITE THE BOOTED HOST ROOT FILESYSTEM and is NOT REVERSIBLE.
    Waiting 20s to continue; interrupt (Control-C) to cancel.
    ----------------------------
    ERROR Installing to filesystem: Creating ostree deployment: Pulling: Importing: Parsing layer blob sha256:8a6a121be27996f4b6f746e353e1dd34cd40b315c0e3b81e6b874fc97fa03054: error: ostree-tar: Processing deferred hardlink var/cache/dnf/rhel-appstream-c101d4db5fbc3a4f/repodata/527fa9e5c8c45b22a7bbc2821c96540817984e837c219acb5141a462a08d45f6-primary.xml.gz: Failed to find object: No such file or directory: 527fa9e5c8c45b22a7bbc2821c96540817984e837c219acb5141a462a08d45f6-primary.xml.gz: Processing tar: Failed to commit tar: ExitStatus(unix_wait_status(256))
  stderr_lines: <omitted>
  stdout: |-
    Installing image: docker://quay.io/redhat_emp1/bootc-workflow-test:j75a
    Digest: sha256:ecaeebb45182c17021d182d08a1881d81ae1fd65a5d07ed9a0ee6087fef7d9d7
    Initializing ostree layout
    layers already present: 0; layers needed: 74 (774.6 MB)
  3. OpenStack: Failed
FROM registry.stage.redhat.io/rhel10/rhel-bootc:10.0
COPY rhel.repo /etc/yum.repos.d/rhel.repo
RUN dnf install -y rhc
# Enable passwordless sudo for users in the wheel group
COPY wheel-nopasswd /etc/sudoers.d
ARG sshpubkey
# We don't yet ship a one-invocation CLI command to add a user with a SSH key unfortunately
RUN if test -z "$sshpubkey"; then echo "must provide sshpubkey"; exit 1; fi; \
    useradd -G wheel cloud-user && \
    mkdir -m 0700 -p /home/cloud-user/.ssh && \
    echo $sshpubkey > /home/cloud-user/.ssh/authorized_keys && \
    chmod 0600 /home/cloud-user/.ssh/authorized_keys && \
    chown -R cloud-user: /home/cloud-user
RUN dnf -y clean all
COPY auth.json /etc/ostree/auth.json
    Filesystem     Type      Size  Used Avail Use% Mounted on
    /dev/vda3      xfs        30G  2.1G   28G   8% /
    devtmpfs       devtmpfs  4.0M     0  4.0M   0% /dev
    tmpfs          tmpfs     885M     0  885M   0% /dev/shm
    tmpfs          tmpfs     354M  5.2M  349M   2% /run
    tmpfs          tmpfs     1.0M     0  1.0M   0% /run/credentials/systemd-journald.service
    /dev/vda2      vfat      200M  8.4M  192M   5% /boot/efi
    tmpfs          tmpfs     177M  4.0K  177M   1% /run/user/1000
    tmpfs          tmpfs     1.0M     0  1.0M   0% /run/credentials/[email protected]
    tmpfs          tmpfs     1.0M     0  1.0M   0% /run/credentials/[email protected]
fatal: [guest]: FAILED! => changed=true 
  cmd:
  - podman
  - run
  - --rm
  - --tls-verify=false
  - --privileged
  - --pid=host
  - quay.io/redhat_emp1/bootc-workflow-test:6sl3
  - bootc
  - install
  - to-existing-root
  delta: '0:01:19.804500'
  end: '2025-01-23 23:05:34.267542'
  msg: non-zero return code
  rc: 1
  start: '2025-01-23 23:04:14.463042'
  stderr: |-
    ----------------------------
    WARNING: This operation will OVERWRITE THE BOOTED HOST ROOT FILESYSTEM and is NOT REVERSIBLE.
    Waiting 20s to continue; interrupt (Control-C) to cancel.
    ----------------------------
    ERROR Installing to filesystem: Creating ostree deployment: Pulling: Importing: Parsing layer blob sha256:51bc788965574e1789dc733a2f5a5034a71886aad34928edec57c80ea46fac2f: error: ostree-tar: Processing deferred hardlink var/cache/dnf/rhel-appstream-c101d4db5fbc3a4f/repodata/527fa9e5c8c45b22a7bbc2821c96540817984e837c219acb5141a462a08d45f6-primary.xml.gz: Failed to find object: No such file or directory: 527fa9e5c8c45b22a7bbc2821c96540817984e837c219acb5141a462a08d45f6-primary.xml.gz: Processing tar: Failed to commit tar: ExitStatus(unix_wait_status(256))
  stderr_lines: <omitted>
  stdout: |-
    Installing image: docker://quay.io/redhat_emp1/bootc-workflow-test:6sl3
    Digest: sha256:8438cf4f83d77a92719b98adca8bd842b72389ad3908f5eee91aa347ca538808
    Initializing ostree layout
    layers already present: 0; layers needed: 73 (755.2 MB)

henrywang · Jan 24 '25

FROM registry.stage.redhat.io/rhel10/rhel-bootc:10.0
COPY rhel.repo /etc/yum.repos.d/rhel.repo
RUN dnf install -y rhc

RUN dnf -y install cloud-init && \
    ln -s ../cloud-init.target /usr/lib/systemd/system/default.target.wants && \
    rm -rf /var/{cache,log} /var/lib/{dnf,rhsm}
COPY usr/ /usr/

RUN dnf -y clean all
COPY auth.json /etc/ostree/auth.json
RUN sed -i "s/name: cloud-user/name: ec2-user/g" /etc/cloud/cloud.cfg

Note that unless you're using --squash for this build, the first RUN dnf install -y rhc is going to leak all of the dnf caches into that layer. The later RUN dnf -y clean all only removes them from the top layer - they still get shipped in the intermediate layers.
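
(As an aside — not necessarily the fix we want here — a squashed build collapses the new layers into one, so the final dnf -y clean all actually removes the cache from what ships; the tag below is made up for illustration:)

# --squash collapses the layers added by this build into a single layer;
# --squash-all would also fold in the base image layers.
podman build --squash -t quay.io/example/bootc-test:aws -f Containerfile .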

We should definitely track down this bug, because what we're doing here should work. But this will look cleaner using heredocs, and it may work around the issue:

FROM registry.stage.redhat.io/rhel10/rhel-bootc:10.0
COPY rhel.repo /etc/yum.repos.d/rhel.repo
COPY usr/ /usr/
COPY auth.json /etc/ostree/auth.json
RUN <<EORUN
set -xeuo pipefail
dnf install -y rhc
dnf -y install cloud-init
ln -s ../cloud-init.target /usr/lib/systemd/system/default.target.wants
sed -i "s/name: cloud-user/name: ec2-user/g" /etc/cloud/cloud.cfg

dnf -y clean all
rm -rf /var/{cache,log} /var/lib/{dnf,rhsm}
EORUN

I know we should be updating some of our examples to use heredocs. One thing that has bitten me is that the default podman in GitHub Actions is too old for them, which is super annoying (ref https://github.com/containers/podman/discussions/17362).

cgwalters · Jan 24 '25

Anyways OK I couldn't reproduce this in a quick test...have you reproduced this in an interactive run?

Oh hmm...I notice we may have qemu emulation going on in some builds? That might be related.

Note also that this issue should be independent of the host version: because we're using podman run <image> bootc install, all of the relevant code is the ostree/bootc code inside the target image.

That said, this type of failure is also likely to occur when doing e.g. a bootc switch to that target image.
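
Either way, one easy sanity check along those lines (just a sketch, reusing the image tag from the failing run) is to ask the image itself which bootc it carries, since that binary is what actually performs the install:

# Version of the bootc binary that will run during the install
podman run --rm quay.io/redhat_emp1/bootc-workflow-test:bhpq bootc --version
# Or query the package database inside the image
podman run --rm quay.io/redhat_emp1/bootc-workflow-test:bhpq rpm -q bootc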

cgwalters · Jan 24 '25

Right. sed -i "s/dnf clean all/dnf clean all \&\& rm -rf \/var\/{cache,log} \/var\/lib\/{dnf,rhsm}/g" "$INSTALL_CONTAINERFILE" fixed this issue. But persistent logging does not work in this case.
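
If the logging regression is simply because /var/log itself gets deleted (a guess, not verified), a variant that cleans the caches but keeps the /var/log directory tree might look like this in the Containerfile:

# Sketch only: clean caches, but keep /var/log so persistent logging still has
# its directories; only delete files inside it.
RUN dnf -y clean all && \
    rm -rf /var/cache /var/lib/{dnf,rhsm} && \
    find /var/log -type f -delete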

henrywang · Jan 25 '25

CS10, bootc 1.1.3 on libvirt hits this error: ERROR Installing to filesystem: Creating ostree deployment: Pulling: Creating importer: failed to invoke method OpenImage: failed to invoke method OpenImage: 'overlay' is not supported over overlayfs, a mount_program is required: backing file system is unsupported for this graph driver

Log: https://artifacts.osci.redhat.com/testing-farm/ce9e7a5b-9a74-4c2d-a090-539ee208b936/

RHEL 9.6, bootc 1.1.4 on all platforms hits this error: ERROR Installing to filesystem: Creating ostree deployment: Pulling: Creating importer: failed to invoke method OpenImage: failed to invoke method OpenImage: reference "[overlay@/var/lib/containers/storage+/run/containers/storage:overlay.mountopt=nodev,metacopy=on]quay.io/redhat_emp1/hidden:23tl@sha256:b9110b81b62013e65b36927db140f45d71da0bd49bdb2d2d0ce95b2f09749ce4" does not resolve to an image ID: identifier is not an image

Log: https://artifacts.osci.redhat.com/testing-farm/06574185-504b-43d7-a8b3-d65ce35d582e/

henrywang · Jan 27 '25

The fedora-bootc:41 and :42 tests passed. The centos-bootc:stream9 test also passed.

henrywang · Jan 27 '25

Installing to filesystem: Creating ostree deployment: Pulling: Creating importer: failed to invoke method OpenImage: failed to invoke method OpenImage: 'overlay' is not supported over overlayfs, a mount_program is required: backing file system is unsupported for this graph driver

That's...bizarre. How could it only be broken in that way on c10s but not other streams? I have no idea what's going on there.

RHEL 9.6, bootc 1.1.4 on all platforms hits this error: ERROR Installing to filesystem: Creating ostree deployment: Pulling: Creating importer: failed to invoke method OpenImage: failed to invoke method OpenImage: reference "[overlay@/var/lib/containers/storage+/run/containers/storage:overlay.mountopt=nodev,metacopy=on]quay.io/redhat_emp1/hidden:23tl@sha256:b9110b81b62013e65b36927db140f45d71da0bd49bdb2d2d0ce95b2f09749ce4" does not resolve to an image ID: identifier is not an image

If stream9 works but 9.6 is failing, then in theory there is some skew between two things that should otherwise be the same, so we'll need to chase this. I know others have hit this, but I have no idea why this specific bit could again fail in just one stream and not the others.
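
One way to look for that skew would be to compare the relevant package versions inside the two images — a sketch; the stream9 reference is the public image and the 9.6 reference is the one from earlier in this thread:

for img in quay.io/centos-bootc/centos-bootc:stream9 \
           registry.stage.redhat.io/rhel9/rhel-bootc:9.6; do
    echo "== $img"
    # Compare the container/ostree stack shipped in each image
    podman run --rm "$img" rpm -q bootc ostree skopeo podman
done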

cgwalters · Jan 27 '25

The following error happened twice today in the AWS bootc install to-existing-root test; the third time it passed. ERROR: Installing to filesystem: Creating ostree deployment: Cannot redeploy over extant stateroot default

  1. https://artifacts.osci.redhat.com/testing-farm/e65ec608-9e70-41f2-947c-220128a970a7/
  2. https://artifacts.osci.redhat.com/testing-farm/8de21d8b-5933-4bf3-b500-308e9e35d7f3/

henrywang · Feb 15 '25

The following scenarios fail only with podman run --rm --tls-verify=false --privileged --pid=host 192.168.100.1:5000/hidden:kayh bootc install to-existing-root but pass with podman run --rm --tls-verify=false --privileged --pid=host -v /:/target -v /dev:/dev -v /var/lib/containers:/var/lib/containers --security-opt label=type:unconfined_t quay.io/redhat_emp1/hidden:mrmd bootc install to-existing-root

  • rhel-bootc:9.6 image : https://artifacts.osci.redhat.com/testing-farm/61bf5c6f-cea8-47e6-82c0-95080392370c/
  • centos-bootc:stream9 image: https://artifacts.osci.redhat.com/testing-farm/b53db925-7749-45a3-9b94-6a34da5a5b2e/

Error log: ERROR: Installing to filesystem: Creating ostree deployment: Pulling: Creating importer: failed to invoke method OpenImage: failed to invoke method OpenImage: reference "[overlay@/var/lib/containers/storage+/run/containers/storage:overlay.mountopt=nodev,metacopy=on]quay.io/bootc-test/hidden:q63d@sha256:cc647a9b755f20211a5023654aeafddea5574b1d1a5771134dfb268f38d12d5e" does not resolve to an image ID: identifier is not an image

Note: rhel-bootc:10.0, centos-bootc:stream10, and fedora-bootc:41/42/43 passed the podman run --rm --tls-verify=false --privileged --pid=host 192.168.100.1:5000/hidden:kayh bootc install to-existing-root test. ref: https://gitlab.com/fedora/bootc/tests/bootc-workflow-test/-/jobs/9148630303

henrywang · Feb 15 '25

The following error happened twice today in the AWS bootc install to-existing-root test; the third time it passed. ERROR: Installing to filesystem: Creating ostree deployment: Cannot redeploy over extant stateroot default

This will happen if the install is retried after a failure, which we don't currently support (but should!). If it's not a retry scenario, then it will need debugging.
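
For the non-retry case, one quick thing to check on the target host before digging further — just a sketch — is whether a previous (possibly failed) install already created the stateroot:

# ostree stateroots live under /ostree/deploy/<name>; on a pristine package-mode
# host this directory should not exist yet, so anything here points at an earlier
# install attempt.
ls -l /ostree/deploy/ 2>/dev/null || echo "no ostree stateroots present"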

cgwalters · Feb 24 '25

The following scenarios fail only with

OK, so our automatic bind mounts aren't working here...but are we sure we have an updated bootc in the images?

cgwalters · Feb 24 '25