run.py: always log the sandbox when a command fails
The main focus of commit https://github.com/systemd/mkosi/commit/7e7a3c71696d66d1774abff3eff80d5032612c88 ("Don't log sandbox for every
command") was to reduce verbosity in the --debug case and it did just
that. To avoid losing the sandbox completely, that commit said "... but
still log the full sandbox if a command fails" and it did that too in
__init__.py; for both the --debug and non-debug cases.
However, what happened in run.py was a bit different: there, commit
https://github.com/systemd/mkosi/commit/7e7a3c71696d66d1774abff3eff80d5032612c88 showed the sandox on failure but only in the --debug
case!? Make run.py consistent with __init__.py and always show the
sandbox on failure in run.py, in both --debug and non-debug case.
This helps with issue https://github.com/systemd/mkosi/issues/3948: hiding that bind mounts exist is especially confusing on fatal "disk full" failures because that leaves only misleading paths in the error message! On failure, showing such critical sandbox information must not require --debug.
Not concealing the sandbox on failure is useful in any case, not just https://github.com/systemd/mkosi/issues/3948
Signed-off-by: Marc Herbert [email protected]
These two QEMU timeouts seem unrelated.
https://github.com/systemd/mkosi/actions/runs/19246190672/job/55020942709?pr=3992 integration-test (centos, fedora, ubuntu-24.04)
‣ + /usr/bin/qemu-system-x86_64 -machine type=q35,smm=on,hpet=off -smp 1 -m 2048M -object rng-random,filename=/dev/urandom,id=rng0 -device virtio-rng-pci,rng=rng0,id=rng-device0 -device virtio-balloon,free-page-reporting=on -no-user-config -nic user,model=virtio-net-pci --add-fd fd=4,set=1,opaque=/dev/kvm -cpu host -accel kvm,device=/dev/fdset/1 -device vhost-vsock-pci,guest-cid=2432172234,vhostfd=5 -nographic -nodefaults -chardev stdio,mux=on,id=console,signal=off -device virtio-serial-pci,id=mkosi-virtio-serial-pci -device virtconsole,chardev=console -mon console -drive if=pflash,format=qcow2,readonly=on,file=/usr/share/edk2/ovmf/OVMF_CODE_4M.secboot.qcow2 -drive file=/tmp/mkosi-ovmf-vars-6hinzbgs,if=pflash,format=qcow2 -global ICH9-LPC.disable_s3=1 -global driver=cfi.pflash01,property=secure,value=on -device virtio-scsi-pci,id=mkosi -blockdev driver=raw,node-name=mkosi,discard=unmap,file.driver=file,file.filename=/var/cache/mkosi/fe8053c10418471b/image.raw-09200783b70c46d5,file.aio=io_uring,cache.direct=ye
Running as unit: mkosi-main.scope; invocation ID: 3851b7e8a00942c09f6ef7f932acbf04
swtpm: Warning: Profile-enabled algorithms contain disabled 'RSA-1024-sign(SHA1, pkcs1-pss)'
swtpm: Warning: Setting OPENSSL_ENABLE_SHA1_SIGNATURES=1
https://github.com/systemd/mkosi/actions/runs/19246190672/job/55020942742?pr=3992 integration-test (arch, fedora, ubuntu-24.04)
‣ + /usr/bin/qemu-system-x86_64 -machine type=q35,smm=on,hpet=off -smp 1 -m 2048M -object rng-random,filename=/dev/urandom,id=rng0 -device virtio-rng-pci,rng=rng0,id=rng-device0 -device virtio-balloon,free-page-reporting=on -no-user-config -nic user,model=virtio-net-pci --add-fd fd=4,set=1,opaque=/dev/kvm -cpu host -accel kvm,device=/dev/fdset/1 -device vhost-vsock-pci,guest-cid=667192854,vhostfd=5 -nographic -nodefaults -chardev stdio,mux=on,id=console,signal=off -device virtio-serial-pci,id=mkosi-virtio-serial-pci -device virtconsole,chardev=console -mon console -drive if=pflash,format=qcow2,readonly=on,file=/usr/share/edk2/ovmf/OVMF_CODE_4M.secboot.qcow2 -drive file=/tmp/mkosi-ovmf-vars-_iqy0jva,if=pflash,format=qcow2 -global ICH9-LPC.disable_s3=1 -global driver=cfi.pflash01,property=secure,value=on -device virtio-scsi-pci,id=mkosi -blockdev driver=raw,node-name=mkosi,discard=unmap,file.driver=file,file.filename=/var/cache/mkosi/4a7e566f971c4b8b/image.raw-ec2d4cad917c4672,file.aio=io_uring,cache.direct=yes,cache.no-flush=yes -device virtio-blk-pci,drive=mkosi,bootindex=1 -chardev socket,id=chrtpm,path=/tmp/mkosi-swtpm-map311y2/sock -tpmdev emulator,id=tpm0,chardev=chrtpm -device tpm-tis,tpmdev=tpm0 -smbios type=11,path=/tmp/mkosi-smbios-lxr8qsaw/ssh.authorized_keys.root -smbios type=11,path=/tmp/mkosi-smbios-lxr8qsaw/firstboot.locale -smbios type=11,path=/tmp/mkosi-smbios-lxr8qsaw/vmm.notify_socket -smbios type=11,path=/tmp/mkosi-smbios-lxr8qsaw/firstboot.timezone -smbios 'type=11,value=io.systemd.stub.kernel-cmdline-extra=rw systemd.wants=network.target module_blacklist=vmw_vmci systemd.tty.term.hvc0=dumb systemd.tty.columns.hvc0=80 systemd.tty.rows.hvc0=24 ip=enc0:any ip=enp0s1:any ip=enp0s2:any ip=host0:any ip=none loglevel=4 SYSTEMD_SULOGIN_FORCE=1 systemd.tty.term.console=dumb systemd.tty.columns.console=80 systemd.tty.rows.console=24 console=hvc0 TERM=dumb' -smbios 'type=11,value=io.systemd.boot.kernel-cmdline-extra=rw systemd.wants=network.target module_blacklist=vmw_vmci systemd.tty.term.hvc0=dumb systemd.tty.columns.hvc0=80 systemd.tty.rows.hvc0=24 ip=enc0:any ip=enp0s1:any ip=enp0s2:any ip=host0:any ip=none loglevel=4 SYSTEMD_SULOGIN_FORCE=1 systemd.tty.term.console=dumb systemd.tty.columns.console=80 systemd.tty.rows.console=24 console=hvc0 TERM=dumb' -cpu max,pcid=off
Running as unit: mkosi-main.scope; invocation ID: 603f17fcf26f4adaa4d38583d5eb4be0
swtpm: Warning: Profile-enabled algorithms contain disabled 'RSA-1024-sign(SHA1, pkcs1-pss)'
swtpm: Warning: Setting OPENSSL_ENABLE_SHA1_SIGNATURES=1
qemu-system-x86_64: terminating on signal 15 from pid 9979 (timeout)
qemu-system-x86_64: tpm-emulator: Could not cleanly shutdown the TPM: Invalid argument
Another 1h timeout. Only one this time. https://github.com/systemd/mkosi/actions/runs/19248409572/job/55027963769?pr=3992 integration-test (fedora, fedora, ubuntu-24.04
+ /usr/bin/qemu-system-x86_64 -machine type=q35,smm=on,hpet=off -smp 1 -m 2048M -object rng-random,filename=/dev/urandom,id=rng0 -device virtio-rng-pci,rng=rng0,id=rng-device0 -device virtio-balloon,free-page-reporting=on -no-user-config -nic user,model=virtio-net-pci --add-fd fd=4,set=1,opaque=/dev/kvm -cpu host -accel kvm,device=/dev/fdset/1 -device vhost-vsock-pci,guest-cid=199748897,vhostfd=5 -nographic -nodefaults -chardev stdio,mux=on,id=console,signal=off -device virtio-serial-pci,id=mkosi-virtio-serial-pci -device virtconsole,chardev=console -mon console -drive if=pflash,format=qcow2,readonly=on,file=/usr/share/edk2/ovmf/OVMF_CODE_4M.secboot.qcow2 -drive file=/tmp/mkosi-ovmf-vars-0dxakgzu,if=pflash,format=qcow2 -global ICH9-LPC.disable_s3=1 -global driver=cfi.pflash01,property=secure,value=on -device virtio-scsi-pci,id=mkosi -blockdev driver=raw,node-name=mkosi,discard=unmap,file.driver=file,file.filename=/var/cache/mkosi/5d82d0a0ebb541fd/image.raw-3eb04d1fb30a4ee1,file.aio=io_uring,cache.direct=yes,cache.no-flush=yes -device virtio-blk-pci,drive=mkosi,bootindex=1 -chardev socket,id=chrtpm,path=/tmp/mkosi-swtpm-lmqx23zd/sock -tpmdev emulator,id=tpm0,chardev=chrtpm -device tpm-tis,tpmdev=tpm0 -smbios type=11,path=/tmp/mkosi-smbios-cf3eooie/ssh.authorized_keys.root -smbios type=11,path=/tmp/mkosi-smbios-cf3eooie/cryptsetup.passphrase -smbios type=11,path=/tmp/mkosi-smbios-cf3eooie/firstboot.locale -smbios type=11,path=/tmp/mkosi-smbios-cf3eooie/vmm.notify_socket -smbios type=11,path=/tmp/mkosi-smbios-cf3eooie/firstboot.timezone -smbios 'type=11,value=io.systemd.stub.kernel-cmdline-extra=rw systemd.wants=network.target module_blacklist=vmw_vmci systemd.tty.term.hvc0=dumb systemd.tty.columns.hvc0=80 systemd.tty.rows.hvc0=24 ip=enc0:any ip=enp0s1:any ip=enp0s2:any ip=host0:any ip=none loglevel=4 SYSTEMD_SULOGIN_FORCE=1 systemd.tty.term.console=dumb systemd.tty.columns.console=80 systemd.tty.rows.console=24 console=hvc0 TERM=dumb' -smbios 'type=11,value=io.systemd.boot.kernel-cmdline-extra=rw systemd.wants=network.target module_blacklist=vmw_vmci systemd.tty.term.hvc0=dumb systemd.tty.columns.hvc0=80 systemd.tty.rows.hvc0=24 ip=enc0:any ip=enp0s1:any ip=enp0s2:any ip=host0:any ip=none loglevel=4 SYSTEMD_SULOGIN_FORCE=1 systemd.tty.term.console=dumb systemd.tty.columns.console=80 systemd.tty.rows.console=24 console=hvc0 TERM=dumb' -cpu max,pcid=off
Running as unit: mkosi-swtpm-main.scope; invocation ID: d43ccbfe02344be18b11af24d2e294f3
Running as unit: mkosi-main.scope; invocation ID: 2d38e54d4ffe4aacba3e683971e42b84
swtpm: Warning: Profile-enabled algorithms contain disabled 'RSA-1024-sign(SHA1, pkcs1-pss)'
swtpm: Warning: Setting OPENSSL_ENABLE_SHA1_SIGNATURES=1
qemu-system-x86_64: terminating on signal 15 from pid 12244 (timeout)
qemu-system-x86_64: tpm-emulator: Could not cleanly shutdown the TPM: Invalid argument
All green in https://github.com/systemd/mkosi/actions/runs/19249968529 on 3rd try.
#3993 was less lucky and needed many more attempts.
It's really not necessary to get CI fully green. mkosi CI is very susceptible to variations on the distro side, e.g. flaky mirrors.
So I'm of two minds of this. The sandbox commands can be really long and looking at the full command it's harder to see the needle in the haystack. For a lot of use cases seeing the actual command that failed is sufficient and showing the sandbox there would then make it harder to see what is going on.
What would you think about a more actionable error message (sandbox command omitted, rerun with --debug to see full command)?
I'd prefer if we repurpose --debug-sandbox to always show the full command and then add --trace-sandbox to cover the current behavior of --debug-sandbox which runs mkosi-sandbox under strace
It's really not necessary to get CI fully green.
Thanks, I was afraid this was a recent regression and that this PR would be lower down the list because of the test failures.
mkosi CI is very susceptible to variations on the distro side, e.g. flaky mirrors.
I saw a fair amount of timeouts (noting most of them in the comments) but I never saw a flaky mirror. It was always a QEMU timeout.
The sandbox commands can be really long and looking at the full command it's harder to see the needle in the haystack. For a lot of use cases seeing the actual command that failed is sufficient and showing the sandbox there would then make it harder to see what is going on.
That's fair, I focused on systemd-repart failures as described in #3948 but I understand the sandbox may not always matter than much.
Either way, this does print a very long, "haystack" command line. For which I thought of a dead-simple fix: simply split it across 2 lines instead! Like this:
‣ in sandbox: /bin/python3 -SI /tmp/rq_mkosi_wspaces/mkosi-workspace-90ljx_31/tmp/tmprbny4x14/mkosi/sandbox.py --proc /proc --unsetenv TMPDIR --setenv SYSTEMD_OFFLINE 0 --unshare-net --ro-bind /usr /usr --ro-bind /opt /opt --symlink usr/bin /bin --symlink usr/sbin /sbin --symlink usr/lib /lib --symlink usr/lib64 /lib64 --dir /var/tmp --dir /var/log --unshare-ipc --ro-bind /tmp/rq_mkosi_wspaces/mkosi-workspace-90ljx_31/tmp/tmprbny4x14/mkosi/sandbox.py /sandbox.py --dev /dev --ro-bind /etc/alternatives /etc/alternatives --ro-bind /etc/ld.so.cache /etc/ld.so.cache --setenv PATH /usr/bin:/usr/sbin --bind /tmp/rq_mkosi_wspaces/mkosi-workspace-90ljx_31/tmp/mkosi-var-tmp-d1b415e539de44bd /var/tmp --dir /etc --dir /var --dir /tmp --dir /run --become-root --bind /tmp/rq_mkosi_wspaces/mkosi-workspace-90ljx_31/staging /work/tmp/rq_mkosi_wspaces/mkosi-workspace-90ljx_31/staging --bind /tmp/rq_mkosi_wspaces/mkosi-workspace-90ljx_31/root /buildroot --ro-bind /tmp/rq_mkosi_wspaces/mkosi-workspace-90ljx_31/repart-definitions /work/tmp/rq_mkosi_wspaces/mkosi-workspace-90ljx_31/repart-definitions --ro-bind /etc/pki/tls/certs/ca-bundle.crt /proxy.cacert
‣ "systemd-repart --empty=allow --size=auto --dry-run=no --json=pretty --no-pager --root=/buildroot --offline=yes --seed 9457e3c9-bd35-45d8-acd2-1446627bed37 /work/tmp/rq_mkosi_wspaces/mkosi-workspace-90ljx_31/staging/root.img.raw --empty=create --defer-partitions esp,xbootldr --generate-fstab=/etc/fstab --generate-crypttab=/etc/crypttab --definitions /work/tmp/rq_mkosi_wspaces/mkosi-workspace-90ljx_31/repart-definitions" returned non-zero exit code 1.
Haystack problem solved. That data already comes separate, so why concatenate them after all?
Remember: this is still only on failure. When the some command fails, users much, much prefer to have "non-optimal" but complete information about that command. Especially when the command that failed is buried below multiple layers of wrapper scripts in a CI far away, in an environment that is unfortunately not easy to reproduce and that they can't change much. Yes this is the worst possible case but it's not theoretical either; happens regularly.
(I just realized that 0c4a8955b08966 has later added to run.py similar code and extended the 7e7a3c71696d66d inconsistency with __init__.py. Next time I push to this PR I will address both instances in run.py)
I'd prefer if we repurpose --debug-sandbox to always show the full command and then add --trace-sandbox to cover the current behavior of --debug-sandbox which runs mkosi-sandbox under strace
I'm not 100% sure how big of a change that would be but it sounds like a much bigger change than this small PR. Probably bigger than the time I have :-(
Value and effort considerations aside, I really think that sandbox information is too important and too relevant to be behind some --debug flag on failure.
Sorry but this is just too verbose to do by default. I gave it a try and it's the difference between a two line error message and an error message that takes up half my terminal when running dnf.
I'm going to close this. I'm confident this is not the right solution.