Bytecode timed out (60s)
I'm getting "Bytecode timed out (60s)" when running in CI on QEMU-emulated arm64. This setup is, as expected, much slower than usual, but I couldn't find a way to change the timeout, as it seems to be hard-coded:
https://github.com/astral-sh/uv/blob/7551097a170e02093997b1cdaff1dd86fc30c27a/crates/uv-installer/src/compile.rs#L22
If this were configurable, it could be overridden in environments where bytecode compilation may take a long time (it can be a combination of many modules being installed and a slow system).
I'm currently using uv==0.2.36.
The actual log:
#9 405.5 <jemalloc>: MADV_DONTNEED does not work (memset will be used instead)
#9 405.5 <jemalloc>: (This is the expected behaviour if you are running under QEMU)
#9 406.8 Resolved 2 packages in 442ms
#9 466.2 Prepared 2 packages in 59.41s
#9 466.2 Installed 2 packages in 5ms
#9 526.4 error: Failed to bytecode-compile Python file in: app/venv/lib/python3.12/site-packages
#9 526.4 Caused by: Bytecode timed out (60s)
Full CI log is here: https://github.com/WeblateOrg/docker/actions/runs/10402861578/job/28808187107
Interesting... I think that's a timeout we set per file, so it's intended to catch cases in which Python hangs but doesn't give us any indicator. Do you think it's plausible that a file is taking > 60 seconds under QEMU? We can of course make it configurable.
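For reference, the mechanism is roughly the following (a simplified Rust sketch, not the actual implementation in compile.rs; compile_one here is a stand-in for the real round trip to a Python worker):

// Simplified sketch of a per-file timeout, not uv's actual code: the request
// to a worker and the wait for its reply are wrapped in a 60-second timeout,
// so a hung Python process can't stall the install forever.
use std::time::Duration;
use tokio::time::timeout;

const COMPILE_TIMEOUT: Duration = Duration::from_secs(60);

// Stand-in for "ask the Python worker to compile one file and await its reply".
async fn compile_one(path: &str) -> std::io::Result<()> {
    let _ = path;
    Ok(())
}

async fn compile_with_timeout(path: &str) -> Result<(), String> {
    match timeout(COMPILE_TIMEOUT, compile_one(path)).await {
        Ok(Ok(())) => Ok(()),
        Ok(Err(err)) => Err(format!("worker failed on {path}: {err}")),
        // No reply within 60 seconds: this is what surfaces to the user as
        // "Bytecode timed out (60s)".
        Err(_elapsed) => Err(format!("Bytecode timed out ({}s)", COMPILE_TIMEOUT.as_secs())),
    }
}

#[tokio::main]
async fn main() {
    if let Err(err) = compile_with_timeout("site-packages/example.py").await {
        eprintln!("error: {err}");
    }
}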
Looking again at the log, it happens when installing cffi only, so there shouldn't be that many files to compile. Is there a way to log verbosely what is going on there (besides --verbose, which I've tried)?
I'll have to ask @konstin when they're back from vacation.
There's also RUST_LOG=trace, but I'm not sure you'll get much more helpful information.
Hmm, so far the issue hasn't happened with verbose logging; I will keep trying. On the other hand, sometimes the job is terminated after uv has been doing something for six hours, most likely during bytecode compilation as well. Unfortunately it doesn't seem reproducible; it only happens sometimes, which makes it harder to debug.
Here is the end of verbose log, which timed out after 6 hours:
#9 423.7 DEBUG Finished building: pycparser==2.22
#9 462.3 DEBUG Finished building: cffi==1.17.0
#9 462.3 Prepared 2 packages in 58.82s
#9 462.3 Installed 2 packages in 6ms
#9 462.3 DEBUG Starting 4 bytecode compilation workers
#9 463.7 DEBUG Bytecode compilation worker exiting: Ok(())
#9 463.8 DEBUG Bytecode compilation worker exiting: Ok(())
No additional output for the rest of nearly 6 hours.
The command line executed here was:
uv pip install \
--no-cache-dir \
--compile-bytecode \
--no-binary :all: \
cffi==1.17.0
Executed inside docker build on QEMU-emulated arm64.
Full CI log here: https://github.com/WeblateOrg/docker/actions/runs/10411635489/job/28835914859
I tried but couldn't reproduce this locally. It looks like a non-deterministic failure, and unfortunately I don't have any good idea where this could be happening or how QEMU plays into it.
QEMU just makes everything slow. It might be a race condition somewhere. The behavior ends up being random: sometimes it just works, sometimes it fails with Bytecode timed out (60s), and sometimes it hangs until GitHub kills it after a few hours. I have enabled debug logging, but it really doesn't bring any useful info here:
1360.2 Prepared 199 packages in 14m 37s
1361.3 Installed 204 packages in 1.13s
1361.3 DEBUG Starting 4 bytecode compilation workers
1433.0 DEBUG Bytecode compilation worker exiting: Ok(())
1433.0 DEBUG Bytecode compilation worker exiting: Ok(())
1433.0 DEBUG Bytecode compilation worker exiting: Ok(())
1433.0 DEBUG Released lock at `/app/venv/.lock`
Failed to bytecode-compile Python file in: /app/venv/lib/python3.12/site-packages
1434.5 Caused by: Bytecode timed out (60s)
I've also tried adding RUST_LOG=trace, but it doesn't seem to add anything useful to the logs. Could the debug logs be more detailed during bytecode compilation, so that it is easier to see where the problem actually lies?
I think the problematic part is starting up the processes. GitHub runners definitely share CPU cores, and slowing things down further with QEMU makes it quite likely that Python takes a long time to start. So the code probably ends up here in some cases:
https://github.com/astral-sh/uv/blob/ccdf2d793bbc2401c891b799772f615a28607e79/crates/uv-installer/src/compile.rs#L308-L310
My Rust knowledge is zero, so I don't really understand how this situation is handled in the rest of the code.
Anyway, I've written https://github.com/astral-sh/uv/pull/6958 to separate the timeout exceptions so that it is clear whether the issue is in Python startup or in bytecode compilation.
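Conceptually, the split in that PR looks something like this (a sketch with made-up types, not the PR's actual code):

use std::time::Duration;

// Hypothetical error type: the point is only that startup and per-file
// compilation get distinct variants, so the message says which phase stalled.
#[derive(Debug)]
enum CompileError {
    StartupTimeout(Duration), // the Python worker never became ready
    CompileTimeout(Duration), // a single file took too long to compile
}

impl std::fmt::Display for CompileError {
    fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
        match self {
            CompileError::StartupTimeout(d) => {
                write!(f, "Python startup timed out ({}s)", d.as_secs())
            }
            CompileError::CompileTimeout(d) => {
                write!(f, "Bytecode timed out ({}s)", d.as_secs())
            }
        }
    }
}

fn main() {
    // Illustrative values only.
    println!("{}", CompileError::StartupTimeout(Duration::from_secs(60)));
    println!("{}", CompileError::CompileTimeout(Duration::from_secs(60)));
}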
Just got
#17 12.98 Installed 214 packages in 1.98s
#17 712.3 error: Failed to bytecode-compile Python file in: /opt/app-root/lib/python3.11/site-packages
#17 712.3 Caused by: Bytecode timed out (60s)
Also doing a QEMU stunt, building for arm64 on amd64.
Interesting... I think that's a timeout we set per file, so it's intended to catch cases in which Python hangs but doesn't give us any indicator.
Interesting.
I think things were making good progress, despite being slow. Any way we can tune that timeout constant?
This is uv 0.4.20.
I originally asked for a configurable timeout as well, but now I doubt that increasing the timeout will do any good.
As mentioned before, sometimes the bytecode compilation just times out for me on GitHub after 6 hours, while normally it takes 90 seconds. I'm sure the VM used to run the action is not suddenly several orders of magnitude slower (all other steps take comparable time). So there has to be something wrong in the communication between the processes, which is exposed only occasionally in a slow environment. Sometimes the parent process detects it and fails with a timeout; sometimes it doesn't and hangs.
Getting this error as well on Bitbucket Pipelines in Docker. We are not doing anything with QEMU. I'm currently trying to increase the size of the runner to see if that is the solution. Locally, the build passes without issue.
We made some updates in our lockfile recently (and I think also moved from 3.12.6 to 3.12.7), so I wonder if there is a specific lib that we updated that triggers this behaviour. I'll try to bisect and see which lib could be the issue.
OK, so I can confirm nothing changed in our code (a pipeline that passed yesterday is failing today). There was maybe an update in the underlying Docker image. I've noticed the version was not pinned, so I'm trying to see if that helps with the reproduction.
Maybe a naive question, but could this be memory related? I've had this issue appear, and then it went away when I increased the memory limit of the runner to 2GiB, although that's of course no proof that more memory is the solution.
The GitHub Actions runners where I observed this issue have 16 GB. But I also haven't seen the issue for a while.
PS: Apparently I should not have written that; it is now back: https://github.com/WeblateOrg/docker/actions/runs/12100517537/job/33739218255?pr=2848
Maybe a naive question, but could this be memory related? I've had this issue appear, and then it went away when I increased the memory limit of the runner to 2GiB, although that's of course no proof that more memory is the solution.
If compiling bytecode uses enough memory that the machine starts to thrash swap, then yes, this is possible.
One thing that I would imagine helping is being able to set the number of bytecode workers to 1, similar to concurrent builds and concurrent downloads, but I don't see that anywhere in the documentation(?).
How many threads do you have? We currently spawn available_parallelism-many workers.
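For context, that count comes from the standard parallelism detection in Rust; a minimal sketch of the sizing logic (not uv's actual code):

use std::thread;

fn main() {
    // available_parallelism() reports how many CPUs the process may use
    // (on Linux it takes cgroup quotas into account); fall back to 1 if
    // detection fails.
    let workers = thread::available_parallelism().map(|n| n.get()).unwrap_or(1);
    println!("DEBUG Starting {workers} bytecode compilation workers");
}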
I also had this issue while running a pipeline with docker containers where the first step involved installing libraries with uv. Increasing the amount of memory available to the docker containers solved it for me.
I am occasionally encountering the same issue while emulating aarch64, which is annoying as I need to restart my CI/CD pipeline.
It would be nice to be able to configure the number of workers and timeout for bytecode compilation, so users can play around to try and get something more robust working on their hardware. :)
@adgilbert @0phoff How much memory did you have before and after, and how many cores/threads does the machine have?
To add some context: on my machine, I can bytecode-compile a project with 4437 files in 9.15s on a single efficiency core (taskset -c), i.e. an average of 2ms per file. If we're hitting the 60s timeout, that's 30000x slower than that average, so I suspect there's more going on than just a too-low timeout.
I've created a branch that logs memory statistics on timeout: https://github.com/astral-sh/uv/pull/10673. If you could try triggering the timeout with a uv build from that branch, it would be much appreciated! I've prepared a docker image at ghcr.io/konstin/uv:konsti-gh-6105
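For the curious, the extra logging is roughly along these lines: a sketch that dumps /proc/meminfo when the timeout fires (the branch may log different statistics):

use std::fs;

// Sketch only: print a few memory-pressure indicators, e.g. when a
// bytecode-compilation timeout fires, to see whether the machine is swapping.
fn log_memory_statistics() {
    match fs::read_to_string("/proc/meminfo") {
        Ok(meminfo) => {
            for line in meminfo
                .lines()
                .filter(|l| l.starts_with("Mem") || l.starts_with("Swap"))
            {
                eprintln!("DEBUG {line}");
            }
        }
        Err(err) => eprintln!("DEBUG could not read /proc/meminfo: {err}"),
    }
}

fn main() {
    log_memory_statistics();
}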
On my machine, I can bytecode-compile a project with 4437 files in 9.15s on a single efficiency core (taskset -c), i.e. an average of 2ms per file. If we're hitting the 60s timeout, that's 30000x slower than that average, so I suspect there's more going on than just a too-low timeout.
From a statistics perspective, I don't think the average is a useful statistic here; I assume it only takes one file to exceed 60 seconds?
There will likely be some distribution of file timings that the average won't reveal, and the average will be skewed by empty Python files (e.g. __init__.py), very small Python files, and files which happen to be warm in the OS cache.
Do you know what the maximum time was? I would probably multiply that by 100x to see whether 60 seconds is a reasonable number for people to be hitting, as some people are probably running on hardware up to two orders of magnitude slower (CPU, disk latency, disk speed, and of course random security tools) than yours.
I am occasionally encountering the same issue while emulating aarch64, which is annoying as I need to restart my CI/CD pipeline.
I'm also dealing with this problem. I have a self-hosted runner, on a very powerful amd64 server, where I'm using Docker Buildx to build images for both amd64 and arm64. The build often hangs indefinitely at uv sync on the emulated arm64 side, and I can't figure out why. It's gotten worse recently.
edit: I should note that I have not seen a problem doing the inverse, i.e., building amd64 images with uv on my ARM Mac.
edit2: Another possibly relevant point is that I see uv sync either completing within a few seconds, or hanging forever.
I am hitting this now too, when building a nix closure under qemu (aarch64) on an x86_64 machine, that itself is a cloud VM, so there are at least two layers of virtualization in my case.
I can also corroborate that the timeout is not easily reproducible, sometimes "just run it again" seems to work.
Our builds went from unpredictably failing on this to almost always failing. We just switched to same-architecture runners.
We are experiencing a similar problem in our GitLab CI and locally when building ARM64 Docker images on AMD64 machines with QEMU emulation.
uv sync reaches the bytecode compilation phase and then hangs after several workers have exited successfully.
Trailing log:
#37 [linux/arm64 build 9/9] RUN --mount=type=cache,target=/root/.cache/uv,sharing=locked <<EOF (cd /project...)
#37 30.50 DEBUG No workspace root found, using project root
#37 30.50 DEBUG Calling `hatchling.build.build_wheel("/root/.cache/uv/builds-v0/.tmpa1vi9Q", {}, None)`
#37 43.10 DEBUG Finished building: my_project @ file:///project
#37 43.10 Built my_project @ file:///project
#37 43.10 DEBUG Released lock at `/root/.cache/uv/sdists-v7/path/db2c85c65f34daaf/.lock`
#37 43.15 Prepared 1 package in 25.56s
#37 43.18 Installed 1 package in 18ms
#37 43.18 DEBUG Starting 32 bytecode compilation workers
#37 47.22 DEBUG Bytecode compilation worker exiting: Ok(())
#37 47.22 DEBUG Bytecode compilation worker exiting: Ok(())
#37 47.23 DEBUG Bytecode compilation worker exiting: Ok(())
#37 47.23 DEBUG Bytecode compilation worker exiting: Ok(())
#37 47.26 DEBUG Bytecode compilation worker exiting: Ok(())
#37 47.26 DEBUG Bytecode compilation worker exiting: Ok(())
#37 47.26 DEBUG Bytecode compilation worker exiting: Ok(())
#37 47.26 DEBUG Bytecode compilation worker exiting: Ok(())
#37 47.26 DEBUG Bytecode compilation worker exiting: Ok(())
#37 47.26 DEBUG Bytecode compilation worker exiting: Ok(())
#37 47.26 DEBUG Bytecode compilation worker exiting: Ok(())
#37 47.26 DEBUG Bytecode compilation worker exiting: Ok(())
#37 47.28 DEBUG Bytecode compilation worker exiting: Ok(())
#37 47.28 DEBUG Bytecode compilation worker exiting: Ok(())
#37 47.28 DEBUG Bytecode compilation worker exiting: Ok(())
#37 47.30 DEBUG Bytecode compilation worker exiting: Ok(())
#37 47.33 DEBUG Bytecode compilation worker exiting: Ok(())
#37 47.33 DEBUG Bytecode compilation worker exiting: Ok(())
#37 47.36 DEBUG Bytecode compilation worker exiting: Ok(())
#37 47.40 DEBUG Bytecode compilation worker exiting: Ok(())
#37 47.43 DEBUG Bytecode compilation worker exiting: Ok(())
#37 47.46 DEBUG Bytecode compilation worker exiting: Ok(())
#37 47.56 DEBUG Bytecode compilation worker exiting: Ok(())
#37 47.57 DEBUG Bytecode compilation worker exiting: Ok(())
#37 47.58 DEBUG Bytecode compilation worker exiting: Ok(())
#37 47.64 DEBUG Bytecode compilation worker exiting: Ok(())
#37 47.67 DEBUG Bytecode compilation worker exiting: Ok(())
#37 48.11 DEBUG Bytecode compilation worker exiting: Ok(())
ERROR: Job failed: execution took longer than 1h0m0s seconds
I have already reproduced the problem locally, but there was no useful output in strace.
The processes were stuck in futex calls.
The debug build (ghcr.io/konstin/uv:konsti-gh-6105) did not help either.
Backtrace of a hanging child process:
(gdb) backtrace -full -no-filters
#0 0x00000000006cfca1 in __syscall6 (a6=<optimized out>, a5=<optimized out>, a4=<optimized out>, a3=<optimized out>, a2=<optimized out>, a1=<optimized out>, n=<optimized out>)
at ./arch/x86_64/syscall_arch.h:59
ret = <optimized out>
r10 = 0
r8 = 0
r9 = 139826955288576
#1 syscall (n=<optimized out>) at src/misc/syscall.c:20
ap = {{gp_offset = 48, fp_offset = 0, overflow_arg_area = 0x7f2cb7be89c0, reg_save_area = 0x7f2cb7be8980}}
a = 7609100
b = 0
c = 4294967295
d = 0
e = 0
f = <optimized out>
#2 0x0000000000671caf in qemu_event_wait ()
No symbol table info available.
#3 0x000000000067c5e0 in call_rcu_thread ()
No symbol table info available.
#4 0x0000000000671ee2 in qemu_thread_start ()
No symbol table info available.
#5 0x00000000006daadc in start (p=0x7f2cb7be8a90) at src/thread/pthread_create.c:203
args = 0x7f2cb7be8a90
state = <optimized out>
#6 0x00000000006dc0d4 in __clone () at src/thread/x86_64/clone.s:22
No locals.
Thank you for the backtrace!
Can you share what you ran to reproduce this locally?
A temporary workaround with timeouts and retries around the uv sync commands, which helped fix the failing Docker builds:
# syntax=docker/dockerfile:1
FROM ubuntu:noble AS build
# https://docs.docker.com/reference/dockerfile/#automatic-platform-args-in-the-global-scope
ARG TARGETARCH
ARG python_version=3.12
SHELL ["/bin/sh", "-exc"]
...
# https://github.com/astral-sh/uv/pkgs/container/uv
COPY --link --from=ghcr.io/astral-sh/uv:0.6 /uv /usr/local/bin/uv
# https://docs.astral.sh/uv/configuration/environment/
ENV UV_PYTHON="python$python_version" \
UV_PYTHON_DOWNLOADS=never \
UV_PROJECT_ENVIRONMENT=/app \
UV_LINK_MODE=copy \
PYTHONOPTIMIZE=1
COPY pyproject.toml uv.lock /project/
RUN --mount=type=cache,id=/root/.cache/uv-$TARGETARCH,target=/root/.cache/uv,sharing=locked <<EOF
cd /project
uv sync \
$([ "$TARGETARCH" = 'arm64' ] && echo '--verbose') \
--no-dev \
--no-install-project \
--locked
timeout 15m sh -ex <<EOT
until timeout 5m uv sync \
$([ "$TARGETARCH" = 'arm64' ] && echo '--verbose') \
--no-dev \
--no-install-project \
--compile-bytecode
do
echo "Bytecode compilation timed out"
echo "Retrying"
done
EOT
EOF
COPY VERSION /project/
COPY src/ /project/src
RUN --mount=type=cache,id=/root/.cache/uv-$TARGETARCH,target=/root/.cache/uv,sharing=locked <<EOF
cd /project
sed -Ei "s/^(version = \")0\.0\.0(\")$/\1$(cat VERSION)\2/" pyproject.toml
uv sync \
$([ "$TARGETARCH" = 'arm64' ] && echo '--verbose') \
--no-dev \
--no-editable
timeout 9m sh -ex <<EOT
until timeout 3m uv sync \
$([ "$TARGETARCH" = 'arm64' ] && echo '--verbose') \
--no-dev \
--no-editable \
--compile-bytecode
do
echo "Bytecode compilation timed out"
echo "Retrying"
done
EOT
EOF
Log with one retry:
#15 0.823 + timeout 15m sh -ex
#15 0.859 + timeout 5m uv sync --verbose --no-dev --no-install-project --compile-bytecode
#15 0.880 <jemalloc>: MADV_DONTNEED does not work (memset will be used instead)
#15 0.880 <jemalloc>: (This is the expected behaviour if you are running under QEMU)
#15 1.115 DEBUG uv 0.6.1
#15 1.120 DEBUG Found project root: `/project`
#15 1.122 DEBUG No workspace root found, using project root
#15 1.125 DEBUG Acquired lock for `/project`
#15 1.127 DEBUG Using Python request `3.12` from explicit request
#15 1.128 DEBUG Checking for Python environment at `/app`
#15 1.136 DEBUG The virtual environment's Python version satisfies `3.12`
#15 1.137 DEBUG Released lock at `/tmp/uv-b95abc02d7f2ad9b.lock`
#15 1.192 DEBUG Using request timeout of 30s
#15 1.210 DEBUG Found static `pyproject.toml` for: my_project @ file:///project
#15 1.211 DEBUG No workspace root found, using project root
#15 1.223 DEBUG Existing `uv.lock` satisfies workspace requirements
#15 1.223 Resolved 69 packages in 38ms
#15 1.228 DEBUG Omitting `my_project` from resolution due to `--no-install-project`
#15 1.239 DEBUG Using request timeout of 30s
#15 1.248 DEBUG Requirement already installed: alembic==1.14.0
...
#15 1.250 DEBUG Requirement already installed: six==1.17.0
#15 1.253 DEBUG Starting 16 bytecode compilation workers
#15 95.27 DEBUG Bytecode compilation worker exiting: Ok(())
#15 95.28 DEBUG Bytecode compilation worker exiting: Ok(())
#15 95.30 DEBUG Bytecode compilation worker exiting: Ok(())
#15 95.32 DEBUG Bytecode compilation worker exiting: Ok(())
#15 95.37 DEBUG Bytecode compilation worker exiting: Ok(())
#15 95.38 DEBUG Bytecode compilation worker exiting: Ok(())
#15 95.46 DEBUG Bytecode compilation worker exiting: Ok(())
#15 95.62 DEBUG Bytecode compilation worker exiting: Ok(())
#15 95.73 DEBUG Bytecode compilation worker exiting: Ok(())
#15 95.85 DEBUG Bytecode compilation worker exiting: Ok(())
#15 96.17 DEBUG Bytecode compilation worker exiting: Ok(())
#15 96.35 DEBUG Bytecode compilation worker exiting: Ok(())
#15 97.11 DEBUG Bytecode compilation worker exiting: Ok(())
#15 97.44 DEBUG Bytecode compilation worker exiting: Ok(())
#15 98.93 DEBUG Bytecode compilation worker exiting: Ok(())
#15 300.9 + echo Bytecode compilation timed out
#15 300.9 Bytecode compilation timed out
#15 300.9 Retrying
#15 300.9 + echo Retrying
#15 300.9 + timeout 5m uv sync --verbose --no-dev --no-install-project --compile-bytecode
#15 300.9 <jemalloc>: MADV_DONTNEED does not work (memset will be used instead)
#15 300.9 <jemalloc>: (This is the expected behaviour if you are running under QEMU)
#15 301.1 DEBUG uv 0.6.1
#15 301.1 DEBUG Found project root: `/project`
#15 301.1 DEBUG No workspace root found, using project root
#15 301.1 DEBUG Acquired lock for `/project`
#15 301.1 DEBUG Using Python request `3.12` from explicit request
#15 301.1 DEBUG Checking for Python environment at `/app`
#15 301.2 DEBUG The virtual environment's Python version satisfies `3.12`
#15 301.2 DEBUG Released lock at `/tmp/uv-b95abc02d7f2ad9b.lock`
#15 301.2 DEBUG Using request timeout of 30s
#15 301.2 DEBUG Found static `pyproject.toml` for: my_project @ file:///project
#15 301.2 DEBUG No workspace root found, using project root
#15 301.2 DEBUG Existing `uv.lock` satisfies workspace requirements
#15 301.2 Resolved 69 packages in 38ms
#15 301.2 DEBUG Omitting `my_project` from resolution due to `--no-install-project`
#15 301.3 DEBUG Using request timeout of 30s
#15 301.3 DEBUG Requirement already installed: alembic==1.14.0
...
#15 301.3 DEBUG Requirement already installed: six==1.17.0
#15 301.3 DEBUG Starting 16 bytecode compilation workers
#15 304.1 DEBUG Bytecode compilation worker exiting: Ok(())
#15 304.1 DEBUG Bytecode compilation worker exiting: Ok(())
#15 304.1 DEBUG Bytecode compilation worker exiting: Ok(())
#15 304.1 DEBUG Bytecode compilation worker exiting: Ok(())
#15 304.1 DEBUG Bytecode compilation worker exiting: Ok(())
#15 304.1 DEBUG Bytecode compilation worker exiting: Ok(())
#15 304.1 DEBUG Bytecode compilation worker exiting: Ok(())
#15 304.1 DEBUG Bytecode compilation worker exiting: Ok(())
#15 304.1 DEBUG Bytecode compilation worker exiting: Ok(())
#15 304.1 DEBUG Bytecode compilation worker exiting: Ok(())
#15 304.1 DEBUG Bytecode compilation worker exiting: Ok(())
#15 304.1 DEBUG Bytecode compilation worker exiting: Ok(())
#15 304.1 DEBUG Bytecode compilation worker exiting: Ok(())
#15 304.1 DEBUG Bytecode compilation worker exiting: Ok(())
#15 304.1 DEBUG Bytecode compilation worker exiting: Ok(())
#15 304.1 DEBUG Bytecode compilation worker exiting: Ok(())
#15 304.1 Bytecode compiled 864 files in 2.87s
#15 DONE 304.2s
I will try to provide more details on local reproduction of the problem later.
I've made a minimal reproducer for this: https://github.com/konstin/gh11699
It looks like the specific combination needed is aarch64 through QEMU in a Docker container.
I tried to debug this with gdb but couldn't get it to work. Running gdb in the container errors out due to QEMU limitations, and attaching gdb from the host directly can't resolve any symbols. Using the recommended gdbserver just hangs for me (using debug.sh from the repo in the container):
$ gdb-multiarch
(gdb) set debug remote 1
(gdb) target remote localhost:7777
Remote debugging using localhost:7777
[remote] start_remote_1: enter
[remote] Sending packet: $qSupported:multiprocess+;swbreak+;hwbreak+;qRelocInsn+;fork-events+;vfork-events+;exec-events+;vContSupported+;QThreadEvents+;QThreadOptions+;no-resumed+;memory-tagging+;xmlRegisters=i386#72
[remote] Sending packet: $qSupported:multiprocess+;swbreak+;hwbreak+;qRelocInsn+;fork-events+;vfork-events+;exec-events+;vContSupported+;QThreadEvents+;QThreadOptions+;no-resumed+;memory-tagging+;xmlRegisters=i386#72
[remote] Sending packet: $qSupported:multiprocess+;swbreak+;hwbreak+;qRelocInsn+;fork-events+;vfork-events+;exec-events+;vContSupported+;QThreadEvents+;QThreadOptions+;no-resumed+;memory-tagging+;xmlRegisters=i386#72
[remote] Sending packet: $qSupported:multiprocess+;swbreak+;hwbreak+;qRelocInsn+;fork-events+;vfork-events+;exec-events+;vContSupported+;QThreadEvents+;QThreadOptions+;no-resumed+;memory-tagging+;xmlRegisters=i386#72
[remote] getpkt: Timed out.
[remote] getpkt: Timed out.
[remote] getpkt: Timed out.
Ignoring packet error, continuing...
[remote] packet_ok: Packet qSupported (supported-packets) is supported
warning: unrecognized item "timeout" in "qSupported" response
[remote] Sending packet: $vCont?#49
[remote] Sending packet: $vCont?#49
[remote] Sending packet: $vCont?#49
[remote] Sending packet: $vCont?#49
[remote] getpkt: Timed out.
[remote] getpkt: Timed out.
[remote] getpkt: Timed out.
Ignoring packet error, continuing...
[remote] packet_ok: Packet vCont (verbose-resume) is supported
[remote] Sending packet: $vMustReplyEmpty#3a
[remote] Sending packet: $vMustReplyEmpty#3a
[remote] Sending packet: $vMustReplyEmpty#3a
[remote] Sending packet: $vMustReplyEmpty#3a
[remote] getpkt: Timed out.
[remote] getpkt: Timed out.
[remote] getpkt: Timed out.
Ignoring packet error, continuing...
[remote] start_remote_1: exit
Remote replied unexpectedly to 'vMustReplyEmpty': timeout
I'm only getting this timeout when the gdbserver is running; if it isn't running, it's a plain could not connect: Connection timed out. Both container and host are running Ubuntu 24.04.
I think I've tracked this down to a fairly straightforward bug in qemu-user and reported it here: https://gitlab.com/qemu-project/qemu/-/issues/2846
Basically there's an internal structure in qemu-user that tracks open FDs and translates them for the guest process, and they use a lock to protect that structure. But you're not supposed to mix and match locks and fork, and so if your emulated program has one thread that is in the middle of opening or closing a file descriptor while another thread forks, nothing will ever unlock the lock in the child, and so it will deadlock as soon as it tries to open or close a file descriptor.
I think they just didn't consider the fact that they themselves call fork (to handle the guest's fork) when they added the lock. One "solution" would be to revert the patch adding the lock, but presumably that will cause a different race condition.
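To illustrate the hazard in miniature (illustrative Rust using the libc crate; this has nothing to do with QEMU's actual C code):

use std::sync::{Arc, Mutex};
use std::{thread, time::Duration};

fn main() {
    let lock = Arc::new(Mutex::new(()));

    // Stand-in for a thread that is mid-way through the fd-table update,
    // i.e. currently holding the internal lock.
    let held = Arc::clone(&lock);
    thread::spawn(move || {
        let _guard = held.lock().unwrap();
        thread::sleep(Duration::from_secs(5));
    });
    thread::sleep(Duration::from_millis(100)); // make sure the lock is taken

    // Stand-in for the guest program calling fork(). Only the forking thread
    // survives in the child, but the lock's memory is copied in the locked state.
    let pid = unsafe { libc::fork() };
    if pid == 0 {
        // Child: blocks forever -- nothing will ever release the inherited lock.
        let _guard = lock.lock().unwrap();
        unreachable!("the child never gets the lock");
    }

    // Parent: give the deadlock a moment to demonstrate itself, then clean up.
    thread::sleep(Duration::from_secs(1));
    unsafe { libc::kill(pid, libc::SIGKILL) };
    println!("child {pid} was stuck on the inherited lock, as expected");
}

The child ends up waiting on a futex that nobody will ever wake, which matches the futex waits observed earlier in this thread.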
A more practical workaround would be to reduce the concurrency of bytecode compilation with UV_CONCURRENT_INSTALLS=1, new in 0.6.3 (#11615).
For what it's worth, gdb tends to get needlessly confused by Linux namespaces / containers. In this case, since qemu-user-static exists outside of the container, there's no need for gdb to even bother. You can force it to skip its misguided detection logic by giving it a full path, like gdb -p 12345 /usr/bin/qemu-user-static (from outside the container). No gdbserver is necessary. On Ubuntu, I had to set up ddebs (debuginfod didn't seem to work for qemu-user-static), apt install qemu-user-static-dbgsym, and also do a sudo apt-get source qemu inside /usr/src and rename the directory appropriately to get source code.
Very nice work @geofft -- I also read https://gitlab.com/qemu-project/qemu/-/issues/2846 with pleasure.