reqwest
Request never connects on `armhf`
Hi, this was originally discussed in https://github.com/tokio-rs/mio/issues/1089 where we decided that it probably made sense to migrate the discussion to here.
In brief -- A friend (@cjwatson) and I have been diagnosing a fault in rustup on armhf in Snapcraft's build environment. It seems to sit for 30s trying to connect and then fails. This only seems to happen on armhf -- on other platforms it connects just fine.
An strace of the attempt shows:
[pid 3517] 06:37:57.516581 futex(0xf933b8, FUTEX_WAIT_PRIVATE, 0, {tv_sec=29, tv_nsec=990974355} <unfinished ...>
[pid 3518] 06:37:57.516671 <... fcntl64 resumed> ) = 0x2 (flags O_RDWR)
[pid 3518] 06:37:57.516762 fcntl64(7, F_SETFL, O_RDWR|O_NONBLOCK) = 0
[pid 3518] 06:37:57.516894 connect(7, {sa_family=AF_INET, sin_port=htons(8222), sin_addr=inet_addr("10.10.10.1")}, 16) = -1 EINPROGRESS (Operation now in progress)
[pid 3518] 06:37:57.521838 epoll_ctl(4, EPOLL_CTL_ADD, 7, {EPOLLIN|EPOLLPRI|EPOLLOUT|EPOLLET, {u32=0, u64=0}}) = 0
[pid 3517] 06:38:27.507984 <... futex resumed> ) = -1 ETIMEDOUT (Connection timed out)
Further straces show the epoll_ctl() call takes microseconds, so it's not actually stuck in it for 30s, but the thread which made the epoll_ctl() call subsequently did nothing. (The trace is attached to the mio bug, so I won't reattach it here.)
Interestingly, in that strace we never get to epoll_wait() on armhf.
I had previously assumed it was probably mio at fault, but the discussion there suggests it's more likely in the reqwest/tokio interfacing, so I brought the issue here to discuss further.
Hm, do you have easy access to an armhf machine so we can work through this together?
The futex wait is because the main thread is parking until the async runtime thread makes progress and returns a Response. If we want to eliminate that as a problem, we could try just running the async example. If that doesn't work, then I'd suspect the issue is lower in the stack, either tokio or mio.
I don't have real armhf hardware to hand to try arbitrary stuff on -- those straces came from the snapcraft build infrastructure itself. I will see if I can replicate the issue running in qemu-user-static on my laptop. If I can, I'll see if that example has similar issues.
If somebody can work out how to wedge the relevant test code into snapcraft then I can also try running it on our infrastructure.
I have failed to replicate the issue on my x86 laptop using qemu-user, so I imagine we will have to try @cjwatson's idea -- the problem is, I don't know how I'd go about that. I'll also see if I can fake the number of CPUs reported to rustup, in case something is spawning ncpus threads for the worker pool and that's what's going on.
Even isolating to a 1-CPU VM, I couldn't replicate it on x86_64 with qemu-user, so we're going to have to try something else. I am firing up an armhf instance on Scaleway (or at least trying to) to see if I can replicate it there.
I failed to replicate it myself. I wonder if it has something to do with the virtualisation done for Snapcraft, combined with something else in reqwest's stack? @seanmonstar, does the upstream label mean you've filed another bug elsewhere?
I have a Raspberry Pi I can try to reproduce this on, if you think it would help.
The upstream label is a guess that it's either in mio or tokio. Neither reqwest nor hyper have conditional code per target.
Aah, as per the original post, I first discussed this with the mio folks in https://github.com/tokio-rs/mio/issues/1089 and they suggested here. I'm now worried that no one knows what's going on. I suspect it's not so much platform-specific as an interaction between something "interesting" on armhf and the particular setup Snapcraft is using. The oddness was that epoll_ctl() was called, but then the epoll was never checked, which perhaps points at an executor with too few threads?
If you have test cases, I can help wrangle them into Snapcraft with whatever tracing/debugging is needed, so we can run them on the infrastructure exhibiting the issue. (I am affected, as one of my snaps fails in this way; @cjwatson sent me here and I'd like to help where I can.)
A good first step would be trying the async example, which would help determine if the issue is about the blocking API not allowing epoll to run.
@popey is there any progress? Edit: The async example runs fine on an aarch64-linux-gnu machine. I don't have armhf hardware to test it on.
EDIT: sorry, I didn't see kinnison's work using QEMU on this. Not sure if others can reproduce the issue by trying different settings, like 2+ CPUs, etc.
Running Ubuntu 16.04.1 armhf on Qemu
https://gist.github.com/takeshixx/686a4b5e057deff7892913bf69bcb85a
This is a writeup about how to install Ubuntu 16.04.1 Xenial Xerus for the 32-bit hard-float ARMv7 (armhf) architecture on a Qemu VM via Ubuntu netboot.
The setup will create a Ubuntu VM with LPAE extensions (generic-lpae) enabled. However, this writeup should also work for non-LPAE (generic) kernels.
The performance of the resulting VM is quite good, and it allows VMs with >1G ram ... The netboot files are available on the official Ubuntu mirror.
First comment on this gist is from Nov 2016 but there are comments as recent as April 20, 2020 that solve networking issues some people had.
@x448 Thanks, but the issue is in the Snapcraft builder VMs, so I'd guess Canonical are okay at configuring qemu properly; and since it tends to work for everything else, I remain confused as to why reqwest fails.
We're also using actual hardware, not ARM-on-x86. qemu is still involved, but unlikely to be very much related to that gist.
We may possibly have got to the bottom of this. See https://github.com/lxc/lxcfs/issues/553.
Looks like we should close this off, thank you @cjwatson and @seanmonstar for your efforts.