
Request never connects on `armhf`

Open kinnison opened this issue 6 years ago • 16 comments

Hi, this was originally discussed in https://github.com/tokio-rs/mio/issues/1089 where we decided that it probably made sense to migrate the discussion to here.

In brief -- A friend (@cjwatson) and I have been diagnosing a fault in rustup on armhf in Snapcraft's build environment. It seems to sit for 30s trying to connect and then fails. This only seems to happen on armhf -- on other platforms it connects just fine.

An strace of the attempt shows:

[pid  3517] 06:37:57.516581 futex(0xf933b8, FUTEX_WAIT_PRIVATE, 0, {tv_sec=29, tv_nsec=990974355} <unfinished ...>
[pid  3518] 06:37:57.516671 <... fcntl64 resumed> ) = 0x2 (flags O_RDWR)
[pid  3518] 06:37:57.516762 fcntl64(7, F_SETFL, O_RDWR|O_NONBLOCK) = 0
[pid  3518] 06:37:57.516894 connect(7, {sa_family=AF_INET, sin_port=htons(8222), sin_addr=inet_addr("10.10.10.1")}, 16) = -1 EINPROGRESS (Operation now in progress)
[pid  3518] 06:37:57.521838 epoll_ctl(4, EPOLL_CTL_ADD, 7, {EPOLLIN|EPOLLPRI|EPOLLOUT|EPOLLET, {u32=0, u64=0}}) = 0
[pid  3517] 06:38:27.507984 <... futex resumed> ) = -1 ETIMEDOUT (Connection timed out)

(Further straces show the epoll_ctl() call takes microseconds, so it's not actually stuck in it for 30s; rather, the thread which did the epoll_ctl() call subsequently did nothing. The trace is attached to the mio bug, so I won't reattach it here.)

Interestingly in that strace we never get to epoll_wait() on armhf.

I had previously assumed it was probably mio at fault, but the discussion there suggests it's more likely in the reqwest/tokio interfacing, so I brought the issue here to discuss further.

kinnison avatar Sep 18 '19 14:09 kinnison

Hm, do you have easy access to an armhf machine so we can work through this together?

The futex wait is because the main thread is parking until the async runtime thread makes progress and returns a Response. If we want to eliminate that as a problem, we could try just running the async example. If that doesn't work, then I'd suspect the issue is lower in the stack, either tokio or mio.
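That layering can be sketched with plain std primitives (a hypothetical illustration of the pattern, not reqwest's actual code): the calling thread parks on a channel wait — which shows up in strace as the `futex(..., FUTEX_WAIT_PRIVATE, ...)` call — while a runtime thread is supposed to drive the I/O. If the runtime thread never makes progress after registering the socket with epoll, the caller's wait simply times out, exactly as in the trace above.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

fn main() {
    let (tx, rx) = mpsc::channel();

    // Stand-in for the async runtime thread that would register the
    // socket with epoll_ctl() and drive the connect to completion.
    thread::spawn(move || {
        // ... epoll registration and I/O would happen here ...
        tx.send("response").unwrap();
    });

    // Stand-in for the blocking caller: it parks (the FUTEX_WAIT in the
    // strace) until the runtime thread hands back a result, or times out
    // after ~30s (the ETIMEDOUT in the strace).
    match rx.recv_timeout(Duration::from_secs(30)) {
        Ok(r) => println!("got {}", r),
        Err(_) => println!("timed out"),
    }
}
```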

seanmonstar avatar Sep 19 '19 17:09 seanmonstar

I don't have real armhf hardware to hand to try arbitrary stuff on -- those straces came from the snapcraft build infrastructure itself. I will see if I can replicate the issue running in qemu-user-static on my laptop. If I can, I'll see if that example has similar issues.

kinnison avatar Sep 20 '19 07:09 kinnison

If somebody can work out how to wedge the relevant test code into snapcraft then I can also try running it on our infrastructure.

cjwatson avatar Sep 20 '19 07:09 cjwatson

I have failed to replicate the issue on my x86 laptop using qemu-user, so I imagine we will have to try @cjwatson's idea -- the problem is, I don't know how to go about that. I'll also see if I can fake the number of CPUs reported to rustup, in case something is spawning ncpus threads for the worker pool and that's what's going on.
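For context, the kind of worker-pool sizing I'm suspicious of looks roughly like this (a hypothetical sketch using the modern stdlib call; runtimes of this era typically used the `num_cpus` crate instead):

```rust
use std::thread;

fn main() {
    // Runtimes commonly size their worker pool from the detected CPU
    // count. If the environment (e.g. a container layer) misreports
    // that count, the pool could end up with too few threads to make
    // progress.
    let ncpus = thread::available_parallelism()
        .map(|n| n.get())
        .unwrap_or(1);
    println!("spawning {} worker threads", ncpus);
}
```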

kinnison avatar Sep 20 '19 07:09 kinnison

Even isolating to a 1-CPU VM, I couldn't replicate it on x86_64 with qemu-user, so we're going to have to try something else. I am firing up an armhf instance on Scaleway (or at least trying to) to see if I can replicate it there.

kinnison avatar Sep 20 '19 08:09 kinnison

I failed to replicate it myself. I wonder if it has something to do with the virtualisation that is done for Snapcraft, combined with something else in reqwest's stack? @seanmonstar does the upstream label suggest you've filed another bug elsewhere?

kinnison avatar Oct 02 '19 19:10 kinnison

I have a Raspberry Pi I can try to reproduce this on, if you think it would help.

lnicola avatar Oct 02 '19 19:10 lnicola

The upstream label is a guess that it's either in mio or tokio. Neither reqwest nor hyper have conditional code per target.

seanmonstar avatar Oct 02 '19 20:10 seanmonstar

Aah, as per the original post, I first discussed this with the mio folks in https://github.com/tokio-rs/mio/issues/1089 and they suggested moving it here. I'm now worried that no one knows what's going on. I'm not sure it'll be platform-specific so much as perhaps an interaction between something "interesting" on armhf and the particular size of the system Snapcraft is using. The oddness was that epoll_ctl() was called, but then the epoll was never checked, which points perhaps at an executor with too few threads?

kinnison avatar Oct 02 '19 20:10 kinnison

If you have test cases I can help wrangle them into snapcraft with whatever tracing / debugging is needed so we can run that on the infrastructure exhibiting the issue. (I am affected as one of my snaps fails in this way - @cjwatson sent me this way and I'd like to help where I can).

popey avatar Oct 09 '19 16:10 popey

A good first step would be trying the async example, which would help determine if the issue is about the blocking API not allowing epoll to run.

seanmonstar avatar Oct 09 '19 16:10 seanmonstar

@popey is there any progress? Edit: The async example runs fine on an aarch64-linux-gnu machine. I don't have armhf hardware to test it.

tesuji avatar Dec 10 '19 04:12 tesuji

EDIT: sorry, I didn't see kinnison's work using QEMU on this. Not sure if others can reproduce the issue by trying different settings like 2+ CPUs, etc.

Running Ubuntu 16.04.1 armhf on Qemu
https://gist.github.com/takeshixx/686a4b5e057deff7892913bf69bcb85a

This is a writeup about how to install Ubuntu 16.04.1 Xenial Xerus for the 32-bit hard-float ARMv7 (armhf) architecture on a Qemu VM via Ubuntu netboot.

The setup will create a Ubuntu VM with LPAE extensions (generic-lpae) enabled. However, this writeup should also work for non-LPAE (generic) kernels.

The performance of the resulting VM is quite good, and it allows VMs with >1G RAM ... The netboot files are available on the official Ubuntu mirror.

First comment on this gist is from Nov 2016 but there are comments as recent as April 20, 2020 that solve networking issues some people had.

x448 avatar Apr 27 '20 23:04 x448

@x448 Thanks, but the issue is in the Snapcraft builder VMs, so I'd guess Canonical knows how to configure qemu properly, and since it tends to work for everything else, I remain confused as to why reqwest fails.

kinnison avatar Apr 28 '20 19:04 kinnison

We're also using actual hardware, not ARM-on-x86. qemu is still involved, but unlikely to be very much related to that gist.

cjwatson avatar Apr 28 '20 20:04 cjwatson

We may possibly have got to the bottom of this. See https://github.com/lxc/lxcfs/issues/553.

cjwatson avatar Aug 25 '22 19:08 cjwatson

Looks like we should close this off, thank you @cjwatson and @seanmonstar for your efforts.

kinnison avatar Jan 12 '23 08:01 kinnison