Strange WouldBlock errors.
We're trying to convert our code in lemmy to use attohttpc (we were using isahc previously), but when doing a lot of concurrent attohttpc GETs and POSTs (in actix threads), we're getting a lot of WouldBlock errors:
Internal Server Error: Error(Io(Os { code: 11, kind: WouldBlock, message: "Resource temporarily unavailable" })), in apub::fetcher::fetch_remote_object()
We also tried disabling TLS (since that's the only instance of WouldBlock we could find), but it didn't help. For some reason isahc doesn't have these issues.
Related issue: https://github.com/LemmyNet/lemmy/issues/840
Can you check what the ulimit is for file descriptors on the system you're using? I think the default is something like 2048 on a lot of systems.
It's running in Docker (on an Alpine image); I just checked:
$ ulimit
unlimited
/ $ ulimit -a
core file size (blocks) (-c) unlimited
data seg size (kb) (-d) unlimited
scheduling priority (-e) 0
file size (blocks) (-f) unlimited
pending signals (-i) 47456
max locked memory (kb) (-l) 64
max memory size (kb) (-m) unlimited
open files (-n) 1048576
POSIX message queues (bytes) (-q) 819200
real-time priority (-r) 0
stack size (kb) (-s) 8192
cpu time (seconds) (-t) unlimited
max user processes (-u) unlimited
virtual memory (kb) (-v) unlimited
file locks (-x) unlimited
Hmm ok. 1048576 file descriptors should be plenty. Before we investigate further, can you try an image that uses glibc instead of musl, like Debian?
For some reason isahc doesn't have these issues.
One big difference is that, because isahc uses curl, it probably ends up doing connection pooling and hence reusing ports, while we always open a new connection for each request (actually multiple new connections, due to happy eyeballs). So it could be that you are running out of ephemeral ports? Could you check whether you have a lot of TCP connections in the TIME_WAIT state, or whether enlarging the available port range helps?
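Here's a quick Rust sketch of that first check (just a diagnostic sketch, assuming Linux, where socket states are listed in /proc/net/tcp and /proc/net/tcp6 and "06" means TIME_WAIT); running ss -tan state time-wait on the host shows the same information:

use std::fs;

// Diagnostic sketch only: count sockets currently in TIME_WAIT by parsing
// /proc/net/tcp and /proc/net/tcp6 (the 4th column is the state, "06" = TIME_WAIT).
fn main() {
    let mut time_wait = 0usize;
    for path in ["/proc/net/tcp", "/proc/net/tcp6"].iter() {
        if let Ok(contents) = fs::read_to_string(path) {
            for line in contents.lines().skip(1) {
                // Columns: sl local_address rem_address st tx_queue:rx_queue ...
                if line.split_whitespace().nth(3) == Some("06") {
                    time_wait += 1;
                }
            }
        }
    }
    println!("sockets in TIME_WAIT: {}", time_wait);
}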
I took a look at the man pages for the syscalls we use. We don't use nonblocking sockets, so that pretty much eliminates send and recv. That leaves connect, which says that there could be insufficient entries in the routing cache.
connect
EAGAIN For nonblocking UNIX domain sockets, the socket is
nonblocking, and the connection cannot be completed
immediately. For other socket families, there are
insufficient entries in the routing cache.
send
EAGAIN or EWOULDBLOCK
The socket is marked nonblocking and the requested operation
would block. POSIX.1-2001 allows either error to be returned
for this case, and does not require these constants to have
the same value, so a portable application should check for
both possibilities.
recv
EAGAIN or EWOULDBLOCK
The socket is marked nonblocking and the receive operation
would block, or a receive timeout had been set and the timeout
expired before data was received. POSIX.1 allows either error
to be returned for this case, and does not require these
constants to have the same value, so a portable application
should check for both possibilities.
We don't use nonblocking sockets
But we do use connect_timeout, and that makes the socket non-blocking temporarily, cf. https://doc.rust-lang.org/stable/src/std/sys/unix/net.rs.html#110
Hence it could be that you are running out of ports?
This is probably nonsense, as connect should yield EADDRNOTAVAIL (io::ErrorKind::AddrNotAvailable) in that case.
I think I found our culprit: https://doc.rust-lang.org/stable/std/net/struct.TcpStream.html#platform-specific-behavior-1. So a read with a read timeout set will yield EWOULDBLOCK on Linux, and we do have a default timeout of 30s. :facepalm:
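For reference, this is easy to reproduce outside attohttpc with plain std::net (a minimal sketch; the listener thread just accepts and never writes anything back):

use std::io::Read;
use std::net::{TcpListener, TcpStream};
use std::thread;
use std::time::Duration;

// Minimal repro of the linked platform-specific behavior: on Linux, reading from
// a TcpStream with a read timeout set returns ErrorKind::WouldBlock (EAGAIN)
// once the timeout elapses, rather than ErrorKind::TimedOut.
fn main() -> std::io::Result<()> {
    // Listener that accepts the connection but never sends any data.
    let listener = TcpListener::bind("127.0.0.1:0")?;
    let addr = listener.local_addr()?;
    thread::spawn(move || {
        let _conn = listener.accept();
        thread::sleep(Duration::from_secs(5));
    });

    let mut stream = TcpStream::connect(addr)?;
    stream.set_read_timeout(Some(Duration::from_millis(200)))?;

    let mut buf = [0u8; 16];
    match stream.read(&mut buf) {
        Err(e) => println!("read failed, kind = {:?}: {}", e.kind(), e),
        Ok(n) => println!("unexpectedly read {} bytes", n),
    }
    Ok(())
}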
@dessalines Could you try to modify that value using RequestBuilder::read_timeout or Session::read_timeout and maybe try something really large for a start?
Ya I'll try that in a bit here.
Yep, tried upping to a 300-second read timeout; it failed after that full time was up, with the same error:
lemmy_beta_1 | [2020-06-25T22:16:05Z ERROR actix_http::response] Internal Server Error: Error(Io(Os { code: 11, kind: WouldBlock, message: "Resource temporarily unavailable" }))
...
lemmy_alpha_1 | [2020-06-25T22:16:05Z ERROR lemmy_server::websocket::server] Error during message handling Io Error: Resource temporarily unavailable (os error 11)
lemmy_gamma_1 | [2020-06-25T22:16:03Z ERROR actix_http::response] Internal Server Error: Error(Io(Kind(TimedOut)))
I think all the posts are going through, but the gets are failing:
let timeout = Duration::from_secs(300);
let text: String = attohttpc::get(url.as_str())
.header("Accept", APUB_JSON_CONTENT_TYPE)
.connect_timeout(timeout)
.read_timeout(timeout)
.timeout(timeout)
// .body(())
.send()?
.text()?;
let res: Response = serde_json::from_str(&text)?;
This full error log probably won't be that useful, but since most of the errors are tokio / actix related, my guess is there's some thread fighting between actix and attohttpc that isahc for some reason doesn't have.
my guess is there's some thread fighting between actix and attohttpc that isahc for some reason doesn't have.
attohttpc is completely synchronous and blocks the current thread, whereas isahc is an asynchronous binding to curl that lets tokio/actix continue making progress on that thread. Do you use actix's SyncArbiter or tokio's spawn_blocking to avoid starving its tasks?
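Something along these lines would keep the blocking call off the executor threads (just a sketch, assuming a tokio runtime as used by actix; fetch_remote_object here is a simplified stand-in for the real function in apub::fetcher, and the 30s timeouts are placeholders):

use std::time::Duration;

// Sketch: run the blocking attohttpc call on tokio's blocking thread pool so the
// async executor threads keep making progress. actix_web::web::block would serve
// the same purpose.
async fn fetch_remote_object(url: String) -> Result<String, Box<dyn std::error::Error + Send + Sync>> {
    let text = tokio::task::spawn_blocking(move || {
        attohttpc::get(url.as_str())
            .connect_timeout(Duration::from_secs(30))
            .read_timeout(Duration::from_secs(30))
            .send()?
            .text()
    })
    .await??; // first ? is the join error, second is the attohttpc error
    Ok(text)
}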
(This might be considered bad marketing or something, but since you already use actix and seem to be motivated by reducing your dependencies, have you considered using actix's own HTTP client, awc?)
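A rough sketch of the same GET with awc, assuming the awc 1.x/2.x API of that era (fetch_with_awc and the Accept value are placeholders, not lemmy's real code, and error handling is elided); because awc is async, nothing blocks the actix worker thread while the request is in flight:

// Sketch only: the awc client is async, so awaiting the request lets the actix
// worker keep processing other tasks instead of blocking on the socket.
async fn fetch_with_awc(url: &str) -> String {
    let client = awc::Client::default();
    let mut response = client
        .get(url)
        .header("Accept", "application/activity+json")
        .send()
        .await
        .expect("request failed");
    let body = response.body().await.expect("failed to read body");
    String::from_utf8_lossy(&body).into_owned()
}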
I've been investigating a potentially similar problem, but I believe I've been able to rule out attohttpc as the cause. In my case I wasn't running requests in parallel, and the failures happen roughly 5% of the time.
I only see this issue with one (internal) service, and when testing with curl I originally did not see the problem. But then I used curl's --http1.1 option, and that did allow me to reproduce it. I suspect there is some misconfiguration in our backend infrastructure causing this.
It seems both isahc and awc support HTTP/2, so it could be that switching to them fixes the problem by avoiding HTTP/1.1 bugs in whatever backend service you are connecting to.