Strange WouldBlock errors.
We're trying to convert our code in lemmy to use attohttpc (we were using isahc previously), but when doing a lot of concurrent attohttpc GETs and POSTs (in actix threads), we're getting a lot of WouldBlock errors:
Internal Server Error: Error(Io(Os { code: 11, kind: WouldBlock, message: "Resource temporarily unavailable" })), in apub::fetcher::fetch_remote_object()
We also tried disabling TLS (since that's the only instance of WouldBlock we could find), but it didn't help. For some reason isahc doesn't have these issues.
Related issue: https://github.com/LemmyNet/lemmy/issues/840
Can you check what the ulimit is for file descriptors on the system you're using? I think the default is something like 2048 on a lot of systems.
It's running in Docker (on an Alpine image); I just checked:
$ ulimit
unlimited
/ $ ulimit -a
core file size (blocks) (-c) unlimited
data seg size (kb) (-d) unlimited
scheduling priority (-e) 0
file size (blocks) (-f) unlimited
pending signals (-i) 47456
max locked memory (kb) (-l) 64
max memory size (kb) (-m) unlimited
open files (-n) 1048576
POSIX message queues (bytes) (-q) 819200
real-time priority (-r) 0
stack size (kb) (-s) 8192
cpu time (seconds) (-t) unlimited
max user processes (-u) unlimited
virtual memory (kb) (-v) unlimited
file locks (-x) unlimited
Hmm ok. 1048576 file descriptors should be plenty. Before we investigate further, can you try an image that uses glibc instead of musl, like Debian?
For some reason isahc doesn't have these issues.
One big difference is that, because isahc uses curl, it probably ends up doing connection pooling and hence reusing ports, while we always open a new connection for each request (actually multiple new connections, due to happy eyeballs). So it could be that you are running out of ephemeral ports? Could you check whether you have a lot of TCP connections in the TIME_WAIT state, or whether enlarging the available port range helps?
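Here's a quick Rust sketch of that first check (just a diagnostic sketch, assuming Linux, where socket states are listed in /proc/net/tcp and /proc/net/tcp6 and "06" means TIME_WAIT); running ss -tan state time-wait on the host shows the same information:

use std::fs;

// Diagnostic sketch only: count sockets currently in TIME_WAIT by parsing
// /proc/net/tcp and /proc/net/tcp6 (the 4th column is the state, "06" = TIME_WAIT).
fn main() {
    let mut time_wait = 0usize;
    for path in ["/proc/net/tcp", "/proc/net/tcp6"].iter() {
        if let Ok(contents) = fs::read_to_string(path) {
            for line in contents.lines().skip(1) {
                // Columns: sl local_address rem_address st tx_queue:rx_queue ...
                if line.split_whitespace().nth(3) == Some("06") {
                    time_wait += 1;
                }
            }
        }
    }
    println!("sockets in TIME_WAIT: {}", time_wait);
}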
I took a look at the man pages for the syscalls we use. We don't use nonblocking sockets, so that pretty much eliminates send and recv. That leaves connect, which says that there could be insufficient entries in the routing cache.
connect
EAGAIN For nonblocking UNIX domain sockets, the socket is
nonblocking, and the connection cannot be completed
immediately. For other socket families, there are
insufficient entries in the routing cache.
send
EAGAIN or EWOULDBLOCK
The socket is marked nonblocking and the requested operation
would block. POSIX.1-2001 allows either error to be returned
for this case, and does not require these constants to have
the same value, so a portable application should check for
both possibilities.
recv
EAGAIN or EWOULDBLOCK
The socket is marked nonblocking and the receive operation
would block, or a receive timeout had been set and the timeout
expired before data was received. POSIX.1 allows either error
to be returned for this case, and does not require these
constants to have the same value, so a portable application
should check for both possibilities.
We don't use nonblocking sockets
But we do use connect_timeout, and that makes the socket non-blocking temporarily, cf. https://doc.rust-lang.org/stable/src/std/sys/unix/net.rs.html#110
Hence it could be that you are running out of ports?
This is probably nonsense, as connect should yield EADDRNOTAVAIL (io::ErrorKind::AddrNotAvailable) in that case.
I think I found our culprit: https://doc.rust-lang.org/stable/std/net/struct.TcpStream.html#platform-specific-behavior-1. So a read with a read timeout set will yield EWOULDBLOCK on Linux, and we do have a default timeout of 30s. :facepalm:
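For reference, this is easy to reproduce outside attohttpc with plain std::net (a minimal sketch; the listener thread just accepts and never writes anything back):

use std::io::Read;
use std::net::{TcpListener, TcpStream};
use std::thread;
use std::time::Duration;

// Minimal repro of the linked platform-specific behavior: on Linux, reading from
// a TcpStream with a read timeout set returns ErrorKind::WouldBlock (EAGAIN)
// once the timeout elapses, rather than ErrorKind::TimedOut.
fn main() -> std::io::Result<()> {
    // Listener that accepts the connection but never sends any data.
    let listener = TcpListener::bind("127.0.0.1:0")?;
    let addr = listener.local_addr()?;
    thread::spawn(move || {
        let _conn = listener.accept();
        thread::sleep(Duration::from_secs(5));
    });

    let mut stream = TcpStream::connect(addr)?;
    stream.set_read_timeout(Some(Duration::from_millis(200)))?;

    let mut buf = [0u8; 16];
    match stream.read(&mut buf) {
        Err(e) => println!("read failed, kind = {:?}: {}", e.kind(), e),
        Ok(n) => println!("unexpectedly read {} bytes", n),
    }
    Ok(())
}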
@dessalines Could you try to modify that value using RequestBuilder::read_timeout or Session::read_timeout and maybe try something really large for a start?
Ya I'll try that in a bit here.
Yep, tried upping to a 300-second read timeout; it failed after that full time was up, with the same error:
lemmy_beta_1 | [2020-06-25T22:16:05Z ERROR actix_http::response] Internal Server Error: Error(Io(Os { code: 11, kind: WouldBlock, message: "Resource temporarily unavailable" }))
...
lemmy_alpha_1 | [2020-06-25T22:16:05Z ERROR lemmy_server::websocket::server] Error during message handling Io Error: Resource temporarily unavailable (os error 11)
lemmy_gamma_1 | [2020-06-25T22:16:03Z ERROR actix_http::response] Internal Server Error: Error(Io(Kind(TimedOut)))
I think all the posts are going through, but the gets are failing:
let timeout = Duration::from_secs(300);
let text: String = attohttpc::get(url.as_str())
.header("Accept", APUB_JSON_CONTENT_TYPE)
.connect_timeout(timeout)
.read_timeout(timeout)
.timeout(timeout)
// .body(())
.send()?
.text()?;
let res: Response = serde_json::from_str(&text)?;
This full error log probably won't be that useful, but since most of the errors are tokio / actix related, my guess is there's some thread fighting between actix and attohttpc that isahc for some reason doesn't have.
my guess is there's some thread fighting between actix and attohttpc that isahc for some reason doesn't have.
attohttpc is completely synchronous and blocks the current thread, whereas isahc is an asynchronous binding to curl that lets tokio/actix continue making progress on that thread. Do you use actix's SyncArbiter or tokio's spawn_blocking to avoid starving its tasks?
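Something along these lines would keep the blocking call off the executor threads (just a sketch, assuming a tokio runtime as used by actix; fetch_remote_object here is a simplified stand-in for the real function in apub::fetcher, and the 30s timeouts are placeholders):

use std::time::Duration;

// Sketch: run the blocking attohttpc call on tokio's blocking thread pool so the
// async executor threads keep making progress. actix_web::web::block would serve
// the same purpose.
async fn fetch_remote_object(url: String) -> Result<String, Box<dyn std::error::Error + Send + Sync>> {
    let text = tokio::task::spawn_blocking(move || {
        attohttpc::get(url.as_str())
            .connect_timeout(Duration::from_secs(30))
            .read_timeout(Duration::from_secs(30))
            .send()?
            .text()
    })
    .await??; // first ? is the join error, second is the attohttpc error
    Ok(text)
}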
(This might be considered bad marketing or something, but since you already use actix and seem to be motivated by reducing your dependencies, have you considered using actix's own HTTP client, awc?)
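A rough sketch of the same GET with awc, assuming the awc 1.x/2.x API of that era (fetch_with_awc and the Accept value are placeholders, not lemmy's real code, and error handling is elided); because awc is async, nothing blocks the actix worker thread while the request is in flight:

// Sketch only: the awc client is async, so awaiting the request lets the actix
// worker keep processing other tasks instead of blocking on the socket.
async fn fetch_with_awc(url: &str) -> String {
    let client = awc::Client::default();
    let mut response = client
        .get(url)
        .header("Accept", "application/activity+json")
        .send()
        .await
        .expect("request failed");
    let body = response.body().await.expect("failed to read body");
    String::from_utf8_lossy(&body).into_owned()
}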
I've been investigating a potentially similar problem, but I believe I've been able to rule out attohttpc as the cause. In my case I wasn't running requests in parallel, and the failures happen roughly 5% of the time.
I only see this issue with one (internal) service, and when testing with curl I originally did not see the problem. But then I used curl's --http1.1 option, and that did allow me to reproduce it. I suspect there is some misconfiguration in our backend infrastructure causing this.
It seems both isahc and awc support HTTP/2, so it could be that switching to them fixes the problem by avoiding HTTP/1.1 bugs in whatever backend service you are connecting to.