
Actix HTTP2 error with long blocking requests and async task

Open allencamal opened this issue 3 years ago • 7 comments

Hello, and thank you very much for this crate.

Here is the gist for a reproducible example: https://gist.github.com/allencamal/cfe62dda9fc8132ba6707d5771cac7f1

Expected Behavior

It should be possible to curl the failing endpoint.

Current Behavior

When curling the failing endpoint, curl returns the following error message: curl: (18) transfer closed with 7302400 bytes remaining to read.

Possible Solution

Forcing HTTP/1.1 in curl with the --http1.1 option works, but I don't think this is a viable option. Disabling TLS also forces HTTP/1.1 and works as well, but that is not viable either. Finally, using spawn_blocking for the blocking sleep fixes the issue, but that only makes the bug more curious.
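The spawn_blocking workaround can be sketched like this (a minimal sketch, assuming actix-web 4 and the same `DATA` constant as in the gist; `/fixed` is a hypothetical route name, not from the original code):

```rust
use std::time::Duration;

// Hypothetical fixed variant of the /failing handler: the blocking sleep is
// moved onto actix's dedicated blocking-thread pool via web::block, so the
// async worker thread stays free to service the HTTP/2 connection
// (including its ping/pong frames).
#[actix_web::get("/fixed")]
async fn fixed(req: actix_web::HttpRequest) -> actix_web::Result<actix_web::HttpResponse> {
    log::info!("{:?}", req.version());
    actix_web::web::block(|| std::thread::sleep(Duration::from_secs(15)))
        .await
        .map_err(actix_web::error::ErrorInternalServerError)?;
    Ok(actix_web::HttpResponse::Ok().body(DATA.to_vec()))
}
```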

Steps to Reproduce (for bugs)

Gist: https://gist.github.com/allencamal/cfe62dda9fc8132ba6707d5771cac7f1

  1. Create a new cargo project
  2. Create a basic TLS server with actix-web and rustls
  3. Create an endpoint that spawns a blocking task, sleeps for 15 seconds with std::thread::sleep, and returns more than 13 MB of data.
  4. Curl this endpoint. curl --insecure https://localhost:8085/failing > a

Context

Here are the two endpoints:

#[actix_web::get("/working")]
async fn working(req: actix_web::HttpRequest) -> actix_web::Result<actix_web::HttpResponse> {
    log::info!("{:?}", req.version());
    std::thread::sleep(std::time::Duration::from_secs(15));
    Ok(actix_web::HttpResponse::Ok().body(DATA.to_vec()))
}

#[actix_web::get("/failing")]
async fn failing(req: actix_web::HttpRequest) -> actix_web::Result<actix_web::HttpResponse> {
    log::info!("{:?}", req.version());
    actix_web::rt::task::spawn_blocking(|| println!("Request will fail")).await;
    std::thread::sleep(std::time::Duration::from_secs(15));
    Ok(actix_web::HttpResponse::Ok().body(DATA.to_vec()))
}

I'm creating a very basic web server over TLS (so that HTTP/2 is negotiated). I've got two endpoints that basically just sleep for 15 seconds (using std's sleep, not tokio's) and return 20 MB of data. (For less data, e.g. 10 MB, the bug does not appear.)

In my experience, sleeping for less than 15 seconds (e.g. 5 seconds) also fails, but anything under 3 seconds works correctly.

The only difference between the two endpoints is that one calls: actix_web::rt::task::spawn_blocking(|| println!("Request will fail")).await; before sleeping (blocking).

If I curl my two endpoints:

 curl --insecure https://localhost:8085/working > a
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 19.0M  100 19.0M    0     0  1271k      0  0:00:15  0:00:15 --:--:-- 4709k
curl --insecure https://localhost:8085/failing > a
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
 63 19.0M   63 12.1M    0     0   815k      0  0:00:23  0:00:15  0:00:08 3109k
curl: (18) transfer closed with 7302400 bytes remaining to read

The one that spawns a blocking task fails.

Note that on another project, I have a similar issue, but the error is curl: (92) HTTP/2 stream 1 was not closed cleanly before end of the underlying stream.

Your Environment

Cargo.toml included at the beginning of the gist

  • Rust Version (i.e., output of rustc -V): rustc 1.60.0 (7737e0b5c 2022-04-04)
  • Actix Web Version: 4.0.1

allencamal avatar Apr 11 '22 10:04 allencamal

Do not use std::thread::sleep inside an async function. It blocks the thread and starves all other async tasks on the same thread; in this case you are blocking the HTTP/2 ping/pong.

fakeshadow avatar Apr 12 '22 06:04 fakeshadow

Hello, and thanks for your reply!

Indeed, blocking tasks should be sent to a thread dedicated to running blocking tasks.

But I still don't understand why the working/ endpoint works, and the failing/ doesn't. Both are aggressively using std::thread::sleep, so in my understanding, the working/ endpoint should also fail, but it doesn't.

allencamal avatar Apr 12 '22 06:04 allencamal

It's very simple: one route has an await point and the other does not. Every await point can potentially yield back to the runtime so other async tasks can run. In other words, you are trying to peek, from the outside, at how yielding affects the internal state of actix-web itself.

fakeshadow avatar Apr 12 '22 06:04 fakeshadow

Hello, and thank you again.

I understand that yielding to the runtime may have "side effects", but I still do not understand how. In both cases, since actix-rt is single-threaded, the server is unable to send the HTTP/2 ping/pong. So why don't both requests fail?

Even worse, in my understanding, yielding gives the runtime an opportunity to send an HTTP/2 ping/pong frame, so if either endpoint were to work, it should be the failing/ one.

So I guess something happens in the runtime when the request handler sleeps after being polled for the second time, but I cannot figure out what.

allencamal avatar Apr 12 '22 07:04 allencamal

Like I said, you are trying to figure out the state change from inside your handler. If you really want to know exactly why, you can check the source code of the HTTP/2 dispatcher here: https://github.com/actix/actix-web/blob/de9e41484af51ad47f8a42ff3191f93e565e6875/actix-http/src/h2/dispatcher.rs#L106

fakeshadow avatar Apr 12 '22 07:04 fakeshadow

Thank you very much, I'll look into it!

allencamal avatar Apr 12 '22 07:04 allencamal

Thank you for pointing out the right place to start digging in the code!

After some investigations, I understand that my server never sends an end_of_stream within handle_response. This is because at some point, stream.poll_capacity(cx) returns None, because the inner state of the h2::Send becomes Closed(Error(Io(BrokenPipe, None))).

I still haven't figured out why this is happening in one case and not the other, but this raises another question, which is hopefully a bit more straightforward:
Why is having "no capacity left" not treated as an error, if we couldn't send the whole chunk?

allencamal avatar Apr 12 '22 15:04 allencamal
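For readers following along: the general shape of an HTTP/2 send loop over h2's SendStream looks roughly like the sketch below (a paraphrase, not actix's actual dispatcher code; names come from the h2 crate's public API). It illustrates why "no capacity right now" is a normal wait state under flow control, while a stream that has been reset or closed surfaces as `poll_capacity` returning `None` or an error:

```rust
use bytes::Bytes;
use std::task::{Context, Poll};

/// Sketch of a flow-controlled HTTP/2 send loop. Window capacity is granted
/// incrementally by the peer, so waiting for capacity is expected; only a
/// closed/reset stream ends the loop early.
fn send_chunk(
    stream: &mut h2::SendStream<Bytes>,
    cx: &mut Context<'_>,
    mut chunk: Bytes,
    eof: bool,
) -> Poll<Result<(), h2::Error>> {
    while !chunk.is_empty() {
        // Ask flow control for room for the rest of the chunk...
        stream.reserve_capacity(chunk.len());
        // ...and wait until some of it is granted.
        match stream.poll_capacity(cx) {
            Poll::Pending => return Poll::Pending,
            // Stream was reset/closed (e.g. the BrokenPipe observed above):
            // there is nothing left to send, so the loop just stops.
            Poll::Ready(None) => return Poll::Ready(Ok(())),
            Poll::Ready(Some(Err(e))) => return Poll::Ready(Err(e)),
            Poll::Ready(Some(Ok(cap))) => {
                let len = cap.min(chunk.len());
                let part = chunk.split_to(len);
                if let Err(e) = stream.send_data(part, eof && chunk.is_empty()) {
                    return Poll::Ready(Err(e));
                }
            }
        }
    }
    Poll::Ready(Ok(()))
}
```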