
A Stream object in a pending state won't be dropped after the client disconnects

Open masnagam opened this issue 4 years ago • 52 comments

I'm not sure if it was intended, but a Stream object in a pending state won't be dropped after the client disconnects.

I found this issue in actix-web 2.0.0 and created a small test program like below: https://github.com/masnagam/rust-case-studies/blob/master/actix-web-streaming-and-tcp-disconnect/src/main.rs

When executing the following command:

timeout 3 curl -v http://localhost:3000/pending

we never see PendingStream: Dropped in the log messages from the test program.

If a stream returns a Poll::Ready, it will be dropped shortly. We can see that by executing:

timeout 3 curl -v http://localhost:3000/alternate

Is there any workaround to stop polling the pending stream shortly after the client disconnects?
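For reference, here is a minimal sketch approximating the /pending handler in the linked test program (the names and details are assumptions based on the log messages quoted above, not the exact code):

```rust
use std::pin::Pin;
use std::task::{Context, Poll};

use actix_web::HttpResponse;
use bytes::Bytes;
use futures::Stream;

// A stream that never yields data and logs when it is polled and when it is dropped.
struct PendingStream;

impl Stream for PendingStream {
    type Item = Result<Bytes, actix_web::Error>;

    fn poll_next(self: Pin<&mut Self>, _cx: &mut Context<'_>) -> Poll<Option<Self::Item>> {
        println!("PendingStream: Pending");
        Poll::Pending
    }
}

impl Drop for PendingStream {
    fn drop(&mut self) {
        println!("PendingStream: Dropped");
    }
}

// Handler for GET /pending: the response body never produces data.
async fn pending() -> HttpResponse {
    HttpResponse::Ok().streaming(PendingStream)
}
```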

masnagam avatar Jan 23 '20 07:01 masnagam

In my understanding (I'm still new to Rust), if you return Pending, your future/stream will never be scheduled again if not awoken by any trigger. See: https://doc.rust-lang.org/std/task/enum.Poll.html#variant.Pending

I think this will also prevent the stream from being closed. I had a similar issue, and I never return Pending from the stream unless I'm waiting on another future/stream.

niveau0 avatar Jan 23 '20 13:01 niveau0

I've run this code and it does print out multiple "PendingStream: Pending" messages after one request so it seems it is getting awoken.

robjtede avatar Jan 23 '20 14:01 robjtede

@niveau0

As @robjtede wrote, poll_next() is called periodically even after the client disconnects. However, it won't be dropped until it returns a Poll::Ready.

One of the possible causes is that actix-web calls write_xxx() only when receiving data from the Stream object. And calling write_xxx() is the only way to detect the TCP disconnection in tokio.

your future/stream will never be scheduled again if not awoken by any trigger.

In my understanding, it depends on how the streaming function is implemented. For example, we might be able to implement it like this:

// Below is pseudo-code

struct StreamingResponse<SO, ST> {
    socket: SO,
    stream: ST,
}

impl<SO, ST> Future for StreamingResponse<SO, ST>
where
    SO: AsyncWrite,
    ST: Stream,
{
    // Output type assumed for the sketch
    type Output = std::io::Result<()>;

    fn poll(self: Pin<&mut Self>, cx: &mut Context) -> Poll<Self::Output> {
        // Implement the following sequence with a state machine:
        //
        // (1) poll self.socket.poll_write() with an empty buffer
        //     in order to detect the TCP disconnection
        // (2) poll self.stream.poll_next() in order to receive data
        // (3) poll self.socket.poll_write() in order to write the received data
        //
        // I'm not sure whether (1) is possible...
    }
}

masnagam avatar Jan 23 '20 14:01 masnagam

@robjtede ok, thanks for clarifying. I had no time to test the code; I just had a similar issue, and the documentation says

Poll::Pending means that this stream's next value is not ready yet. Implementations will ensure that the current task will be notified when the next value may be ready.

What I really do not understand here is why the executor ever calls poll_next() again after it returned Pending (independent of the client's disconnect). As far as my understanding goes, the notification must be done by some sort of waker (passed within the context). Here is an example: https://rust-lang.github.io/async-book/02_execution/03_wakeups.html An extra thread is used to call the waker to wake up the task.

So I'm out for now, first need to investigate how the executor really works :-)

niveau0 avatar Jan 23 '20 15:01 niveau0

Prepared a reproduction environment: https://github.com/masnagam/rust-case-studies/tree/master/actix-web-streaming-and-tcp-disconnect

It supports VS Code Remote Containers.

masnagam avatar Jan 26 '20 06:01 masnagam

Investigated the periodic PendingStream::poll() calls.

The root cause of the periodic PendingStream::poll() calls is a timer for the keep-alive check which is enabled by default. See: https://github.com/actix/actix-web/blob/master/actix-http/src/config.rs#L263

We can disable it by setting keep_alive(0). In this case, the periodic PendingStream::poll() calls stop.
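For example, a minimal sketch assuming actix-web 2.x/3.x, where a keep-alive value of 0 maps to KeepAlive::Disabled (server setup and route are illustrative):

```rust
use actix_web::{web, App, HttpResponse, HttpServer};

async fn index() -> HttpResponse {
    HttpResponse::Ok().body("ok")
}

#[actix_rt::main]
async fn main() -> std::io::Result<()> {
    HttpServer::new(|| App::new().route("/", web::get().to(index)))
        // A value of 0 disables the keep-alive timer, which stops the
        // periodic PendingStream::poll() calls described above.
        .keep_alive(0)
        .bind("127.0.0.1:3000")?
        .run()
        .await
}
```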

masnagam avatar Jan 26 '20 13:01 masnagam

Investigated how actix-web detects client disconnect.

When the client disconnects, poll_write() in InnerDispatcher::poll_flush() returns Poll::Ready(Ok(0)).

When the write_buf is empty, InnerDispatcher::poll_flush() returns Ok(false) immediately. This is why actix-web cannot detect the client disconnect while the stream is pending.

We may detect the client disconnect quickly if we can check it like below:

impl<T, S, B, X, U> Future for Dispatcher<T, S, B, X, U>
{
    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Self::Output> {
        ...
                    loop {
                        ...
                        if inner.check_disconnect() {
                            // client disconnected
                            return Err(...);
                        }

                        if inner.poll_flush(cx)? || !drain {
                            break;
                        }
                    }
    }
}

masnagam avatar Jan 26 '20 14:01 masnagam

I concluded that it's difficult to detect client disconnects while streams are pending.

tokio's TcpStream seems to be based on Rust's TcpStream which has a traditional socket interface. TcpStream's read/write function call with an empty buffer returns Ok(0). So, it's impossible to distinguish between client disconnect and success.

Issue#29 in the tokio project is a similar topic.

I checked the behavior of tokio::io::PollEvented::poll_write_ready() on Linux. I analyzed my test program using the VS Code debugger and found that this function doesn't detect the client disconnect. tokio::net::TcpStream::poll_write_priv() reached mio::net::TcpStream::write(), which detected the client disconnect.

A read/write with a non-empty buffer can detect the client disconnect, but that breaks HTTP request pipelining and the HTTP response.

There might be a platform-specific workaround, but there are no useful functions in the crates that actix-web uses at this moment.

Finally, I decided to feed some data at short intervals from my stream implementation in order to detect client disconnect quickly.
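A rough sketch of that workaround (assuming actix-web 4 with tokio 1.x; KeepAliveStream and the placeholder chunk are illustrative, not the actual implementation): while no real data is ready, the stream yields a small no-op chunk on a timer so the dispatcher performs a socket write and can notice the disconnect.

```rust
use std::pin::Pin;
use std::task::{Context, Poll};
use std::time::Duration;

use bytes::Bytes;
use futures::Stream;

struct KeepAliveStream {
    interval: tokio::time::Interval,
}

impl KeepAliveStream {
    fn new() -> Self {
        Self {
            interval: tokio::time::interval(Duration::from_secs(1)),
        }
    }
}

impl Stream for KeepAliveStream {
    type Item = Result<Bytes, std::io::Error>;

    fn poll_next(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Option<Self::Item>> {
        let this = self.get_mut();
        // A real implementation would poll the actual data source first and
        // only fall back to the timer branch while that source is pending.
        match this.interval.poll_tick(cx) {
            // The chunk must be non-empty to force a write; use a no-op value
            // appropriate for your format (e.g. an MPEG-TS NULL packet).
            Poll::Ready(_) => Poll::Ready(Some(Ok(Bytes::from_static(b"\0")))),
            Poll::Pending => Poll::Pending,
        }
    }
}
```

Such a stream can then be passed to HttpResponse::Ok().streaming(...) like any other body stream.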

masnagam avatar Jan 27 '20 02:01 masnagam

@robjtede You can close this issue if you don't have any additional investigation.

masnagam avatar Jan 27 '20 02:01 masnagam

I haven't investigated further. Great work finding out. It's comforting that the Keep-Alive checks were the cause of the polls.

Are we considering this correct behavior?

robjtede avatar Jan 27 '20 10:01 robjtede

If actix-web writes response data from a stream object for a request whose client has disconnected, I think that's incorrect behavior, because the response data will be garbage for subsequent pipelined requests on the keep-alive socket. I haven't confirmed it, but actix-web might write garbage response data...

If actix-web wakes up tasks which process I/O on non-keep-alive sockets, it's not correct behavior. My test program cannot test that. So, we need to make another test program for that.

masnagam avatar Jan 27 '20 10:01 masnagam

In an old version of tokio, it was possible to observe a TCP disconnect by using TcpStream::poll_read_ready(), but it has already been removed.

  • https://github.com/tokio-rs/tokio/issues/483#issuecomment-409989018
  • https://github.com/tokio-rs/tokio/issues/1182
  • https://github.com/vi/websocat/commit/d7b0db2f8d12e6aa21dc5f1e3c6a0f04b905dd25

This function was deleted at https://github.com/tokio-rs/tokio/commit/6d8cc4e4755abbd0baac9abf154837b9be011a07.

Probably, we can see the reason for the deletion in a discussion on gitter: https://github.com/tokio-rs/tokio/pull/1392#issuecomment-519533786

masnagam avatar Feb 09 '20 04:02 masnagam

Posted a question about this issue: tokio-rs/tokio/issues/2228

masnagam avatar Feb 09 '20 05:02 masnagam

Thanks for all the investigation. I've run into this as well, and ideally the solution would be similar to what Go does, i.e. wait for the read stream to close (i.e. using TcpStream::poll_read_ready()). I'm not sure how many people care about it, but right now it's kind of a differentiator between Go's net/http and the Tokio-based ecosystem: if knowledge about a disconnect is needed, like for a presence system, Tokio-based libraries (including actix-web) are not the best choice.

drogus avatar Aug 05 '20 19:08 drogus

I've created tokio-rs/tokio#2743 which restores TcpStream::{poll_read_ready, poll_write_ready}.

masnagam avatar Aug 07 '20 07:08 masnagam

It's encouraging they have acknowledged that issue and PR but I'd rather not wait for a Tokio change since this is the last issue on the v3 blocker list.

Can we say that for now perpetually pending streams are not allowed and remove this as a v3 blocker? It can be fixed a little later. What do you think @masnagam ?

robjtede avatar Aug 08 '20 11:08 robjtede

@robjtede Yes, you can. That's acceptable at least to me.

masnagam avatar Aug 08 '20 11:08 masnagam

Note that the relevant Tokio PR has been merged and methods are available.

https://docs.rs/tokio/1.0.1/tokio/net/struct.TcpStream.html#method.poll_read_ready

robjtede avatar Dec 31 '20 15:12 robjtede

The h1 dispatcher polls your stream and gives you the waker. It's your job to take that waker and properly wake up your pending stream; that is the right way to implement a stream. If you don't do any sort of bookkeeping and expect the dispatcher to wake itself up, then you are using Rust's async model wrong.

But anyway, what you want to look into is PollResponse::DoNothing. Your perpetually pending state would trigger this and could lead to a pending state of the dispatcher. But beware, this could be a breaking change, as there are people who do implement streams the right way by storing their waker and waking up the stream properly.
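For illustration, a minimal sketch of the bookkeeping described above (the Shared/WakingStream names and the producer setup are assumptions, not actix-web API): the stream stores the waker it was given, and the producer wakes it when data arrives.

```rust
use std::pin::Pin;
use std::sync::{Arc, Mutex};
use std::task::{Context, Poll, Waker};

use bytes::Bytes;
use futures::Stream;

// State shared between a producer task and the response stream.
struct Shared {
    chunk: Option<Bytes>,
    waker: Option<Waker>,
}

struct WakingStream {
    shared: Arc<Mutex<Shared>>,
}

impl Stream for WakingStream {
    type Item = Result<Bytes, actix_web::Error>;

    fn poll_next(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Option<Self::Item>> {
        let mut shared = self.shared.lock().unwrap();
        match shared.chunk.take() {
            Some(chunk) => Poll::Ready(Some(Ok(chunk))),
            None => {
                // Store the waker so the producer can wake this task when
                // the next chunk becomes available.
                shared.waker = Some(cx.waker().clone());
                Poll::Pending
            }
        }
    }
}

// Producer side, e.g. on another task/thread:
//     let mut shared = shared.lock().unwrap();
//     shared.chunk = Some(data);
//     if let Some(waker) = shared.waker.take() {
//         waker.wake();
//     }
```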

fakeshadow avatar Jan 29 '21 06:01 fakeshadow

@fakeshadow Your comment above is correct, but you misunderstand the issue I reported. I knew what you pointed out, so I also reported the behavior of timeout 3 curl -v http://localhost:3000/alternate.

The issue I reported is that actix-web has no internal task to detect the client disconnect and cannot detect it until reading or writing data from/to the socket. That means the socket cannot be closed immediately while the stream is pending when the application executes a heavy task to generate data for the stream.

Currently, there is no way to detect the client disconnect in the application, because the application cannot access the socket directly. So, I think that actix-web should have a mechanism to detect the client disconnect.

Tokio restored TcpStream::{poll_read_ready, poll_write_ready}. So, it's possible to implement such a mechanism in actix-web without reading/writing anything from/to the socket.

masnagam avatar Jan 29 '21 08:01 masnagam

You should not do heavy tasks in streams. It's a stream and not a single-poll future for a reason. If it's a computation- or blocking-heavy task, you cannot cancel it properly even if you know the stream is gone between polls. And if you group too much data into each poll, you had better make it smaller. Stopping the stream on error is a basic pattern.

It is also worth noticing that actix-web is in general single-threaded. A heavy stream would end up blocking your whole thread, as the dispatcher would be blocked on your stream's poll_next. Therefore you would not know the client is gone, because anything that tries to poll the dispatcher has to wait.

fakeshadow avatar Jan 29 '21 08:01 fakeshadow

Thank you for your detailed explanation.

If I understand correctly, actix-web cannot call TcpStream::{poll_read_ready, poll_write_ready} while the stream is pending, because the thread for the streaming response is blocked on the stream's poll_next.

Is my understanding correct?

I had misunderstood; I thought actix-web could wait for stream data without blocking the thread for the response.

masnagam avatar Jan 29 '21 09:01 masnagam

You can. It can happen after a poll_next has returned Pending.

It's poll_next itself that is blocking, hence the reason you don't want to put heavy tasks inside it.

You want to yield quickly with small chunks of data in a payload stream (32 KB is the max yield point on actix-web's part) and write them to the socket stream. If it fails, just fail. You don't lose many resources.

fakeshadow avatar Jan 29 '21 09:01 fakeshadow

There seems to be some miscommunication between us, so I'd like to use a diagram to confirm my understanding.

+------------------------------+                          +--------------------+
| Thread-1                     |                          | Thread-2           |
| +------------------------+   |                          | +----------------+ |
| | heavy task to generate ------- Tokio's mpsc-channel ----> actix-web      | |
| | stream data            |   |   providing Bytes        | | stream handler | |
| +------------------------+   |                          | +----------------+ |
+------------------------------+                          +--------------------+

The heavy task is performed on Thread-1 and it generates stream data and then sends it to actix-web running on Thread-2 through the mpsc channel. actix-web reads stream data from the mpsc channel and no heavy task is executed when poll_next is called on the mpsc channel. In this case, is Thread-2 blocked when the poll_next returns Pending?

If the answer is yes, actix-web cannot call TcpStream::{poll_read_ready, poll_write_ready} for detecting the client disconnect.

If no, it may be possible to detect the client disconnect using TcpStream::{poll_read_ready, poll_write_ready} on Thread-2.

We can cancel the heavy task when the mpsc-channel is closed or a message is received through another oneshot channel.
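A sketch of the channel pattern in the diagram above (assuming actix-web 4, tokio 1, and the tokio-stream crate; the handler and producer are illustrative): the producer stops as soon as the receiver, and hence the response stream, is dropped.

```rust
use std::time::Duration;

use actix_web::HttpResponse;
use bytes::Bytes;
use tokio::sync::mpsc;
use tokio_stream::wrappers::ReceiverStream;

async fn streaming_handler() -> HttpResponse {
    let (tx, rx) = mpsc::channel::<Result<Bytes, std::io::Error>>(16);

    // "Thread-1": the heavy/slow producer. If the response stream is dropped,
    // the receiver is dropped too, tx.send() fails, and the loop exits.
    tokio::spawn(async move {
        loop {
            let chunk = Bytes::from_static(b"data\n"); // produce data somehow
            if tx.send(Ok(chunk)).await.is_err() {
                break; // client gone: cancel the work
            }
            tokio::time::sleep(Duration::from_secs(1)).await;
        }
    });

    // "Thread-2": actix-web polls the receiver as the response body stream.
    HttpResponse::Ok().streaming(ReceiverStream::new(rx))
}
```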

I faced this issue when implementing a personal video recorder (PVR) system which has a REST API to stream data of a TV program which starts at a specific time. A client which wants to record the TV program has to call the REST API before the start time of the TV program, but the streaming doesn't start until the start time. If the client disconnects for some reason, the socket is never closed until the start time.

My usage may be a rare case, but this was a serious issue in my application. Eventually, I introduced a timer like this before sending any data including the HTTP status line and headers.

A similar issue might occur when someone implements a proxy server using actix-web and the proxy server waits a long time for a response from a remote server.

masnagam avatar Jan 29 '21 12:01 masnagam

@masnagam your use case is not that rare. I also have a scenario where I spawn a process with async-process to do some work, and I would love to have it killed if the client has already dropped the connection and doesn't care about the result the process is about to produce. Currently the process has to finish and start writing bytes to the response stream, and only then is there an error and processing stops.

nazar-pc avatar Jan 29 '21 12:01 nazar-pc

I already said this multiple times: you can call poll_read_ready after the payload stream is pending. I feel I don't have to repeat myself.

What I'm saying is that you have to know whether it's actually going to make a difference for you. Let's say your thread-1 is doing its work. Would you be able to cancel it when the dispatcher is woken up and actively drops the payload?

You use a channel to shift the work to another thread, and you need a way to cancel or stop it when the client disconnect happens and your payload stream is dropped. Heavy computations and blocking file I/O cannot be canceled until you have finished one call completely. And if you are actively reading I/O (async or not), it's also a blocking operation until you get interrupted / would block / the buffer is filled / an error occurs. So if you group too much work into one poll of the stream, it will always end up badly no matter where you do it.

Ultimately you don't pay much of a price for a failed write to a closed socket if you have a heavy stream. You pay the price by making it heavy.

fakeshadow avatar Jan 29 '21 12:01 fakeshadow

@fakeshadow Is there any API that returns the TcpStream? The application needs it for calling poll_read_ready.

I downloaded the source and tried to look for such an API, but I couldn't find it. Would you tell me the name of a method that returns the TcpStream (or an object which can be used for calling poll_read_ready)?

It seems that we can get tokio::net::TcpStream inside a callback function called from on_connect. But, probably, we cannot copy it outside the callback.

Would you be able to cancel it when the dispatcher is woken up and actively drops the payload?

If the heavy task is performed in an external process, we can simply kill it. Otherwise, it's impossible until the heavy task yields the CPU voluntarily for the cancellation (e.g., by periodically checking (polling) the connectivity of the mpsc channel in the heavy task).

I used "heavy task" in my explanation, but that may be misleading. Actually, in my use case, there is no heavy task blocking threads; actix-web just waits for data from an external process providing stream data. It may be the same situation as @nazar-pc's.

masnagam avatar Jan 29 '21 14:01 masnagam

@fakeshadow Is there any API that returns the TcpStream? The application needs it for calling poll_read_ready.

Sorry maybe I'm not being clear enough. A change to actix_http::h1::dispatcher is needed. Mainly related to this line https://github.com/actix/actix-web/blob/51e54dac8b18ea0ac0b1d55f6a9d8af903e0e811/actix-http/src/h1/dispatcher.rs#L428

You can do that without touching this code, but it would be more than hard to do.

In general I'm not against the change, just stating that you may not gain anything by making it. A pending stream is not doing anything or costing you anything if your pending logic would end it eventually. It could also add to the overall overhead of the h1 dispatcher in general, as we would have one more point of waking up the whole future.

fakeshadow avatar Jan 29 '21 14:01 fakeshadow

I have no blocking tasks; for me it would be enough if the future that is handling the request were just dropped, which would cause a chain reaction and stop the background process and everything related to it.

Currently I'm not aware of a way to stop request handling until attempting to write the response.

nazar-pc avatar Jan 29 '21 14:01 nazar-pc

@fakeshadow Thank you for your very quick reply! And, sorry for my late response.. (I should study English harder)

I understand the current situation. You're right: the current implementation of actix-web is complex enough, so introducing new complexity is not a good idea from a maintainability point of view. Before the change, you might have to refactor/simplify the code if possible. Rust's async API is pull-style, so implementing this is a bit more difficult than with a push-style API like Node.js's.

It might be a difficult task for applications to check the connectivity of sockets periodically even if actix-web provided an API to return the TcpStream related to a request/response.

So, it's acceptable at least for me to close this issue as a limitation.

FYI: this issue might cause resource starvation, because:

  • A socket is never closed until stream data is fed
    • The disconnected client may repeatedly retry connecting at short intervals, and that may consume all sockets (pooled worker threads)
  • A resource used for generating stream data is never freed until the payload is dropped
    • And the drop of the payload is triggered when the client disconnect is detected
    • This is my case: when a client switches TV channels at short intervals, the server fails to open a tuner device because there is no free tuner device

@nazar-pc Yes. Currently, it's impossible because of the lack of direct access to the TcpStream, but there are some workarounds.

If you can divide the stream data from the external process into smaller chunks, or generate no-op data such as NULL packets in an MPEG-TS stream, you can detect the client disconnect quickly.

If I remember correctly, if you can postpone returning a responder like actix_web::HttpResponse::Ok() from your request handler until the stream is ready, it may also be possible to detect the client disconnect, because actix-web periodically checks connectivity.

If it's difficult to adopt any workaround and it's a critical issue in your application, you need to ask the actix-web team to implement this feature.

masnagam avatar Jan 29 '21 15:01 masnagam

And, sorry for my late response.. (I should study English harder)

Don't be sorry. I'm not a native English speaker either so we are in the same boat.

The current implementation of actix-web is complex enough, so introducing new complexity is not a good idea from a maintainability point of view.

Adding a notifier for poll_read/write_ready would not add too much overall complexity. It's more the nature of a giant future that makes me worry. Adding more points at which to wake up a giant future could have some overhead (I'm not sure about this, so I could be wrong).

So, it's acceptable at least for me to close this issue as a limitation.

This issue is being worked on and is part of the actix-web v4 milestone, so please keep it open. I'm just offering my personal view, and it doesn't interfere with the process of this issue.

fakeshadow avatar Jan 29 '21 15:01 fakeshadow

What do you think about calling poll_read_ready at some interval here? https://github.com/actix/actix-web/blob/51e54dac8b18ea0ac0b1d55f6a9d8af903e0e811/actix-http/src/h1/dispatcher.rs#L428

That requires a new waker for checking the TCP connectivity periodically. I think this may be a light task, but it introduces more complexity than adding the notifier.

Basically, we would need a similar implementation in the application layer if actix-web defined an API to provide the TcpStream for the poll_xxx_ready calls. So, I think it may be better to implement it inside actix-web if possible.

Ah, the new waker may kick a giant future... I understand your worry.

masnagam avatar Jan 29 '21 16:01 masnagam

actix-http is generic over the AsyncRead/AsyncWrite traits, which do not include poll_read/write_ready. So an additional trait has to be introduced first for the Io type and anchored on HttpService-like types. It would need some work on the actix-net related crates before it can be added (as there are TCP/Unix/openssl/rustls/native-tls streams that can be handled by HttpService).

The place to add the notifier can be figured out later, after that. When PollResponse::DoNothing is returned, there could be a better place where it can be added. We don't have to, and don't want to, call it at an interval.
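Roughly, the kind of extension trait described here might look like the following sketch (names are purely illustrative and not actix-net's actual API):

```rust
use std::task::{Context, Poll};

// A hypothetical capability trait for Io types that can report socket
// readiness without performing a read or write.
trait PollReady {
    fn poll_read_ready(&self, cx: &mut Context<'_>) -> Poll<std::io::Result<()>>;
    fn poll_write_ready(&self, cx: &mut Context<'_>) -> Poll<std::io::Result<()>>;
}
```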

fakeshadow avatar Jan 29 '21 16:01 fakeshadow

Probably, a variable holding a socket like tokio::net::TcpStream for the poll_read/write_ready call would live only inside a callback function called by the notifier. The application needs to check the connectivity periodically in order to detect the disconnection within a certain time while the stream is pending. However, the application may not be able to copy the socket for later connectivity checks. Is it possible for actix-web to call the callback repeatedly while the stream is pending?

If PollResponse::DoNothing triggers the notification, actix-web needs to return PollResponse::DoNothing repeatedly, which means poll_response is called repeatedly, and that may kick a giant future...

That is just my guess, so I may be wrong.

It might be necessary to add a separate method, instead of poll_response, that triggers the notification.

masnagam avatar Jan 29 '21 16:01 masnagam

https://github.com/actix/actix-web/pull/2032

Some help with testing this would be appreciated.

fakeshadow avatar Feb 27 '21 14:02 fakeshadow

On my local machine, poll_read/write_ready does not actually return an error when the remote peer is gone. It always returns Ready(()), so you cannot really detect a disconnect with it.

I had to actually check the READ_DISCONNECT flag (it's enabled when the stream read returns 0 bytes read, indicating the stream has ended) to detect the disconnect. This is problematic in its own way, as you have to read everything from the stream to actually see that it has ended. (You do not want to do this, as it could blow up your memory usage and be an attack vector.)

Everything in this issue is based on the assumption that poll_ready can detect a socket disconnect, but it seems that's not the case.

fakeshadow avatar Feb 27 '21 15:02 fakeshadow

@fakeshadow Thank you for your support.

I tested using my test program with the reproduction steps described in the first comment, and found that the same issue still occurs.

I changed Cargo.toml like below:

diff --git a/actix-web-streaming-and-tcp-disconnect/Cargo.toml b/actix-web-streaming-and-tcp-disconnect/Cargo.toml
index 14abc93..e69b0dc 100644
--- a/actix-web-streaming-and-tcp-disconnect/Cargo.toml
+++ b/actix-web-streaming-and-tcp-disconnect/Cargo.toml
@@ -7,6 +7,15 @@ edition = "2018"
 # See more keys and their definitions at https://doc.rust-lang.org/cargo/reference/manifest.html

 [dependencies]
-actix-rt = "1.0"
-actix-web = "2.0"
+actix-rt = "2.1"
+actix-web = { git = "https://github.com/actix/actix-web.git", branch = "feature/h1_pending_timer" }
 futures = "0.3"
+
+[patch.crates-io]
+actix-http = { git = "https://github.com/actix/actix-web.git", branch = "feature/h1_pending_timer", path = "actix-http" }
+actix-http-test = { git = "https://github.com/actix/actix-web.git", branch = "feature/h1_pending_timer", path = "actix-http-test" }
+actix-web-actors = { git = "https://github.com/actix/actix-web.git", branch = "feature/h1_pending_timer", path = "actix-web-actors" }
+actix-web-codegen = { git = "https://github.com/actix/actix-web.git", branch = "feature/h1_pending_timer", path = "actix-web-codegen" }
+actix-multipart = { git = "https://github.com/actix/actix-web.git", branch = "feature/h1_pending_timer", path = "actix-multipart" }
+actix-files = { git = "https://github.com/actix/actix-web.git", branch = "feature/h1_pending_timer", path = "actix-files" }
+awc = { git = "https://github.com/actix/actix-web.git", branch = "feature/h1_pending_timer", path = "awc" }

Situations like my test program might be rare, but the test result means that the client disconnect cannot be detected until the stream is ready.

masnagam avatar Feb 28 '21 01:02 masnagam

Yes. With your example, the pending stream should be polled at a one-second interval. This is happening because poll_read/write_ready always returns Poll::Ready(Ok(())) even after the client is gone, so everything is polled once, including your pending stream, since poll_ready indicates the stream is alive and ready to be read/written. It does not detect the client disconnect.

fakeshadow avatar Feb 28 '21 12:02 fakeshadow

@fakeshadow Thank you for explanation. I got it.

masnagam avatar Mar 01 '21 06:03 masnagam

https://github.com/tokio-rs/tokio/blob/58bd242831ac1e33dbb899b42a9340a5e4162b7b/tokio/src/net/tcp/stream.rs#L413

This method returns a Ready value whose bits we can inspect to see a detailed state.

That said, it would be very hard to use in actix-web in its current state. It's an async method that borrows the stream type, making it non-'static, so it cannot be safely used in a poll-based future type. Unsafe code would be needed to let actix-web call this method, and I have no idea whether the Ready state detects a disconnect or not.

fakeshadow avatar Mar 16 '21 03:03 fakeshadow

@masnagam

[dependencies]
actix-web = { git = "https://github.com/actix/actix-web.git", default-features = false, branch = "feature/h1_pending_timer" }

[patch.crates-io]
actix-http = { git = "https://github.com/actix/actix-web.git", branch = "feature/h1_pending_timer" }
awc = { git = "https://github.com/actix/actix-web.git", branch = "feature/h1_pending_timer" }
actix-rt = { git = "https://github.com/actix/actix-net.git", branch = "feat/net_poll_ready" }
tokio = { git = "https://github.com/fakeshadow/tokio", branch = "feature/net_poll_ready" }

With these forks I'm able to detect the disconnection with your PendingStream example.

fakeshadow avatar Mar 16 '21 05:03 fakeshadow

@fakeshadow I've confirmed that, too. Thank you!

masnagam avatar Mar 17 '21 05:03 masnagam

As a caution to anyone still using Actix v3, this problem may come up if you are using WebSockets. I had a very nasty bug come up where my server has a cap on the number of connections for a particular endpoint and the connections were being exhausted. I checked my netstat output and couldn't find living connections, but the behavior of the server was clear: the connection cap was being reached (I have it set up to return HTTP 503 if the connection cap is reached). I eventually traced the problem to the web::Payload stream not terminating if a client closes a TCP connection abnormally, e.g. in the event of a program crash:

#[get("/the-endpoint")]
async fn endpoint(
    connection_tracker: web::Data<ConnectionTracker>,
    http_request: HttpRequest,
    mut http_stream: web::Payload,
)
    -> Result<HttpResponse, actix_web::Error>
{
    /* Websocket and connection setup boilerplate */

    // THE INTERESTING BIT
    tokio::task::spawn_local(async move {
        debug!("Stream reader opening");

        let mut codec = actix_http::ws::Codec::new();
        let mut buf = BytesMut::with_capacity(65536);

        while let Some(chunk_or_err) = http_stream.next().await {
            let chunk = match chunk_or_err {
                Ok(c) => c,
                Err(_) => break,
            };
            buf.put(chunk);
        
            match codec.decode(&mut buf) {
                Ok(maybe_frame) => {
                    if let Some(frame) = maybe_frame {
                        if let Err(_) = incoming_mesgs_tx.send(Ok(frame)).await {
                            break;
                        }
                    }
                },
                Err(protocol_err) => {
                    if let Err(_) = incoming_mesgs_tx.send(Err(protocol_err)).await {
                        break;
                    }
                },
            }
        }

        // I noticed these were not showing up consistently in the logs
        // Ergo the stream was never returning None
        debug!("Stream reader closing");
    });

    /* More response and setup boilerplate */
}

Caveats

I am using WebSockets to bridge into a Tokio application, so I'm not using the actix actor framework and the associated actix-web-actors system for implementing WebSockets. However, I see no reason why the actix-web-actors implementation wouldn't also be affected. After reviewing the web actors code, I saw that it simply wraps web::Payload and iterates on it just like my Tokio code. And since the problem is coming directly from the web::Payload stream, actor-based WebSockets would probably also be afflicted.

Workaround

With WebSockets, I found this is fairly easy to work around, albeit imperfectly. After reading this thread, I figured I just needed to get the server to trigger a socket read or write semi-regularly in case a client disconnects abnormally.

I set up the server to emit WebSocket pings once a second for each client, and this has eliminated the problem. Probably just good practice to do this anyway 😄
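For the actor-based setup, the same workaround would roughly be the standard heartbeat pattern (a sketch assuming actix-web-actors 4.x; the MyWs actor is illustrative):

```rust
use std::time::Duration;

use actix::prelude::*;
use actix_web_actors::ws;

struct MyWs;

impl Actor for MyWs {
    type Context = ws::WebsocketContext<Self>;

    // Ping every second; the periodic socket write lets the server notice an
    // abnormal client disconnect instead of waiting forever on the payload.
    fn started(&mut self, ctx: &mut Self::Context) {
        ctx.run_interval(Duration::from_secs(1), |_act, ctx| {
            ctx.ping(b"");
        });
    }
}

impl StreamHandler<Result<ws::Message, ws::ProtocolError>> for MyWs {
    fn handle(&mut self, msg: Result<ws::Message, ws::ProtocolError>, ctx: &mut Self::Context) {
        match msg {
            // Answer client pings; pongs arriving here confirm the client is alive.
            Ok(ws::Message::Ping(bytes)) => ctx.pong(&bytes),
            Ok(ws::Message::Pong(_)) => {}
            _ => {}
        }
    }
}
```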

kmBlaine avatar Sep 02 '21 14:09 kmBlaine