
WebSocket actor can be paralyzed due to lack of IO timeout, leading to a connection leak.

Open finnbear opened this issue 3 years ago • 6 comments

Expected Behavior

An IntervalFunc spawned in a WebSocket Actor will keep running, regardless of how much I/O is pending. That way, it can send pings and close the connection if a pong is not received.
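For illustration, here is a minimal sketch of the heartbeat pattern I have in mind (the struct name and the 5s/15s values are placeholders):

```rust
use std::time::{Duration, Instant};

use actix::prelude::*;
use actix_web_actors::ws;

struct MyWs {
    last_pong: Instant,
}

impl Actor for MyWs {
    type Context = ws::WebsocketContext<Self>;

    fn started(&mut self, ctx: &mut Self::Context) {
        // Expected: this interval keeps firing no matter how much I/O is pending.
        ctx.run_interval(Duration::from_secs(5), |act, ctx| {
            if act.last_pong.elapsed() > Duration::from_secs(15) {
                // No pong in time: assume the client is gone and close.
                ctx.stop();
                return;
            }
            ctx.ping(b"");
        });
    }
}

impl StreamHandler<Result<ws::Message, ws::ProtocolError>> for MyWs {
    fn handle(&mut self, msg: Result<ws::Message, ws::ProtocolError>, _ctx: &mut Self::Context) {
        if let Ok(ws::Message::Pong(_)) = msg {
            self.last_pong = Instant::now();
        }
    }
}
```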

Current Behavior

If an HTTP connection is severed on the client side without notification of closure, and the WebSocket Actor sends a certain number of messages (text/binary), it becomes paralyzed and unable to run the IntervalFuncs that could shut it down (e.g. when ping/pong takes too long).

~~Possible Problem~~

~~I believe this bug is due to how actix's AsyncContext works. The context keeps track of multiple "items" (which would include an IntervalFunc) but only a single "wait" future. There is logic in place to poll a different "item" every time to avoid a perpetual wait, but there doesn't seem to be logic to poll a different "wait" every time.~~

~~I'm reporting this issue on the actix-web repo, because it seems like AsyncContext assumes that it will only have to track one wait future at a time, and that actix-web-actors should be responsible for working around that limitation.~~

Never mind this theory; I'm not sure why I thought a wait was being added on account of sending WebSocket messages.

Possible Solutions

Use I/O timeouts (as shown here), or user-space timeouts on top of I/O. That way, waiting on a congested connection is bounded.
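As a sketch of the first idea (this is not actix-web's actual internals; the 10s bound and the function name are placeholders), the dispatcher's socket writes could be bounded like this:

```rust
use std::time::Duration;

use tokio::io::AsyncWriteExt;
use tokio::net::TcpStream;
use tokio::time::timeout;

// Bound every write so a congested connection can't stall the task forever.
async fn write_with_timeout(stream: &mut TcpStream, buf: &[u8]) -> std::io::Result<()> {
    match timeout(Duration::from_secs(10), stream.write_all(buf)).await {
        Ok(result) => result,
        // Timed out: treat the peer as gone and let the caller close the connection.
        Err(_elapsed) => Err(std::io::Error::new(
            std::io::ErrorKind::TimedOut,
            "write timed out",
        )),
    }
}
```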

Alternatively, add some way to close the connection if the outbound buffer is full, before all IntervalFuncs are paralyzed.

Alternatively, move this issue to actix and handle multiple items' wait futures.

Not a solution

I don't think it is possible to rely on web_socket_actor.send(message) returning Err(Full). Consider the case of sending a single message to a WebSocket actor that causes that actor to send multiple megabytes to a gone (not closed) client. It won't fill the actor's mailbox, but it could saturate the network adapter. Also, by the time Err(Full) is first returned, it is probably already too late. Finally, this would place a huge burden on the central actor to wait for a response from each WebSocket actor.
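To make the rejected approach concrete (SessionMessage and forward are hypothetical names), it would look something like this:

```rust
use actix::prelude::*;

#[derive(Message)]
#[rtype(result = "()")]
struct SessionMessage(String); // hypothetical message type

fn forward(recipient: &Recipient<SessionMessage>, text: String) {
    match recipient.try_send(SessionMessage(text)) {
        Ok(()) => {}
        // By the time the mailbox reports Full, one earlier message may
        // already have fanned out into megabytes of pending socket writes.
        Err(SendError::Full(_msg)) => { /* probably already too late */ }
        Err(SendError::Closed(_msg)) => {}
    }
}
```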

Steps to Reproduce (for bugs)

  1. Host a WebSocket server at an internet address with a run_interval that pings and closes the connection if it doesn't get a pong. ~~The chat server example should suffice, with a modification to facilitate steps 4 and 5.~~ (See below)
  2. Open a WebSocket from a computer with dis-connectable internet (e.g. a laptop with WiFi)
  3. Disconnect the client's internet (e.g. turn off the WiFi)
  4. Send at least 1MB from the server on the WebSocket in several small messages (I don't know the exact threshold, but the idea is to exhaust the network interface so that further I/O has to wait)
  5. Observe that the interval function no longer runs (you won't see any println!("...") output from it), the WebSocket remains open (forever), and the TCP connection is leaked (forever)

~~A code example is TBD.~~

Update: A code example can be found here. Simply by changing the amount of outbound bandwidth in this line, the IntervalFunc goes from working (printing when it pings/closes, and actually closing the connection) to not working.

Workaround

Make sure the ping/pong interval function closes an unresponsive WebSocket faster than the connection can exhaust the network buffer. With my app, 15s works for this purpose.
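In terms of the heartbeat sketch above, the workaround amounts to picking aggressive enough constants (these values are from my app; tune them for yours):

```rust
use std::time::Duration;

// Ping often, and give up well before the network buffers can fill
// (in my case, paralysis sets in ~40 seconds after the connection is severed).
const HEARTBEAT_INTERVAL: Duration = Duration::from_secs(5);
const CLIENT_TIMEOUT: Duration = Duration::from_secs(15);
```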

Stopping the WebSocket actor if its mailbox becomes full might help in some cases, but I don't think this is foolproof, as discussed in "Not a solution" above.

Context

This bug is causing TCP connections to be leaked on my production server (although the situation is less urgent after applying the workaround). This is because the WebSocket in question is sent a large number of messages by the server (10/s), my users' clients' networks are not totally reliable, and my IntervalFunc doesn't intervene before the network interface is exhausted (paralysis occurs ~40 seconds after the connection is severed, but my ping/pong code only stops the connection after 90 seconds of inactivity).

```toml
actix = "0.12"
actix-codec = "0.4.0"
actix-web = { version = "4.0.0-beta.9", features = ["rustls"] }
actix-web-actors = "4.0.0-beta.7"
```

Your Environment

  • Rust Version (i.e. output of rustc -V): rustc 1.58.0-nightly
  • Actix Web Version: master branch, at the time of writing

finnbear avatar Nov 11 '21 20:11 finnbear

Why do you mention actix's context wait queue? The wait queue will indeed block the actor from doing pretty much any other processing; that's its intention. But where in actix-web-actors do we push items to the wait queue?

thalesfragoso avatar Nov 17 '21 20:11 thalesfragoso

> Why do you mention actix's context wait queue? The wait queue will indeed block the actor from doing pretty much any other processing; that's its intention. But where in actix-web-actors do we push items to the wait queue?

Thanks for pointing out this issue with my original theory. Looking at the code again, I can't find any evidence that a wait was added on account of sending WebSocket messages. I must have been confused by something, because now it seems like the messages are taken on-demand using a stream API.

However, the bug still stands. Somehow, sending too many messages (when the client is no longer reading them) can paralyze the actor so that its IntervalFuncs stop running (my code example demonstrates this). The problem is probably more complicated than my original theory.

finnbear avatar Nov 17 '21 20:11 finnbear

I might have found a clue. The future of the WebSocketContext isn't spawned directly on the runtime like the usual Actor::start; instead, it's returned as a streaming HttpResponse to be polled by a Dispatcher from actix-http. So even though the WebSocketContext is more than just a Stream (e.g. you can spawn futures, wait on futures, etc.), the Dispatcher only treats it as a Stream. The Dispatcher buffers IO and enforces a maximum buffer size, only polling the stream if the buffer isn't full:

https://github.com/actix/actix-web/blob/66620a101211c424ffc7e2f1b1c7bb9e2a78ff87/actix-http/src/h1/dispatcher.rs#L423-L424

This means that a jammed IO will block your actor completely, including ActorFutures spawned on its context. (Currently, payload::MAX_BUFFER_SIZE is 32k.)

h1::Dispatcher does have a keep-alive timer, but it seems it will only go to shutdown if the write_buffer is empty.

thalesfragoso avatar Nov 17 '21 23:11 thalesfragoso

A minimal reproduction example repo would be helpful for determining how to help.

robjtede avatar Nov 18 '21 10:11 robjtede

@finnbear What OS are you using? poll_write on a TcpStream will eventually return an error on a severed connection, but how long that takes will vary with your OS and its configuration.

If you are on Linux, you might try playing with tcp_retries2 to decrease this time and see if the connection is closed after it. From the man pages:

tcp_retries2 (integer; default: 15; since Linux 2.2)
              The maximum number of times a TCP packet is retransmitted
              in established state before giving up.  The default value
              is 15, which corresponds to a duration of approximately
              between 13 to 30 minutes, depending on the retransmission
              timeout.  The RFC 1122 specified minimum limit of 100
              seconds is typically deemed too short.

PS: I don't think the timeout duration scales linearly with this setting; it's probably exponential.

thalesfragoso avatar Nov 18 '21 17:11 thalesfragoso

> A minimal reproduction example repo would be helpful for determining how to help.

Sure, I put one on the ws-mre branch: https://github.com/finnbear/actix-tcp-leak/tree/ws-mre

> What OS are you using?

The server is Debian 11; the computers I test on run Ubuntu ~20.04. On my Ubuntu 20.04.2 laptop, I ran:

```console
$ echo 3 | sudo tee /proc/sys/net/ipv4/tcp_retries2
3
```

According to this source, this corresponds to a ~1.4 second delay before timeout (my previous setting would have taken 13 minutes to time out).

However, my 20 leaked connections are still going strong after ~~3~~ ~~10~~ 75 minutes (and the interval func that should have noticed the problem doesn't print anything, indicating that it is dead).

I will leave the program running for as long as I can and update the above duration with my findings.

finnbear avatar Nov 18 '21 21:11 finnbear

Hello!

I think I'm experiencing the same issue. Are you aware of any workarounds for it?

singerxt avatar Aug 16 '22 19:08 singerxt

> Are you aware of any workarounds for it?

See my original issue. A good-enough workaround is to send pings periodically and terminate the connection quickly enough if pongs are not received. If you are worried about malicious clients, include some random number in your pings and compare it with the echoed data in the pongs.

The theory is that, for ping/pong to work, the network buffers must not have been clogged at any point during the test. If you run this test often enough, you can terminate the connection before it is too late.
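A sketch of that nonce variant, on top of the usual heartbeat (field names, intervals, and the rand dependency are assumptions):

```rust
use std::time::{Duration, Instant};

use actix::prelude::*;
use actix_web_actors::ws;

struct MyWs {
    last_pong: Instant,
    nonce: [u8; 8], // random value the client must echo in its pong
}

impl Actor for MyWs {
    type Context = ws::WebsocketContext<Self>;

    fn started(&mut self, ctx: &mut Self::Context) {
        ctx.run_interval(Duration::from_secs(5), |act, ctx| {
            if act.last_pong.elapsed() > Duration::from_secs(15) {
                ctx.stop();
                return;
            }
            // Send a fresh random nonce with every ping.
            act.nonce = rand::random();
            ctx.ping(&act.nonce);
        });
    }
}

impl StreamHandler<Result<ws::Message, ws::ProtocolError>> for MyWs {
    fn handle(&mut self, msg: Result<ws::Message, ws::ProtocolError>, _ctx: &mut Self::Context) {
        if let Ok(ws::Message::Pong(data)) = msg {
            // Only count pongs that echo the nonce, so a malicious client
            // can't keep the connection alive with unsolicited pongs.
            if data.as_ref() == &self.nonce[..] {
                self.last_pong = Instant::now();
            }
        }
    }
}
```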

Finally, I feel bad saying this here, but you may want to look into axum as an alternative to actix-web. axum's websocket implementation is based on tungstenite, whereas actix-web's seems custom.

finnbear avatar Aug 17 '22 02:08 finnbear

@finnbear Thank you so much, that makes sense to me. I played around with this and I think axum might be a very good alternative.

singerxt avatar Aug 17 '22 14:08 singerxt

Hello!

I made a couple of findings around this issue, and it looks like the global Arc<Mutex<_>> is the source of the problem.

When you change .lock() to .try_lock(), you will notice a bunch of errors where something is trying to access the Mutex while another asynchronous function holds it.

I'm not really convinced that this is an actix issue; refactoring the code to an async-aware Mutex like this or this might be a fix?

singerxt avatar Aug 19 '22 12:08 singerxt

> When you change .lock() to .try_lock(), you will notice a bunch of errors where something is trying to access the Mutex while another asynchronous function holds it.

I don't doubt that there could be some try_lock failures, especially when the actors are starting, but these errors are transient and just indicate contention on the Mutex. They don't indicate to me that the Mutex is at fault. For the Mutex to be at fault, there would need to be a deadlock that prevents the program from making progress. There are two ways a deadlock could occur (see the sketch after this list):

  1. One thread acquires the lock twice (not possible, as no code path attempts this)
  2. Two or more locks are acquired in a cyclic order (not possible, as there is only one Mutex)
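Here is a tiny standalone illustration of the distinction (plain std, nothing actix-specific): try_lock failures under contention are transient and the program still makes progress, which is different from a deadlock.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

fn main() {
    let m = Arc::new(Mutex::new(0u32));
    let m2 = Arc::clone(&m);

    // Hold the lock on the main thread for the spawned thread's lifetime.
    let guard = m.lock().unwrap();
    let handle = thread::spawn(move || {
        // Transient contention: try_lock fails because the main thread
        // currently holds the lock, but nothing is deadlocked.
        assert!(m2.try_lock().is_err());
    });
    handle.join().unwrap();
    drop(guard);

    // Once the lock is released, acquiring it succeeds: progress resumes.
    assert_eq!(*m.lock().unwrap(), 0);
}
```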

If blocking on the lock explained the actors failing to stop, you would expect the first println! to run but not the second:

```rust
fn stopped(&mut self, ctx: &mut Self::Context) {
    println!("A firehose is stopping");
    FIREHOSES.lock().unwrap().remove(&ctx.address().recipient());
    println!("A firehose stopped");
}
```

However, when the data size is set to 10000, neither println! runs and the actors stay running. While running, they don't touch the Mutex except when starting (they all start fine) and stopping (they don't even try to stop). It doesn't matter if the sender blocks, as it runs in its own spawned thread, but I have also verified that it doesn't block for a significant time (it sends to an unbounded queue).

I've updated my code to use the latest version of actix* and the issue is still reproducible.

> I'm not really convinced that this is an actix issue; refactoring the code to an async-aware Mutex like this or this might be a fix?

According to the async Mutex documentation, the main benefit is that you can hold the lock across await points, which my code isn't doing.
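For reference, this is the pattern an async-aware Mutex exists for, and it's exactly what my stopped() handler doesn't do (a sketch; do_io is a placeholder):

```rust
use std::sync::Arc;

use tokio::sync::Mutex;

async fn do_io() { /* placeholder for real async work */ }

// The guard is held across an .await point. tokio's Mutex supports this;
// a std::sync::MutexGuard held here would block the executor thread and
// make the future non-Send.
async fn update_shared(state: Arc<Mutex<Vec<u8>>>) {
    let mut guard = state.lock().await;
    guard.push(1);
    do_io().await;
    guard.push(2);
}
```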

finnbear avatar Aug 19 '22 15:08 finnbear

@finnbear thank you for all your time. I was experiencing different issues, but your case still seems to be a bug.

singerxt avatar Aug 19 '22 20:08 singerxt

> @finnbear thank you for all your time. I was experiencing different issues, but your case still seems to be a bug.

No problem, glad you were able to resolve your issue! :+1:

finnbear avatar Aug 19 '22 21:08 finnbear

Since the actix actor framework is in maintenance mode and the workaround for this issue is clear and reasonable, I'm going to close this.

We are exploring a non-actor alternative to WebSockets to be built into Actix Web in the near future.
For now, the actix-ws crate is a good alternative. Examples of its use can be found here under the *-actorless directories.

robjtede avatar Sep 11 '22 12:09 robjtede