actix-web
WebSocket actor can be paralyzed due to lack of IO timeout, leading to a connection leak.
Expected Behavior
An `IntervalFunc` `spawn`ed in a WebSocket actor will run, regardless of how much I/O is pending. That way, it can do a `ping` and close the connection if a `pong` is not received.
Current Behavior
If an HTTP connection is severed at the client side without notification of closure, and the WebSocket actor sends a certain amount of messages (text/binary), it becomes paralyzed and unable to run the `IntervalFunc`s which could shut it down (if `ping`/`pong` takes too long).
~~Possible Problem~~
~~I believe this bug is due to how `actix`'s `AsyncContext` works. The context keeps track of multiple "items", which would include an `IntervalFunc`, but only a single "wait" future. There is logic in place to poll a different "item" every time to avoid a perpetual wait, but there doesn't seem to be logic to poll a different "wait" every time.~~
~~I'm reporting this issue on the `actix-web` repo, because it seems like `AsyncContext` assumes that it will only have to track one `wait` future at a time, and that `actix-web-actors` should be responsible for working around that limitation.~~
Never mind this theory; I'm not sure why I thought a `wait` was being added on account of sending WebSocket messages.
Possible Solutions
Use I/O timeouts (as shown here), or user-space timeouts on top of I/O. That way, `wait`ing on a congested connection is bounded.
Alternatively, add some way to close the connection if the outbound buffer is full, before all `IntervalFunc`s are paralyzed.
Alternatively, move this issue to `actix` and handle multiple items' wait futures.
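To illustrate the first suggestion, here is a minimal, generic sketch of a user-space timeout on top of I/O (plain tokio, not actix-web's internals; the helper name is made up):

```rust
use std::time::Duration;

use tokio::io::AsyncWriteExt;
use tokio::net::TcpStream;

// Hypothetical helper: wrap every write in a user-space timeout so that a
// connection whose peer has silently gone away can only stall us for `limit`
// before we surface an error and can close it.
async fn write_with_timeout(
    stream: &mut TcpStream,
    data: &[u8],
    limit: Duration,
) -> std::io::Result<()> {
    match tokio::time::timeout(limit, stream.write_all(data)).await {
        Ok(result) => result,
        Err(_elapsed) => Err(std::io::Error::new(
            std::io::ErrorKind::TimedOut,
            "write did not complete in time",
        )),
    }
}
```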
Not a solution
I don't think it is possible to rely on `web_socket_actor.send(message)` returning `Err(Full)`. Consider the case of sending a single message to a WebSocket actor that causes that actor to send multiple megabytes to a gone (not closed) client. It won't fill the actor's mailbox, but it could saturate the network adapter. Also, the first time `Err(Full)` is returned, it is probably already too late. Finally, this would place a huge burden on the central actor to wait for a response from each WebSocket actor.
Steps to Reproduce (for bugs)
1. Host a WebSocket server at an internet address with a `run_interval` that does `pong` and closes the connection if it doesn't get `ping`. ~~The chat server example should suffice, with a modification to facilitate steps 4 and 5.~~ (See below)
2. Open a WebSocket from a computer with disconnectable internet (e.g. a laptop with WiFi)
3. Disconnect the client's internet (e.g. turn off the WiFi)
4. Send at least 1MB from the server on the WebSocket in several small messages (I don't know what the exact threshold is, but the idea is to exhaust the network interface so that further I/O has to wait)
5. Observe that the interval function no longer runs (you will not see any `println!("...")` from it), the WebSocket remains open (forever), and the TCP connection is leaked (forever)
~~A code example is TBD.~~
Update: A code example can be found here. Simply by changing the amount of outbound bandwidth in this line, the `IntervalFunc` goes from working (printing when it pings/closes, and actually closing the connection) to not working.
Workaround
Make sure the `ping`/`pong` interval function closes an unresponsive WebSocket faster than it can exhaust the network buffer. With my app, 15s works for this purpose.
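For reference, a rough sketch of that heartbeat pattern (placeholder names and intervals, not my exact code):

```rust
use std::time::{Duration, Instant};

use actix::prelude::*;
use actix_web_actors::ws;

// Assumed intervals for illustration. The key property is that CLIENT_TIMEOUT
// is shorter than the time a dead connection needs to exhaust the outbound
// network buffers, so the interval still gets a chance to run.
const HEARTBEAT_INTERVAL: Duration = Duration::from_secs(5);
const CLIENT_TIMEOUT: Duration = Duration::from_secs(15);

struct MySocket {
    last_heartbeat: Instant,
}

impl Actor for MySocket {
    type Context = ws::WebsocketContext<Self>;

    fn started(&mut self, ctx: &mut Self::Context) {
        // Ping the client on an interval; stop the actor (closing the
        // connection) if no ping/pong has been seen within CLIENT_TIMEOUT.
        ctx.run_interval(HEARTBEAT_INTERVAL, |act, ctx| {
            if Instant::now().duration_since(act.last_heartbeat) > CLIENT_TIMEOUT {
                println!("client heartbeat timed out, closing");
                ctx.stop();
                return;
            }
            ctx.ping(b"");
        });
    }
}

impl StreamHandler<Result<ws::Message, ws::ProtocolError>> for MySocket {
    fn handle(&mut self, msg: Result<ws::Message, ws::ProtocolError>, ctx: &mut Self::Context) {
        match msg {
            Ok(ws::Message::Ping(bytes)) => {
                self.last_heartbeat = Instant::now();
                ctx.pong(&bytes);
            }
            Ok(ws::Message::Pong(_)) => {
                self.last_heartbeat = Instant::now();
            }
            Ok(ws::Message::Close(reason)) => {
                ctx.close(reason);
                ctx.stop();
            }
            _ => {}
        }
    }
}
```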
Stopping the WebSocket actor if its mailbox becomes full might help in some cases, but I don't think this is foolproof, as I discussed in "Not a solution".
Context
This bug is causing TCP connections to be leaked on my production server (although the situation is less urgent after applying the workaround). This is because the WebSocket in question is sent a large amount of messages from the server (10/s), my users' clients' networks are not totally reliable, and my `IntervalFunc` doesn't intervene before the network interface is exhausted (paralysis occurs ~40 seconds after the connection is severed, but my `ping`/`pong` code only stops the connection after 90 seconds of inactivity).
actix = "0.12"
actix-codec = "0.4.0"
actix-web = {version="4.0.0-beta.9", features=["rustls"]}
actix-web-actors = "4.0.0-beta.7"
Your Environment
- Rust Version (i.e. output of `rustc -V`): `rustc 1.58.0-nightly`
- Actix Web Version: `master` branch, at the time of writing
Why do you mention actix's context wait queue? The wait queue will indeed block the actor from doing pretty much any other processing, that's its intention, but where in `actix-web-actors` do we push items to the wait queue?
> Why do you mention actix's context wait queue? The wait queue will indeed block the actor from doing pretty much any other processing, that's its intention, but where in `actix-web-actors` do we push items to the wait queue?
Thanks for pointing out this issue with my original theory. Looking at the code again, I can't find any evidence that a `wait` was added on account of sending WebSocket messages. I must have been confused by something, because now it seems like the messages are taken on demand using a stream API.
However, the bug still stands. Somehow, sending too many messages (when the client is no longer reading them) can paralyze the actor so its `IntervalFunc`s stop running (my code example demonstrates this). The problem is probably more complicated than my original theory.
I might have found a clue. The future of the `WebSocketContext` isn't spawned directly on the runtime as with the usual `Actor::start`; instead it's returned as a streaming `HttpResponse` to be polled by a `Dispatcher` from `actix-http`. So even though the `WebSocketContext` is more than just a `Stream` (e.g. you can spawn futures, wait on futures, etc.), the `Dispatcher` only treats it as a `Stream`. The `Dispatcher` buffers IO and enforces a maximum buffer size, only pulling the stream if the buffer isn't full:
https://github.com/actix/actix-web/blob/66620a101211c424ffc7e2f1b1c7bb9e2a78ff87/actix-http/src/h1/dispatcher.rs#L423-L424
This means that jammed IO will block your actor completely, including `ActorFuture`s spawned on its context (currently, `payload::MAX_BUFFER_SIZE` is 32k).
`h1::Dispatcher` does have a keep-alive timer, but it seems it will only go to shutdown if the `write_buffer` is empty.
A minimal reproduction example repo would be helpful for determining how to help.
@finnbear What OS are you using? `poll_write` on `TcpStream` will eventually return an error on a severed connection, but how long it takes will vary with your OS and its configuration.
If you are on Linux you might try to play with `tcp_retries2` to decrease this time and see if the connection is closed after it. From the man pages:
```
tcp_retries2 (integer; default: 15; since Linux 2.2)
    The maximum number of times a TCP packet is retransmitted
    in established state before giving up. The default value
    is 15, which corresponds to a duration of approximately
    between 13 to 30 minutes, depending on the retransmission
    timeout. The RFC 1122 specified minimum limit of 100
    seconds is typically deemed too short.
```
PS: I don't think the timeout duration scales linearly with this setting; it's probably exponential.
> A minimal reproduction example repo would be helpful for determining how to help.
Sure, I put one on the `ws-mre` branch: https://github.com/finnbear/actix-tcp-leak/tree/ws-mre
> What OS are you using?
The server is Debian 11; the computers I test on run Ubuntu ~20.04. On my Ubuntu 20.04.2 laptop, I ran:
```console
$ echo 3 | sudo tee /proc/sys/net/ipv4/tcp_retries2
3
```
According to this source, this corresponds to a 1.4 second delay before timeout (my previous setting would have taken 13 minutes to time out).
However, my 20 leaked connections are still going strong after ~~3~~ ~~10~~ 75 minutes (and the interval func that should have realized the problem doesn't print anything, indicating that it is dead).
I will leave the program running for as long as I can and update the above duration with my findings.
Hello!
I think I'm experiencing the same issue. Are you aware of any workarounds for it?
> Are you aware of any workarounds for it?
See my original issue. A good enough workaround is to send pings periodically, and terminate the connection rapidly enough if pongs are not received. If you are worried about malicious clients, include some random number in your pings and compare it with the echoed data in the pongs.
The theory is that, for a ping/pong round trip to succeed, the network buffers must not have been clogged at any point during the test. If you do this test often enough, you can terminate the connection before it is too late.
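A small sketch of the random-nonce idea (struct and field names are made up; the rest of the heartbeat logic is omitted):

```rust
use std::time::Instant;

use actix::prelude::*;
use actix_web_actors::ws;
use rand::Rng;

struct Session {
    last_heartbeat: Instant,
    // Random payload sent with the most recent ping.
    nonce: [u8; 8],
}

impl Actor for Session {
    type Context = ws::WebsocketContext<Self>;
}

impl Session {
    fn send_ping(&mut self, ctx: &mut ws::WebsocketContext<Self>) {
        // Fresh random payload for every ping, so a client can't fake
        // liveness by replaying an old pong or sending unsolicited ones.
        self.nonce = rand::thread_rng().gen();
        ctx.ping(&self.nonce);
    }
}

impl StreamHandler<Result<ws::Message, ws::ProtocolError>> for Session {
    fn handle(&mut self, msg: Result<ws::Message, ws::ProtocolError>, _ctx: &mut Self::Context) {
        if let Ok(ws::Message::Pong(bytes)) = msg {
            // Only count the pong as a heartbeat if it echoes our nonce.
            if bytes.as_ref() == self.nonce.as_slice() {
                self.last_heartbeat = Instant::now();
            }
        }
        // ... handle Text/Binary/Close as usual ...
    }
}
```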
Finally, I feel bad saying this here, but you may want to look into `axum` as an alternative to `actix-web`. `axum`'s WebSocket implementation is based on `tungstenite`, whereas `actix-web`'s seems custom.
@finnbear Thank you so much, that makes sense to me. I'm playing around with this and I think Axum might be a very good alternative.
Hello!
I made a couple of findings around this issue, and it looks like having a global `Arc<Mutex>` is the source of the problem.
When you change `.lock()` to `.try_lock()`, you will notice a bunch of errors where something is trying to access the `Mutex` while another asynchronous function is holding it.
I'm not really convinced that this is an actix issue, and refactoring the code to an async-aware `Mutex` like this or this might be a fix?
> When you change `.lock()` to `.try_lock()`, you will notice a bunch of errors where something is trying to access the `Mutex` when it is blocked by another asynchronous function.
I don't doubt that there could be some lock errors using `try_lock`, especially when the actors are starting, but these errors are transient and just indicate contention on the `Mutex`. It doesn't indicate to me that the `Mutex` is at fault. For the `Mutex` to be at fault, there would need to be a deadlock that prevents the program from making progress. There are two ways a deadlock could occur:
- One thread acquires the lock twice (not possible, as no code path attempts this)
- Two or more locks are acquired in a cyclic way (not possible, as there is only one `Mutex`)
If blocking on the lock explained the actors failing to stop, you would expect the first `println!` to run but not the second:
```rust
fn stopped(&mut self, ctx: &mut Self::Context) {
    println!("A firehose is stopping");
    FIREHOSES.lock().unwrap().remove(&ctx.address().recipient());
    println!("A firehose stopped");
}
```
However, when the data size is set to `10000`, neither `println!` runs and the actors stay running. While running, they don't touch the mutex except when starting (they all start fine) and stopping (they don't even try to stop). It doesn't matter if the sender blocks, as it is running in its own spawned thread, but I have also verified that it doesn't block for a significant time (it sends to an unbounded queue).
I've updated my code to use the latest version of actix* and the issue is still reproducible.
> I'm not really convinced that this is an actix issue, and refactoring the code to an async-aware `Mutex` like this or this might be a fix?
According to the async mutex documentation, the main benefit is that you can hold them across `await` points, which my code isn't doing.
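For reference, a minimal sketch (plain tokio, not code from my app) of what holding a lock across an `await` point looks like, which is the case an async-aware `Mutex` is designed for:

```rust
use std::sync::Arc;
use std::time::Duration;

use tokio::sync::Mutex;

// The guard stays held while the task is suspended at the .await. Doing the
// same with a std::sync::MutexGuard is the problematic pattern: that guard
// isn't Send, and other tasks calling a blocking lock() would tie up
// executor threads until it is released.
async fn hold_across_await(shared: Arc<Mutex<Vec<u8>>>) {
    let mut guard = shared.lock().await;
    guard.push(1);
    tokio::time::sleep(Duration::from_millis(10)).await;
    guard.push(2);
}
```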
@finnbear thank you for all your time. I was experiencing different issues, but your case still seems to be a bug.
> @finnbear thank you for all your time. I was experiencing different issues, but your case still seems to be a bug.
No problem, glad you were able to resolve your issue! :+1:
Since the `actix` actor framework is in maintenance mode and the workaround for this issue is clear and reasonable, I'm going to close this.
We are exploring a non-actor alternative to WebSockets to be built-in to Actix Web in the near future.
For now the `actix-ws` crate is a good alternative. Examples of its use can be found here under the `*-actorless` directories.