Run out of sockets (lots in `CLOSE-WAIT`)
Description
My server has stopped responding to requests. When I logged in to see what was going on, I could see lots of errors that looked like hacking attempts (e.g. `GET /.env`) and also lots of error messages that I've subsequently traced to rustls (`received corrupt message of type Handshake`), which again are malformed packets that are probably targeted at OpenSSL.

When I look at the state of the sockets (`ss -t`), I see loads of sockets stuck in `CLOSE-WAIT` (and a few in `ESTAB`). I wonder if the socket is never closed when the client doesn't send the right close messages (which would enable a DDoS attack). I wonder if there's an option we should be setting on the OS socket that would time out and close sockets that have been in `CLOSE-WAIT` for long enough. It could be something else - I'm not an expert in this area.
To Reproduce
I'm sorry but I'm not sure how to reproduce the error. I don't have a low-level log of exactly what the TLS handshake & close looks like. I'm more than happy to run any experiments that would be useful. I can give you the versions of the libraries that I'm using:
[dependencies]
diesel = { version = "1.4.8", default-features = false, features = ["uuid", "chrono"] }
rocket = { version = "0.5.0-rc.1", features = ["secrets", "tls", "json"] }
rocket_sync_db_pools = { version = "0.1.0-rc.1", default-features = false, features = ["diesel_postgres_pool"] }
serde = { version = "1.0.136", features = ["derive"] }
uuid = { version = "0.6.5", features = ["v4", "serde"] }
serde_json = "1.0.79"
chrono = { version = "0.4.19", features = ["serde"] }
anyhow = "1.0.54"
Environment:
- `uname -a`: Linux hostname 5.4.0-100-generic #113-Ubuntu SMP Thu Feb 3 18:43:29 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
- Rocket Version: see dependencies above
Additional Context
Here is some debug output (both the logs and the `ss` output are snipped, because the remaining messages are the same):
output of `ss` and Rocket logs
-- rocket logs --
Feb 21 21:25:22 host server[21894]: Error: connection accept error: received corrupt message of type Handshake
Feb 21 21:25:22 host server[21894]: Error: optimistically retrying now
Feb 21 21:25:25 host server[21894]: Error: connection accept error: received corrupt message of type Handshake
Feb 21 21:25:25 host server[21894]: Error: optimistically retrying now
Feb 21 21:25:25 host server[21894]: Error: connection accept error: received corrupt message of type Handshake
Feb 21 21:25:25 host server[21894]: Error: optimistically retrying now
Feb 21 21:25:26 host server[21894]: Error: connection accept error: received corrupt message of type Handshake
Feb 21 21:25:26 host server[21894]: Error: optimistically retrying now
Feb 21 21:25:26 host server[21894]: Error: connection accept error: received corrupt message of type Handshake
Feb 21 21:25:26 host server[21894]: Error: optimistically retrying now
Feb 21 21:25:26 host server[21894]: Error: connection accept error: received corrupt message of type Handshake
Feb 21 21:25:26 host server[21894]: Error: optimistically retrying now
Feb 21 21:25:26 host server[21894]: Error: connection accept error: received corrupt message of type Handshake
Feb 21 21:25:26 host server[21894]: Error: optimistically retrying now
Feb 21 21:25:27 host server[21894]: Error: connection accept error: received corrupt message of type Handshake
Feb 21 21:25:27 host server[21894]: Error: optimistically retrying now
Feb 21 21:25:27 host server[21894]: Error: connection accept error: received corrupt message of type Handshake
Feb 21 21:25:27 host server[21894]: Error: optimistically retrying now
Feb 21 21:25:27 host server[21894]: Error: connection accept error: received corrupt message of type Handshake
Feb 21 21:25:27 host server[21894]: Error: optimistically retrying now
Feb 21 21:43:19 host server[21894]: Error: connection accept error: received corrupt message of type Handshake
Feb 21 21:43:19 host server[21894]: Error: optimistically retrying now
Feb 21 21:43:19 host server[21894]: Error: connection accept error: received corrupt message of type Handshake
Feb 21 21:43:19 host server[21894]: Error: optimistically retrying now
Feb 21 21:43:19 host server[21894]: Error: connection accept error: received corrupt message of type Handshake
Feb 21 21:43:19 host server[21894]: Error: optimistically retrying now
Feb 21 21:43:20 host server[21894]: Error: connection accept error: received corrupt message of type Handshake
Feb 21 21:43:20 host server[21894]: Error: optimistically retrying now
Feb 21 21:43:20 host server[21894]: Error: connection accept error: received corrupt message of type Handshake
Feb 21 21:43:20 host server[21894]: Error: optimistically retrying now
Feb 21 21:43:20 host server[21894]: Error: connection accept error: received corrupt message of type Handshake
Feb 21 21:43:20 host server[21894]: Error: optimistically retrying now
Feb 21 21:43:21 host server[21894]: Error: connection accept error: received corrupt message of type Handshake
Feb 21 21:43:21 host server[21894]: Error: optimistically retrying now
Feb 21 21:43:21 host server[21894]: Error: connection accept error: received corrupt message of type Handshake
Feb 21 21:43:21 host server[21894]: Error: optimistically retrying now
Feb 21 21:43:21 host server[21894]: Error: connection accept error: received corrupt message of type Handshake
Feb 21 21:43:21 host server[21894]: Error: optimistically retrying now
Feb 21 21:43:21 host server[21894]: Error: connection accept error: received corrupt message of type Handshake
Feb 21 21:43:21 host server[21894]: Error: optimistically retrying now
Feb 21 22:17:40 host server[21894]: Error: connection accept error: received corrupt message
Feb 21 22:17:40 host server[21894]: Error: optimistically retrying now
Feb 21 22:17:40 host server[21894]: Error: connection accept error: received corrupt message
Feb 21 22:17:40 host server[21894]: Error: optimistically retrying now
-- output of `ss -ot` --
CLOSE-WAIT 183 0 0.my.ip.0:https 193.118.53.194:59762
CLOSE-WAIT 416 0 0.my.ip.0:https 159.203.91.246:40734
CLOSE-WAIT 417 0 0.my.ip.0:https 23.239.10.235:41462
CLOSE-WAIT 139 0 0.my.ip.0:https 45.79.216.179:51300
CLOSE-WAIT 301 0 0.my.ip.0:https 103.78.180.53:57637
CLOSE-WAIT 335 0 0.my.ip.0:https 159.203.91.246:42640
CLOSE-WAIT 221 0 0.my.ip.0:https 130.211.54.158:60272
CLOSE-WAIT 204 0 0.my.ip.0:https 192.3.67.179:47314
CLOSE-WAIT 123 0 0.my.ip.0:https 139.162.145.250:49526
ESTAB 244 0 0.my.ip.0:https 23.96.35.227:60472
CLOSE-WAIT 105 0 0.my.ip.0:https 192.241.201.233:58786
CLOSE-WAIT 105 0 0.my.ip.0:https 192.241.198.70:38632
CLOSE-WAIT 335 0 0.my.ip.0:https 192.241.218.121:57466
ESTAB 0 0 0.my.ip.0:https 222.186.19.235:53586
CLOSE-WAIT 244 0 0.my.ip.0:https 45.146.165.37:47820
CLOSE-WAIT 518 0 0.my.ip.0:https 66.249.66.19:38986
CLOSE-WAIT 123 0 0.my.ip.0:https 60.217.75.69:35568
CLOSE-WAIT 160 0 0.my.ip.0:https 64.62.197.212:44426
ESTAB 244 0 0.my.ip.0:https 23.96.35.227:55962
CLOSE-WAIT 282 0 0.my.ip.0:https 198.17.52.130:57414
CLOSE-WAIT 160 0 0.my.ip.0:https 64.62.197.92:26960
CLOSE-WAIT 351 0 0.my.ip.0:https 188.166.33.116:41896
CLOSE-WAIT 518 0 0.my.ip.0:https 3.145.217.144:34732
CLOSE-WAIT 105 0 0.my.ip.0:https 192.241.219.171:43058
CLOSE-WAIT 293 0 0.my.ip.0:https 124.57.42.113:56430
ESTAB 244 0 0.my.ip.0:https 20.120.32.30:51659
CLOSE-WAIT 183 0 0.my.ip.0:https 128.14.209.146:47751
CLOSE-WAIT 518 0 0.my.ip.0:https 44.203.91.84:55122
CLOSE-WAIT 350 0 0.my.ip.0:https 185.174.28.39:53682
CLOSE-WAIT 419 0 0.my.ip.0:https 207.154.204.132:51636
CLOSE-WAIT 244 0 0.my.ip.0:https 45.146.165.37:40500
CLOSE-WAIT 139 0 0.my.ip.0:https 45.79.216.179:42006
CLOSE-WAIT 1 0 0.my.ip.0:https 23.129.64.211:21750
CLOSE-WAIT 1 0 0.my.ip.0:https 45.153.160.137:39160
ESTAB 244 0 0.my.ip.0:https 20.120.32.30:64493
CLOSE-WAIT 244 0 0.my.ip.0:https 45.146.165.168:7598
CLOSE-WAIT 419 0 0.my.ip.0:https 188.166.33.116:51370
CLOSE-WAIT 198 0 0.my.ip.0:https 134.122.11.90:60185
CLOSE-WAIT 139 0 0.my.ip.0:https 45.79.216.179:54814
CLOSE-WAIT 244 0 0.my.ip.0:https 154.89.5.80:8738
CLOSE-WAIT 417 0 0.my.ip.0:https 192.241.212.220:34400
CLOSE-WAIT 419 0 0.my.ip.0:https 159.203.34.23:45604
CLOSE-WAIT 1 0 0.my.ip.0:https 45.153.160.137:46285
CLOSE-WAIT 244 0 0.my.ip.0:https 45.146.165.37:38502
CLOSE-WAIT 5 0 0.my.ip.0:https 23.225.163.205:52748
CLOSE-WAIT 351 0 0.my.ip.0:https 167.99.176.87:52608
ESTAB 244 0 0.my.ip.0:https 23.96.35.227:58420
CLOSE-WAIT 351 0 0.my.ip.0:https 68.183.40.13:52634
CLOSE-WAIT 518 0 0.my.ip.0:https 89.187.175.246:34485
ESTAB 244 0 0.my.ip.0:https 20.120.32.30:62169
ESTAB 244 0 0.my.ip.0:https 20.120.32.30:57161
CLOSE-WAIT 1 0 0.my.ip.0:https 134.122.11.90:50849
CLOSE-WAIT 20 0 0.my.ip.0:https 89.248.165.244:35110
CLOSE-WAIT 123 0 0.my.ip.0:https 192.241.218.116:59286
CLOSE-WAIT 378 0 0.my.ip.0:https 46.101.135.113:58590
CLOSE-WAIT 40 0 0.my.ip.0:https 23.236.147.154:53890
ESTAB 0 0 127.0.0.1:44982 127.0.0.1:postgresql timer:(keepalive,117min,0)
ESTAB 244 0 0.my.ip.0:https 20.120.32.30:64694
CLOSE-WAIT 160 0 0.my.ip.0:https 64.62.197.62:59442
CLOSE-WAIT 123 0 0.my.ip.0:https 192.241.215.153:49624
CLOSE-WAIT 432 0 0.my.ip.0:https 188.166.33.116:46248
ESTAB 244 0 0.my.ip.0:https 23.96.35.227:63713
CLOSE-WAIT 244 0 0.my.ip.0:https 45.146.165.168:3610
CLOSE-WAIT 105 0 0.my.ip.0:https 192.241.217.122:49146
CLOSE-WAIT 405 0 0.my.ip.0:https 46.101.135.113:53420
CLOSE-WAIT 378 0 0.my.ip.0:https 188.166.33.116:33084
CLOSE-WAIT 244 0 0.my.ip.0:https 45.146.165.37:51794
CLOSE-WAIT 329 0 0.my.ip.0:https 54.241.251.88:39558
CLOSE-WAIT 293 0 0.my.ip.0:https 124.57.42.113:48066
CLOSE-WAIT 112 0 0.my.ip.0:https 89.248.165.52:37526
CLOSE-WAIT 518 0 0.my.ip.0:https 66.249.66.21:49634
CLOSE-WAIT 428 0 0.my.ip.0:https 165.227.76.114:46866
CLOSE-WAIT 227 0 0.my.ip.0:https 185.220.101.38:7396
CLOSE-WAIT 123 0 0.my.ip.0:https 192.241.218.97:56230
ESTAB 244 0 0.my.ip.0:https 20.120.32.30:59518
ESTAB 244 0 0.my.ip.0:https 20.120.32.30:54206
CLOSE-WAIT 244 0 0.my.ip.0:https 45.146.165.37:48838
ESTAB 244 0 0.my.ip.0:https 23.96.35.227:54342
CLOSE-WAIT 123 0 0.my.ip.0:https 192.241.198.145:45926
CLOSE-WAIT 17 0 0.my.ip.0:https 134.122.11.90:42915
CLOSE-WAIT 163 0 0.my.ip.0:https 20.115.6.6:54746
CLOSE-WAIT 105 0 0.my.ip.0:https 192.241.209.114:59180
CLOSE-WAIT 105 0 0.my.ip.0:https 192.241.211.162:51272
CLOSE-WAIT 244 0 0.my.ip.0:https 45.146.165.37:37206
CLOSE-WAIT 335 0 0.my.ip.0:https 192.241.215.159:45166
CLOSE-WAIT 183 0 0.my.ip.0:https 193.118.53.202:43290
CLOSE-WAIT 349 0 0.my.ip.0:https 165.227.76.114:43434
CLOSE-WAIT 244 0 0.my.ip.0:https 45.146.165.37:50118
CLOSE-WAIT 20 0 0.my.ip.0:https 128.14.209.146:34329
CLOSE-WAIT 264 0 0.my.ip.0:https 34.219.36.208:3408
CLOSE-WAIT 105 0 0.my.ip.0:https 192.241.221.126:53778
CLOSE-WAIT 417 0 0.my.ip.0:https 159.203.91.246:37874
CLOSE-WAIT 183 0 0.my.ip.0:https 128.14.134.134:34836
CLOSE-WAIT 422 0 0.my.ip.0:https 68.183.40.13:52750
CLOSE-WAIT 354 0 0.my.ip.0:https 212.192.241.99:44828
CLOSE-WAIT 143 0 0.my.ip.0:https 45.143.99.69:45506
CLOSE-WAIT 335 0 0.my.ip.0:https 192.241.209.167:33148
CLOSE-WAIT 145 0 0.my.ip.0:https 192.241.212.241:54068
CLOSE-WAIT 139 0 0.my.ip.0:https 45.79.216.179:49124
CLOSE-WAIT 145 0 0.my.ip.0:https 80.82.77.192:32826
CLOSE-WAIT 139 0 0.my.ip.0:https 45.79.216.179:56218
CLOSE-WAIT 518 0 0.my.ip.0:https 213.121.4.11:60225
CLOSE-WAIT 244 0 0.my.ip.0:https 45.146.165.37:48462
CLOSE-WAIT 139 0 0.my.ip.0:https 45.79.216.179:46610
CLOSE-WAIT 518 0 0.my.ip.0:https 54.183.255.27:58054
CLOSE-WAIT 349 0 0.my.ip.0:https 192.241.209.167:38902
CLOSE-WAIT 57 0 0.my.ip.0:https 154.89.5.80:48384
CLOSE-WAIT 518 0 0.my.ip.0:https 114.119.140.107:12580
CLOSE-WAIT 419 0 0.my.ip.0:https 207.154.204.132:56456
CLOSE-WAIT 420 0 0.my.ip.0:https 192.241.215.159:55790
CLOSE-WAIT 518 0 0.my.ip.0:https 114.119.146.141:45472
ESTAB 0 0 127.0.0.1:postgresql 127.0.0.1:44984 timer:(keepalive,119min,0)
CLOSE-WAIT 19 0 0.my.ip.0:https 185.189.182.234:35808
CLOSE-WAIT 244 0 0.my.ip.0:https 45.146.165.37:51110
Can you run the application with debug logging on (`ROCKET_LOG_LEVEL=debug`) and report the logs?
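For reference, the log level can also be raised in code rather than via the environment variable. A rough sketch, assuming Rocket 0.5-rc's figment-based configuration (the route here is just a placeholder):

```rust
// Sketch: set Rocket's log level to "debug" programmatically instead of
// exporting ROCKET_LOG_LEVEL=debug. Assumes Rocket 0.5-rc; the route is a
// placeholder.
#[macro_use] extern crate rocket;

#[get("/")]
fn index() -> &'static str {
    "ok"
}

#[launch]
fn rocket() -> _ {
    // Start from Rocket's default configuration sources and override log_level.
    let figment = rocket::Config::figment().merge(("log_level", "debug"));
    rocket::custom(figment).mount("/", routes![index])
}
```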
I will do as you say - the server only serves this one site, so if the HD gets filled and the site crashes, it's not the end of the world.
Thanks again for all the work on this. I know how hard it is to respond to the quantity of PRs/Issues you must get. 🙂
Hmm, weirdly, with debug logging turned on I don't get the issue. I hate it when this happens. In my experience, when turning on debug logging makes a problem go away, it was a timing issue that the flush to the logging sink happens to fix. I'll leave it running some more to see if it happens eventually.
Right, so turning on logging seems to fix the problem. Weird. I will turn off logging to see if the problem comes back. Otherwise I think I'm going to have to close this if I can't even reproduce the error myself.
I'm seeing exactly the same issue. I haven't tried turning on debug logging yet, I'll try that now.
So I turned logging off, and the problem came back. 'Twas ever thus. It would be good if we could get to the bottom of this, @Pagten. I think we need to find a way to make the problem occur while we are collecting more logging telemetry. Hopefully you will get the error with logging turned on!
So far it looks like enabling debug logging hides the issue for me as well. The server has been up and running for 3 days without issues, and running `ss -t` doesn't show any connections in the `CLOSE-WAIT` state.
Urgh issues that go away in debug mode are the worst.
Just today I saw that the issue was triggered again, probably some time over the weekend. This time the server had debug logging enabled. I've redacted the logs to remove some personally identifiable information and attached them to this comment. I've also included the output of running `ss -t`. Any redacted information has been replaced with `<redacted>`.
I didn't immediately find anything obvious indicating the cause of the issue, but I haven't yet had time to thoroughly look through the debug logs. Maybe someone else can spot something?
Thank you for the logs. I've just pushed a commit that completely reworks the TLS connection implementation. I believe this reworked implementation as well as other relevant upstream fixes may resolve the issue at hand. Would you be able to test against the latest tip?
> Would you be able to test against the latest tip?

Yes, I will do so. I'll need a few days though; it's a bit busy at the moment. Should I test with debug logging enabled (which appears to increase the time it takes for the issue to occur) or disabled?
> Would you be able to test against the latest tip?
>
> Yes, I will do so. I'll need a few days though; it's a bit busy at the moment. Should I test with debug logging enabled (which appears to increase the time it takes for the issue to occur) or disabled?
Same, will test but will need a few days. Will post if there are any issues. I suggest leaving this issue open for a few months, then if neither of us have said we still see the problem, assume it's fixed and close the issue.
> Would you be able to test against the latest tip?
>
> Yes, I will do so. I'll need a few days though; it's a bit busy at the moment. Should I test with debug logging enabled (which appears to increase the time it takes for the issue to occur) or disabled?
It shouldn't matter. If the issue was showing up faster with debugging disabled, that sounds like the right thing to do to confirm (or reestablish) the issue's resolution.
My server has been running revision `761ffb0` since May 6th, with debug logging disabled. No issues seen so far...
I've had no problems either, so I will assume this is fixed and close the issue 😄 . I can re-open if it reoccurs or if anyone else experiences it. Thanks very much for fixing this - very much appreciated.
I still have this issue, any ideas? `uname -a` of the Docker container it's running in: Linux 8c67bee5b8ce 5.4.0-120-generic #136-Ubuntu SMP Fri Jun 10 13:40:48 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
[dependencies]
rocket = { version = "0.5.0-rc.2", features = ["json"] }
rocketjson = "1.2.1"
serde = { version = "1.0", features = ["derive", "rc"] }
serde_repr = "0.1.7"
figment = { version = "0.10.6", features = ["env", "toml", "json"] }
validator = { version = "0.14.0", features = ["derive"] }
tokio = { version = "1.14.0", features = ["full"] }
regex = "1.5.4"
bcrypt = "0.10.1"
dotenv = "0.15.0"
jwt = "0.15.0"
hmac = "0.11.0"
sha2 = "0.9.8"
bincode = "1.3.3"
sqlx = { version = "0.5", features = ["runtime-tokio-rustls", "mysql", "chrono"] }
chrono = { version = "0.4.19", features = ["serde"] }
rand = "0.8.4"
lettre = "0.10.0-rc.4"
rocket_cors = { git = "https://github.com/lawliet89/rocket_cors", branch = "master" }
@Nis5l I'm not sure about the timeline of releases and the fix that @SergioBenitez pushed. Could you try with Rocket set to latest master and see if the issue persists?
For me the issue still persists on the latest master.
[dependencies]
#rocket = { version = "0.5.0-rc.2", features = ["json"] }
rocket = { git = "https://github.com/SergioBenitez/Rocket", branch = "master", features = ["json"] }
#rocketjson = "1.2.1"
rocketjson = { git = "https://github.com/Nis5l/rocketjson", branch = "rocket-master" }
serde = { version = "1.0", features = ["derive", "rc"] }
serde_repr = "0.1.7"
figment = { version = "0.10.6", features = ["env", "toml", "json"] }
validator = { version = "0.14.0", features = ["derive"] }
tokio = { version = "1.14.0", features = ["full"] }
regex = "1.5.4"
bcrypt = "0.10.1"
dotenv = "0.15.0"
jwt = "0.15.0"
hmac = "0.11.0"
sha2 = "0.9.8"
bincode = "1.3.3"
sqlx = { version = "0.5", features = ["runtime-tokio-rustls", "mysql", "chrono"] }
chrono = { version = "0.4.19", features = ["serde"] }
rand = "0.8.4"
lettre = "0.10.0-rc.4"
#rocket_cors = { git = "https://github.com/lawliet89/rocket_cors", branch = "master" }
#when rocket_db_pools is out, switch
Hmm, I'll leave this open in case anyone else has any ideas. Did you do `cargo clean` and `cargo update` - perhaps there is something stale in `Cargo.lock`? Probably not though.
Should be fine, it's building inside a Docker container like this:
FROM rust:1.57.0 as builder
WORKDIR server
COPY ./Cargo.toml ./Cargo.toml
COPY ./src/ ./src/
RUN cargo build --release
and I used `docker compose build --no-cache`.
@Nis5l It doesn't look like you're using TLS, which is what this issue was originally about. Or are you perhaps using TLS via sqlx? Nevertheless, is your issue the same, in general? That is, Rocket stops responding to new requests, and upon investigation, there are many sockets in the CLOSE-WAIT stage? If so, a debug log would be useful in isolating the issue. Could you enable debug logging and post a log once the server is stalled?
No, I am not using TLS; the service is running in a Docker container behind a reverse proxy. The issue does seem to be the same to me: Rocket stops responding to requests (also from localhost) and many sockets are in the `CLOSE-WAIT` state.
This is the full log; it's pretty big because it also contains other logs. If you want me to remove the sqlx logs and the few prints, I can happily do that; I just thought I would keep them in for now, just in case. Socket information: `ss -t`
Things I think I've deduced by looking at the logs:
- Rocket is running on `172.24.0.4:80` and the reverse proxy is on `172.24.0.10`.
- Lots of sockets between Rocket and the proxy are getting stuck in `CLOSE-WAIT`.
So it does look like there is again an issue with handling a misbehaving client during connection closing. I strongly suspect the problem is somewhere different from the original issue here, because that fix was specific to TLS.
... But, although there could well be a bug in Rocket (or upstream), there is a good chance you can work around it by changing how the reverse proxy works. There is clearly something specific to your setup (or else everyone would be having your issue) that causes the problem - perhaps the reverse proxy is doing something incorrect, or correct but unusual. If you have control over it, perhaps you could look at how it works and whether it's doing everything it should.
I'd love to offer more specific advice as to what the problem could be, but I'm more of a user of the tokio/hyper stack than a developer of it. Fixing these kinds of issues will be important for the stack as it goes into widespread production use. There is a good chance that this issue is lower down than Rocket. It would be great if you could make a very small example in hyper (perhaps with one endpoint), put it behind your reverse proxy, and then spam it from an automated client to see if you still get the issue; a sketch of such a server follows. But I understand that this might not be possible. It would also be good if you could tell us what proxy you are using and its configuration, if that information isn't sensitive.
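For what it's worth, a minimal single-endpoint hyper server might look something like this - a sketch only, assuming hyper 0.14 and tokio, with the port and response body as placeholders:

```rust
// Minimal single-endpoint hyper server (sketch; assumes hyper 0.14 + tokio).
// Cargo deps (assumed): hyper = { version = "0.14", features = ["full"] },
//                       tokio = { version = "1", features = ["full"] }
// Put this behind the reverse proxy and hammer it with an automated client
// to see whether sockets still pile up in CLOSE-WAIT.
use std::convert::Infallible;

use hyper::service::{make_service_fn, service_fn};
use hyper::{Body, Request, Response, Server};

async fn handle(_req: Request<Body>) -> Result<Response<Body>, Infallible> {
    Ok(Response::new(Body::from("ok")))
}

#[tokio::main]
async fn main() {
    // Listen on all interfaces, port 8000 (placeholder).
    let addr = ([0, 0, 0, 0], 8000).into();

    let make_svc = make_service_fn(|_conn| async {
        Ok::<_, Infallible>(service_fn(handle))
    });

    if let Err(e) = Server::bind(&addr).serve(make_svc).await {
        eprintln!("server error: {}", e);
    }
}
```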
I tried a minimal Rocket setup behind the proxy, but it surprisingly stayed up. Now I will try to figure out what part of the initial application causes the problem; I will let you know as soon as I know.
Thanks
I pretty much gave up a while ago; I never found out what part caused the problem - to me it all seemed random. Currently I don't want to take the time to debug it further, but just in case anyone wants to look into this: I am using this proxy, which I set up following the docs, and this is a not-so-minimal "minimal" setup that should cause the problem.
It sounds like the problem has mysteriously disappeared, and that attempts to isolate it have failed. Hopefully this implies that the issue, if any, was somewhere under Rocket and has been fixed. I'm closing this for now. Should anyone see this issue reappear, please do report it.