nats.rs
nats.rs copied to clipboard
Request hang issues during connection.request / connection.publish when using TLS and larger payloads
For debugging convenience, I created a minimal project containing all configs, scripts and code here: https://github.com/choerst-ibm/nats-repro
See README.md there on how to run it most easily.
NATS version (grep 'name = "nats"' Cargo.lock -A 1
)
0.18.1
rustc version (rustc --version
- we support Rust 1.41 and up)
rustc 1.56.1
OS/Container environment:
TLS certificates and the respective CA were pre-generated to have localhost
and the host.docker.internal
and nats-main.example.com
as valid DNS altnames. See the linked repo up top for details if you want to generate them in the same way.
Running a local NATS server in docker with
docker run -d --name nats-main.example.com -v $(pwd)/certs/test-server.key:/test-server.key \
-v $(pwd)/certs/test-server.pem:/test-server.pem -v $(pwd)/nats.conf:/nats-server.conf \
-p 4222:4222 -p 6222:6222 -p 8222:8222 -p 4242:4242 nats -c nats-server.conf
nats-server.conf
port: 4222
debug: true
trace: true
max_traced_msg_len: 100
write_deadline = "10s"
max_payload = 64000000
max_pending = 256000000
monitor_port: 8222
authorization {
token: payloadtesttoken
timeout: 0.75
}
cluster {
name: "payload_test_cluster"
port: 6222
routes = []
}
tls: {
cert_file: "./test-server.pem"
key_file: "./test-server.key"
timeout: 1
}
Running a nats-replier in docker with:
docker run --rm -v $PWD:/workspace -it --add-host host.docker.internal:host-gateway synadia/nats-box
In said container:
nats reply --tlsca /workspace/certs/test-ca.pem -s tls://[email protected]:4222 payload.test Hello > /dev/null
Steps or code to reproduce the issue:
use env_logger::Env;
use std::thread;
use std::time::{Duration, SystemTime};
use log::{debug, error, info};
fn send_nats_payload(subject: &str, payload: &str, connection: &nats::Connection) {
let now = SystemTime::now();
info!("Sending payload of size '{:?}' bytes to NATS on topic '{}'", payload.len(), subject);
match connection.request(subject, payload) {
Ok(response) => {
info!("Successfully submitted message to subject '{}' and received result", subject);
info!("Time taken to send request and receive reply was {} milliseconds", now.elapsed().unwrap().as_millis());
debug!("Received response message '{:?}'", response.data);
}
Err(error) => {
let error_message =
format!("Error during publish to subject '{}': {:?}", subject, error);
error!("{}", error_message);
}
}
}
fn main() {
env_logger::Builder::from_env(Env::default().default_filter_or("info")).format_timestamp_nanos().init();
let payload_size_mb = 15;
let payload = "T".repeat(payload_size_mb * 1000 * 1000);
let subject = "payload.test";
info!("Establishing connection to NATS");
let connection = nats::Options::with_token("payloadtesttoken")
.add_root_certificate("test-data/certs/test-ca.pem")
.connect("tls://localhost:4222")
.unwrap();
for run in 1..100 {
info!("Run {}: Sending payload to NATS", run);
send_nats_payload(subject, &payload, &connection);
info!("Waiting 1 seconds between requests");
thread::sleep(Duration::from_secs(1));
}
}
Expected result:
Having somewhat predictable req<->response timings, with time taken going up according to payload size without lock-ups. I'd expect something like this (Running with TLS):
[2022-03-31T13:27:46.787019000Z INFO nats_repro] Establishing connection to NATS
[2022-03-31T13:27:47.471911000Z INFO nats_repro] Run 1: Sending payload to NATS
[2022-03-31T13:27:47.471971000Z INFO nats_repro] Sending payload of size '2000000' bytes to NATS on topic 'payload.test'
[2022-03-31T13:27:49.732270000Z INFO nats_repro] Successfully submitted message to subject 'payload.test' and received result
[2022-03-31T13:27:49.732310000Z INFO nats_repro] Time taken to send request and receive reply was 2260 milliseconds
[2022-03-31T13:27:49.732331000Z INFO nats_repro] Waiting 1 seconds between requests
[2022-03-31T13:27:50.735865000Z INFO nats_repro] Run 2: Sending payload to NATS
[2022-03-31T13:27:50.735927000Z INFO nats_repro] Sending payload of size '2000000' bytes to NATS on topic 'payload.test'
[2022-03-31T13:27:50.867102000Z INFO nats_repro] Successfully submitted message to subject 'payload.test' and received result
[2022-03-31T13:27:50.867140000Z INFO nats_repro] Time taken to send request and receive reply was 131 milliseconds
[2022-03-31T13:27:50.867160000Z INFO nats_repro] Waiting 1 seconds between requests
[2022-03-31T13:27:51.871206000Z INFO nats_repro] Run 3: Sending payload to NATS
[2022-03-31T13:27:51.871267000Z INFO nats_repro] Sending payload of size '2000000' bytes to NATS on topic 'payload.test'
[2022-03-31T13:27:52.003510000Z INFO nats_repro] Successfully submitted message to subject 'payload.test' and received result
[2022-03-31T13:27:52.003542000Z INFO nats_repro] Time taken to send request and receive reply was 132 milliseconds
[2022-03-31T13:27:52.003558000Z INFO nats_repro] Waiting 1 seconds between requests
[2022-03-31T13:27:53.007835000Z INFO nats_repro] Run 4: Sending payload to NATS
[2022-03-31T13:27:53.007874000Z INFO nats_repro] Sending payload of size '2000000' bytes to NATS on topic 'payload.test'
[2022-03-31T13:27:53.143732000Z INFO nats_repro] Successfully submitted message to subject 'payload.test' and received result
[2022-03-31T13:27:53.143764000Z INFO nats_repro] Time taken to send request and receive reply was 135 milliseconds
[2022-03-31T13:27:53.143781000Z INFO nats_repro] Waiting 1 seconds between requests
Running without TLS, even larger payloads aren't stuck and have comparable run times:
[2022-03-31T16:49:21.475659000Z INFO nats_repro] Establishing connection to NATS
[2022-03-31T16:49:22.326724000Z INFO nats_repro] Run 1: Sending payload to NATS
[2022-03-31T16:49:22.326776000Z INFO nats_repro] Sending payload of size '15000000' bytes to NATS on topic 'payload.test'
[2022-03-31T16:49:23.310846000Z INFO nats_repro] Successfully submitted message to subject 'payload.test' and received result
[2022-03-31T16:49:23.310899000Z INFO nats_repro] Time taken to send request and receive reply was 984 milliseconds
[2022-03-31T16:49:23.310940000Z INFO nats_repro] Waiting 1 seconds between requests
[2022-03-31T16:49:24.315603000Z INFO nats_repro] Run 2: Sending payload to NATS
[2022-03-31T16:49:24.315649000Z INFO nats_repro] Sending payload of size '15000000' bytes to NATS on topic 'payload.test'
[2022-03-31T16:49:25.515522000Z INFO nats_repro] Successfully submitted message to subject 'payload.test' and received result
[2022-03-31T16:49:25.515590000Z INFO nats_repro] Time taken to send request and receive reply was 1199 milliseconds
Actual result:
There seem to be lock-up issues that are only seen when running with TLS with larger NATS payloads. A large payload (15MB) is used to force this problem reliably here, I've also seen the issue intermittendly with smaller (>2MB) payload sizes.
[2022-03-31T13:13:18.061529000Z INFO nats_repro] Establishing connection to NATS
[2022-03-31T13:13:19.135026000Z INFO nats_repro] Run 1: Sending payload to NATS
[2022-03-31T13:13:19.135070000Z INFO nats_repro] Sending payload of size '15000000' bytes to NATS on topic 'payload.test'
[2022-03-31T13:13:22.193753000Z INFO nats_repro] Successfully submitted message to subject 'payload.test' and received result
[2022-03-31T13:13:22.193812000Z INFO nats_repro] Time taken to send request and receive reply was 3058 milliseconds
[2022-03-31T13:13:22.193893000Z INFO nats_repro] Waiting 1 seconds between requests
[2022-03-31T13:13:23.197879000Z INFO nats_repro] Run 2: Sending payload to NATS
[2022-03-31T13:13:23.197932000Z INFO nats_repro] Sending payload of size '15000000' bytes to NATS on topic 'payload.test'
[2022-03-31T13:17:21.792501000Z INFO nats_repro] Successfully submitted message to subject 'payload.test' and received result
[2022-03-31T13:17:21.792535000Z INFO nats_repro] Time taken to send request and receive reply was 238594 milliseconds
[2022-03-31T13:17:21.792551000Z INFO nats_repro] Waiting 1 seconds between requests
After the first request, the second request will take close to 4 minutes to complete (In this time, the nats server seems to be receiving no payload data in that time (nats-server log):
[1] 2022/03/31 13:13:21.785444 [TRC] 172.17.0.1:58200 - cid:6 - "v0.18.1:rust" - <<- MSG_PAYLOAD: ["TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT..."]
[1] 2022/03/31 13:13:21.788880 [TRC] 172.17.0.1:58196 - cid:5 - "v1.11.0:go:NATS CLI Version 20210429" - ->> [MSG payload.test 1 _INBOX.mrtNZiFtiuFvmpwHUZaj0v 15000000]
[1] 2022/03/31 13:13:21.864693 [TRC] 172.17.0.1:58200 - cid:6 - "v0.18.1:rust" - <<- [PONG]
[1] 2022/03/31 13:13:22.218296 [TRC] 172.17.0.1:58196 - cid:5 - "v1.11.0:go:NATS CLI Version 20210429" - <<- [PUB _INBOX.mrtNZiFtiuFvmpwHUZaj0v 5]
[1] 2022/03/31 13:13:22.218357 [TRC] 172.17.0.1:58196 - cid:5 - "v1.11.0:go:NATS CLI Version 20210429" - <<- MSG_PAYLOAD: ["Hello"]
[1] 2022/03/31 13:13:22.218383 [TRC] 172.17.0.1:58200 - cid:6 - "v0.18.1:rust" - ->> [MSG _INBOX.mrtNZiFtiuFvmpwHUZaj0v 1 5]
[1] 2022/03/31 13:13:22.220596 [TRC] 172.17.0.1:58200 - cid:6 - "v0.18.1:rust" - <<- [UNSUB 1]
[1] 2022/03/31 13:13:22.220630 [TRC] 172.17.0.1:58200 - cid:6 - "v0.18.1:rust" - <-> [DELSUB 1]
[1] 2022/03/31 13:13:23.227143 [TRC] 172.17.0.1:58200 - cid:6 - "v0.18.1:rust" - <<- [SUB _INBOX.mrtNZiFtiuFvmpwHUZaj2O 2]
[1] 2022/03/31 13:13:23.227295 [TRC] 172.17.0.1:58200 - cid:6 - "v0.18.1:rust" - <<- [PUB payload.test _INBOX.mrtNZiFtiuFvmpwHUZaj2O 15000000]
[1] 2022/03/31 13:14:39.931255 [TRC] 172.17.0.1:58196 - cid:5 - "v1.11.0:go:NATS CLI Version 20210429" - <<- [PING]
[1] 2022/03/31 13:14:39.931285 [TRC] 172.17.0.1:58196 - cid:5 - "v1.11.0:go:NATS CLI Version 20210429" - ->> [PONG]
[1] 2022/03/31 13:14:42.044169 [DBG] 172.17.0.1:58196 - cid:5 - "v1.11.0:go:NATS CLI Version 20210429" - Client Ping Timer
[1] 2022/03/31 13:14:42.044297 [DBG] 172.17.0.1:58196 - cid:5 - "v1.11.0:go:NATS CLI Version 20210429" - Delaying PING due to client activity 1m20s ago
[1] 2022/03/31 13:15:21.224589 [DBG] 172.17.0.1:58200 - cid:6 - "v0.18.1:rust" - Client Ping Timer
[1] 2022/03/31 13:15:21.224639 [DBG] 172.17.0.1:58200 - cid:6 - "v0.18.1:rust" - Delaying PING due to client activity 1m58s ago
[1] 2022/03/31 13:16:39.795901 [TRC] 172.17.0.1:58196 - cid:5 - "v1.11.0:go:NATS CLI Version 20210429" - <<- [PING]
[1] 2022/03/31 13:16:39.795940 [TRC] 172.17.0.1:58196 - cid:5 - "v1.11.0:go:NATS CLI Version 20210429" - ->> [PONG]
[1] 2022/03/31 13:16:41.909175 [DBG] 172.17.0.1:58196 - cid:5 - "v1.11.0:go:NATS CLI Version 20210429" - Client Ping Timer
[1] 2022/03/31 13:16:41.909276 [DBG] 172.17.0.1:58196 - cid:5 - "v1.11.0:go:NATS CLI Version 20210429" - Delaying PING due to remote ping 2s ago
[1] 2022/03/31 13:17:21.088415 [DBG] 172.17.0.1:58200 - cid:6 - "v0.18.1:rust" - Client Ping Timer
[1] 2022/03/31 13:17:21.088509 [TRC] 172.17.0.1:58200 - cid:6 - "v0.18.1:rust" - ->> [PING]
[1] 2022/03/31 13:17:21.346264 [TRC] 172.17.0.1:58200 - cid:6 - "v0.18.1:rust" - <<- MSG_PAYLOAD: ["TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT..."]
[1] 2022/03/31 13:17:21.349824 [TRC] 172.17.0.1:58196 - cid:5 - "v1.11.0:go:NATS CLI Version 20210429" - ->> [MSG payload.test 1 _INBOX.mrtNZiFtiuFvmpwHUZaj2O 15000000]
If the issue is not hit with the sample right away, I suggest increasing payload_size_mb
in the sample code, this showed wildly different results for me (Theoretically this could go up to 64MB and beyond even). On different systems the payload size leading to a hang also seems to be different (macOS around 4MB, Linux 15MB). I've also observed the same issues using connection.publish.
I suspect the issue could be related to the added locking and waiting mechanism for TLS in connector.rs
: https://github.com/nats-io/nats.rs/blob/c8786c3b4924e0df5c4d750b0b9f2680f5608aa0/src/connector.rs#L461
@choerst-ibm thanks for the extensive report! We will look into this very shortly.
To add some more context here, this 4 minute hang can be avoided by not re-using connections, which in itself is not great. That connection recreation leads to another problem though, namely that the first request on a connection will have a significant delay very often, even if the payload size is rather small. I adjusted the sample to move the connection creation within the loop itself to simulate this and removed the wait (Sample with 1 MB payload size below):
[2022-04-06T14:23:47.988847000Z INFO nats_repro] Run 1: Sending payload to NATS
[2022-04-06T14:23:47.988938000Z INFO nats_repro] Sending payload of size '1000000' bytes to NATS on topic 'payload.test'
[2022-04-06T14:23:48.122026000Z INFO nats_repro] Successfully submitted message to subject 'payload.test' and received result
[2022-04-06T14:23:48.122125000Z INFO nats_repro] Time taken to send request and receive reply was 133 milliseconds
[2022-04-06T14:23:48.859358000Z INFO nats_repro] Run 2: Sending payload to NATS
[2022-04-06T14:23:48.859435000Z INFO nats_repro] Sending payload of size '1000000' bytes to NATS on topic 'payload.test'
[2022-04-06T14:23:51.081669000Z INFO nats_repro] Successfully submitted message to subject 'payload.test' and received result
[2022-04-06T14:23:51.081746000Z INFO nats_repro] Time taken to send request and receive reply was 2222 milliseconds
[2022-04-06T14:23:51.977352000Z INFO nats_repro] Run 3: Sending payload to NATS
[2022-04-06T14:23:51.977402000Z INFO nats_repro] Sending payload of size '1000000' bytes to NATS on topic 'payload.test'
[2022-04-06T14:23:54.065687000Z INFO nats_repro] Successfully submitted message to subject 'payload.test' and received result
[2022-04-06T14:23:54.065726000Z INFO nats_repro] Time taken to send request and receive reply was 2088 milliseconds
[2022-04-06T14:23:54.803131000Z INFO nats_repro] Run 4: Sending payload to NATS
[2022-04-06T14:23:54.803198000Z INFO nats_repro] Sending payload of size '1000000' bytes to NATS on topic 'payload.test'
[2022-04-06T14:23:54.936419000Z INFO nats_repro] Successfully submitted message to subject 'payload.test' and received result
[2022-04-06T14:23:54.936465000Z INFO nats_repro] Time taken to send request and receive reply was 133 milliseconds
[2022-04-06T14:23:56.484969000Z INFO nats_repro] Run 5: Sending payload to NATS
[2022-04-06T14:23:56.485010000Z INFO nats_repro] Sending payload of size '1000000' bytes to NATS on topic 'payload.test'
[2022-04-06T14:23:56.586766000Z INFO nats_repro] Successfully submitted message to subject 'payload.test' and received result
[2022-04-06T14:23:56.586811000Z INFO nats_repro] Time taken to send request and receive reply was 101 milliseconds
[2022-04-06T14:23:57.282266000Z INFO nats_repro] Run 6: Sending payload to NATS
[2022-04-06T14:23:57.282333000Z INFO nats_repro] Sending payload of size '1000000' bytes to NATS on topic 'payload.test'
[2022-04-06T14:23:59.573321000Z INFO nats_repro] Successfully submitted message to subject 'payload.test' and received result
[2022-04-06T14:23:59.573379000Z INFO nats_repro] Time taken to send request and receive reply was 2291 milliseconds
[2022-04-06T14:24:00.215526000Z INFO nats_repro] Run 7: Sending payload to NATS
[2022-04-06T14:24:00.215596000Z INFO nats_repro] Sending payload of size '1000000' bytes to NATS on topic 'payload.test'
[2022-04-06T14:24:02.469564000Z INFO nats_repro] Successfully submitted message to subject 'payload.test' and received result
[2022-04-06T14:24:02.469649000Z INFO nats_repro] Time taken to send request and receive reply was 2254 milliseconds
As you can see, this turns a 100ms send time to around 2s. Combined with the original problem, I think this makes this issue even more severe, unfortunately.
Another observation: The delays seem to coincide with the PING <-> PONG messages being sent on connection establishment.
Hi @Jarema is there any progress on this? Nice job team!! :)