nats.rs icon indicating copy to clipboard operation
nats.rs copied to clipboard

Request hang issues during connection.request / connection.publish when using TLS and larger payloads

Open choerst-ibm opened this issue 2 years ago • 3 comments

For debugging convenience, I created a minimal project containing all configs, scripts and code here: https://github.com/choerst-ibm/nats-repro

See README.md there on how to run it most easily.

NATS version (grep 'name = "nats"' Cargo.lock -A 1)

0.18.1

rustc version (rustc --version - we support Rust 1.41 and up)

rustc 1.56.1

OS/Container environment:

TLS certificates and the respective CA were pre-generated to have localhost and the host.docker.internal and nats-main.example.com as valid DNS altnames. See the linked repo up top for details if you want to generate them in the same way.

Running a local NATS server in docker with

docker run -d --name nats-main.example.com -v $(pwd)/certs/test-server.key:/test-server.key \
-v $(pwd)/certs/test-server.pem:/test-server.pem -v $(pwd)/nats.conf:/nats-server.conf \
-p 4222:4222 -p 6222:6222 -p 8222:8222 -p 4242:4242 nats -c nats-server.conf

nats-server.conf

port: 4222

debug: true
trace: true
max_traced_msg_len: 100

write_deadline = "10s"
max_payload    = 64000000
max_pending    = 256000000

monitor_port: 8222

authorization {
  token: payloadtesttoken
  timeout: 0.75
}

cluster {
  name: "payload_test_cluster"
  port: 6222
  routes = []
}

tls: {
  cert_file: "./test-server.pem"
  key_file: "./test-server.key"
  timeout: 1
}

Running a nats-replier in docker with:

docker run --rm -v $PWD:/workspace -it --add-host host.docker.internal:host-gateway synadia/nats-box

In said container:

nats reply --tlsca /workspace/certs/test-ca.pem -s tls://[email protected]:4222 payload.test Hello > /dev/null

Steps or code to reproduce the issue:

use env_logger::Env;
use std::thread;
use std::time::{Duration, SystemTime};
use log::{debug, error, info};

fn send_nats_payload(subject: &str, payload: &str, connection: &nats::Connection) {
    let now = SystemTime::now();
    info!("Sending payload of size '{:?}' bytes to NATS on topic '{}'", payload.len(), subject);
    match connection.request(subject, payload) {
        Ok(response) => {
            info!("Successfully submitted message to subject '{}' and received result", subject);
            info!("Time taken to send request and receive reply was {} milliseconds", now.elapsed().unwrap().as_millis());
            debug!("Received response message '{:?}'", response.data);
        }
        Err(error) => {
            let error_message =
                format!("Error during publish to subject '{}': {:?}", subject, error);
            error!("{}", error_message);
        }
    }
}

fn main() {
    env_logger::Builder::from_env(Env::default().default_filter_or("info")).format_timestamp_nanos().init();

    let payload_size_mb = 15;
    let payload = "T".repeat(payload_size_mb * 1000 * 1000);
    let subject = "payload.test";

    info!("Establishing connection to NATS");
    let connection = nats::Options::with_token("payloadtesttoken")
                .add_root_certificate("test-data/certs/test-ca.pem")
                .connect("tls://localhost:4222")
                .unwrap();

    for run in 1..100 {
        info!("Run {}: Sending payload to NATS", run);
        send_nats_payload(subject, &payload, &connection);

        info!("Waiting 1 seconds between requests");
        thread::sleep(Duration::from_secs(1));
    }
}

Expected result:

Having somewhat predictable req<->response timings, with time taken going up according to payload size without lock-ups. I'd expect something like this (Running with TLS):

[2022-03-31T13:27:46.787019000Z INFO  nats_repro] Establishing connection to NATS
[2022-03-31T13:27:47.471911000Z INFO  nats_repro] Run 1: Sending payload to NATS
[2022-03-31T13:27:47.471971000Z INFO  nats_repro] Sending payload of size '2000000' bytes to NATS on topic 'payload.test'
[2022-03-31T13:27:49.732270000Z INFO  nats_repro] Successfully submitted message to subject 'payload.test' and received result
[2022-03-31T13:27:49.732310000Z INFO  nats_repro] Time taken to send request and receive reply was 2260 milliseconds
[2022-03-31T13:27:49.732331000Z INFO  nats_repro] Waiting 1 seconds between requests
[2022-03-31T13:27:50.735865000Z INFO  nats_repro] Run 2: Sending payload to NATS
[2022-03-31T13:27:50.735927000Z INFO  nats_repro] Sending payload of size '2000000' bytes to NATS on topic 'payload.test'
[2022-03-31T13:27:50.867102000Z INFO  nats_repro] Successfully submitted message to subject 'payload.test' and received result
[2022-03-31T13:27:50.867140000Z INFO  nats_repro] Time taken to send request and receive reply was 131 milliseconds
[2022-03-31T13:27:50.867160000Z INFO  nats_repro] Waiting 1 seconds between requests
[2022-03-31T13:27:51.871206000Z INFO  nats_repro] Run 3: Sending payload to NATS
[2022-03-31T13:27:51.871267000Z INFO  nats_repro] Sending payload of size '2000000' bytes to NATS on topic 'payload.test'
[2022-03-31T13:27:52.003510000Z INFO  nats_repro] Successfully submitted message to subject 'payload.test' and received result
[2022-03-31T13:27:52.003542000Z INFO  nats_repro] Time taken to send request and receive reply was 132 milliseconds
[2022-03-31T13:27:52.003558000Z INFO  nats_repro] Waiting 1 seconds between requests
[2022-03-31T13:27:53.007835000Z INFO  nats_repro] Run 4: Sending payload to NATS
[2022-03-31T13:27:53.007874000Z INFO  nats_repro] Sending payload of size '2000000' bytes to NATS on topic 'payload.test'
[2022-03-31T13:27:53.143732000Z INFO  nats_repro] Successfully submitted message to subject 'payload.test' and received result
[2022-03-31T13:27:53.143764000Z INFO  nats_repro] Time taken to send request and receive reply was 135 milliseconds
[2022-03-31T13:27:53.143781000Z INFO  nats_repro] Waiting 1 seconds between requests

Running without TLS, even larger payloads aren't stuck and have comparable run times:

[2022-03-31T16:49:21.475659000Z INFO  nats_repro] Establishing connection to NATS
[2022-03-31T16:49:22.326724000Z INFO  nats_repro] Run 1: Sending payload to NATS
[2022-03-31T16:49:22.326776000Z INFO  nats_repro] Sending payload of size '15000000' bytes to NATS on topic 'payload.test'
[2022-03-31T16:49:23.310846000Z INFO  nats_repro] Successfully submitted message to subject 'payload.test' and received result
[2022-03-31T16:49:23.310899000Z INFO  nats_repro] Time taken to send request and receive reply was 984 milliseconds
[2022-03-31T16:49:23.310940000Z INFO  nats_repro] Waiting 1 seconds between requests
[2022-03-31T16:49:24.315603000Z INFO  nats_repro] Run 2: Sending payload to NATS
[2022-03-31T16:49:24.315649000Z INFO  nats_repro] Sending payload of size '15000000' bytes to NATS on topic 'payload.test'
[2022-03-31T16:49:25.515522000Z INFO  nats_repro] Successfully submitted message to subject 'payload.test' and received result
[2022-03-31T16:49:25.515590000Z INFO  nats_repro] Time taken to send request and receive reply was 1199 milliseconds

Actual result:

There seem to be lock-up issues that are only seen when running with TLS with larger NATS payloads. A large payload (15MB) is used to force this problem reliably here, I've also seen the issue intermittendly with smaller (>2MB) payload sizes.

[2022-03-31T13:13:18.061529000Z INFO  nats_repro] Establishing connection to NATS
[2022-03-31T13:13:19.135026000Z INFO  nats_repro] Run 1: Sending payload to NATS
[2022-03-31T13:13:19.135070000Z INFO  nats_repro] Sending payload of size '15000000' bytes to NATS on topic 'payload.test'
[2022-03-31T13:13:22.193753000Z INFO  nats_repro] Successfully submitted message to subject 'payload.test' and received result
[2022-03-31T13:13:22.193812000Z INFO  nats_repro] Time taken to send request and receive reply was 3058 milliseconds
[2022-03-31T13:13:22.193893000Z INFO  nats_repro] Waiting 1 seconds between requests
[2022-03-31T13:13:23.197879000Z INFO  nats_repro] Run 2: Sending payload to NATS
[2022-03-31T13:13:23.197932000Z INFO  nats_repro] Sending payload of size '15000000' bytes to NATS on topic 'payload.test'
[2022-03-31T13:17:21.792501000Z INFO  nats_repro] Successfully submitted message to subject 'payload.test' and received result
[2022-03-31T13:17:21.792535000Z INFO  nats_repro] Time taken to send request and receive reply was 238594 milliseconds
[2022-03-31T13:17:21.792551000Z INFO  nats_repro] Waiting 1 seconds between requests

After the first request, the second request will take close to 4 minutes to complete (In this time, the nats server seems to be receiving no payload data in that time (nats-server log):

[1] 2022/03/31 13:13:21.785444 [TRC] 172.17.0.1:58200 - cid:6 - "v0.18.1:rust" - <<- MSG_PAYLOAD: ["TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT..."]
[1] 2022/03/31 13:13:21.788880 [TRC] 172.17.0.1:58196 - cid:5 - "v1.11.0:go:NATS CLI Version 20210429" - ->> [MSG payload.test 1 _INBOX.mrtNZiFtiuFvmpwHUZaj0v 15000000]
[1] 2022/03/31 13:13:21.864693 [TRC] 172.17.0.1:58200 - cid:6 - "v0.18.1:rust" - <<- [PONG]
[1] 2022/03/31 13:13:22.218296 [TRC] 172.17.0.1:58196 - cid:5 - "v1.11.0:go:NATS CLI Version 20210429" - <<- [PUB _INBOX.mrtNZiFtiuFvmpwHUZaj0v 5]
[1] 2022/03/31 13:13:22.218357 [TRC] 172.17.0.1:58196 - cid:5 - "v1.11.0:go:NATS CLI Version 20210429" - <<- MSG_PAYLOAD: ["Hello"]
[1] 2022/03/31 13:13:22.218383 [TRC] 172.17.0.1:58200 - cid:6 - "v0.18.1:rust" - ->> [MSG _INBOX.mrtNZiFtiuFvmpwHUZaj0v 1 5]
[1] 2022/03/31 13:13:22.220596 [TRC] 172.17.0.1:58200 - cid:6 - "v0.18.1:rust" - <<- [UNSUB 1]
[1] 2022/03/31 13:13:22.220630 [TRC] 172.17.0.1:58200 - cid:6 - "v0.18.1:rust" - <-> [DELSUB 1]
[1] 2022/03/31 13:13:23.227143 [TRC] 172.17.0.1:58200 - cid:6 - "v0.18.1:rust" - <<- [SUB _INBOX.mrtNZiFtiuFvmpwHUZaj2O 2]
[1] 2022/03/31 13:13:23.227295 [TRC] 172.17.0.1:58200 - cid:6 - "v0.18.1:rust" - <<- [PUB payload.test _INBOX.mrtNZiFtiuFvmpwHUZaj2O 15000000]
[1] 2022/03/31 13:14:39.931255 [TRC] 172.17.0.1:58196 - cid:5 - "v1.11.0:go:NATS CLI Version 20210429" - <<- [PING]
[1] 2022/03/31 13:14:39.931285 [TRC] 172.17.0.1:58196 - cid:5 - "v1.11.0:go:NATS CLI Version 20210429" - ->> [PONG]
[1] 2022/03/31 13:14:42.044169 [DBG] 172.17.0.1:58196 - cid:5 - "v1.11.0:go:NATS CLI Version 20210429" - Client Ping Timer
[1] 2022/03/31 13:14:42.044297 [DBG] 172.17.0.1:58196 - cid:5 - "v1.11.0:go:NATS CLI Version 20210429" - Delaying PING due to client activity 1m20s ago
[1] 2022/03/31 13:15:21.224589 [DBG] 172.17.0.1:58200 - cid:6 - "v0.18.1:rust" - Client Ping Timer
[1] 2022/03/31 13:15:21.224639 [DBG] 172.17.0.1:58200 - cid:6 - "v0.18.1:rust" - Delaying PING due to client activity 1m58s ago
[1] 2022/03/31 13:16:39.795901 [TRC] 172.17.0.1:58196 - cid:5 - "v1.11.0:go:NATS CLI Version 20210429" - <<- [PING]
[1] 2022/03/31 13:16:39.795940 [TRC] 172.17.0.1:58196 - cid:5 - "v1.11.0:go:NATS CLI Version 20210429" - ->> [PONG]
[1] 2022/03/31 13:16:41.909175 [DBG] 172.17.0.1:58196 - cid:5 - "v1.11.0:go:NATS CLI Version 20210429" - Client Ping Timer
[1] 2022/03/31 13:16:41.909276 [DBG] 172.17.0.1:58196 - cid:5 - "v1.11.0:go:NATS CLI Version 20210429" - Delaying PING due to remote ping 2s ago
[1] 2022/03/31 13:17:21.088415 [DBG] 172.17.0.1:58200 - cid:6 - "v0.18.1:rust" - Client Ping Timer
[1] 2022/03/31 13:17:21.088509 [TRC] 172.17.0.1:58200 - cid:6 - "v0.18.1:rust" - ->> [PING]
[1] 2022/03/31 13:17:21.346264 [TRC] 172.17.0.1:58200 - cid:6 - "v0.18.1:rust" - <<- MSG_PAYLOAD: ["TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT..."]
[1] 2022/03/31 13:17:21.349824 [TRC] 172.17.0.1:58196 - cid:5 - "v1.11.0:go:NATS CLI Version 20210429" - ->> [MSG payload.test 1 _INBOX.mrtNZiFtiuFvmpwHUZaj2O 15000000]

If the issue is not hit with the sample right away, I suggest increasing payload_size_mb in the sample code, this showed wildly different results for me (Theoretically this could go up to 64MB and beyond even). On different systems the payload size leading to a hang also seems to be different (macOS around 4MB, Linux 15MB). I've also observed the same issues using connection.publish.

I suspect the issue could be related to the added locking and waiting mechanism for TLS in connector.rs: https://github.com/nats-io/nats.rs/blob/c8786c3b4924e0df5c4d750b0b9f2680f5608aa0/src/connector.rs#L461

choerst-ibm avatar Mar 31 '22 16:03 choerst-ibm

@choerst-ibm thanks for the extensive report! We will look into this very shortly.

Jarema avatar Apr 01 '22 18:04 Jarema

To add some more context here, this 4 minute hang can be avoided by not re-using connections, which in itself is not great. That connection recreation leads to another problem though, namely that the first request on a connection will have a significant delay very often, even if the payload size is rather small. I adjusted the sample to move the connection creation within the loop itself to simulate this and removed the wait (Sample with 1 MB payload size below):

[2022-04-06T14:23:47.988847000Z INFO  nats_repro] Run 1: Sending payload to NATS
[2022-04-06T14:23:47.988938000Z INFO  nats_repro] Sending payload of size '1000000' bytes to NATS on topic 'payload.test'
[2022-04-06T14:23:48.122026000Z INFO  nats_repro] Successfully submitted message to subject 'payload.test' and received result
[2022-04-06T14:23:48.122125000Z INFO  nats_repro] Time taken to send request and receive reply was 133 milliseconds
[2022-04-06T14:23:48.859358000Z INFO  nats_repro] Run 2: Sending payload to NATS
[2022-04-06T14:23:48.859435000Z INFO  nats_repro] Sending payload of size '1000000' bytes to NATS on topic 'payload.test'
[2022-04-06T14:23:51.081669000Z INFO  nats_repro] Successfully submitted message to subject 'payload.test' and received result
[2022-04-06T14:23:51.081746000Z INFO  nats_repro] Time taken to send request and receive reply was 2222 milliseconds
[2022-04-06T14:23:51.977352000Z INFO  nats_repro] Run 3: Sending payload to NATS
[2022-04-06T14:23:51.977402000Z INFO  nats_repro] Sending payload of size '1000000' bytes to NATS on topic 'payload.test'
[2022-04-06T14:23:54.065687000Z INFO  nats_repro] Successfully submitted message to subject 'payload.test' and received result
[2022-04-06T14:23:54.065726000Z INFO  nats_repro] Time taken to send request and receive reply was 2088 milliseconds
[2022-04-06T14:23:54.803131000Z INFO  nats_repro] Run 4: Sending payload to NATS
[2022-04-06T14:23:54.803198000Z INFO  nats_repro] Sending payload of size '1000000' bytes to NATS on topic 'payload.test'
[2022-04-06T14:23:54.936419000Z INFO  nats_repro] Successfully submitted message to subject 'payload.test' and received result
[2022-04-06T14:23:54.936465000Z INFO  nats_repro] Time taken to send request and receive reply was 133 milliseconds
[2022-04-06T14:23:56.484969000Z INFO  nats_repro] Run 5: Sending payload to NATS
[2022-04-06T14:23:56.485010000Z INFO  nats_repro] Sending payload of size '1000000' bytes to NATS on topic 'payload.test'
[2022-04-06T14:23:56.586766000Z INFO  nats_repro] Successfully submitted message to subject 'payload.test' and received result
[2022-04-06T14:23:56.586811000Z INFO  nats_repro] Time taken to send request and receive reply was 101 milliseconds
[2022-04-06T14:23:57.282266000Z INFO  nats_repro] Run 6: Sending payload to NATS
[2022-04-06T14:23:57.282333000Z INFO  nats_repro] Sending payload of size '1000000' bytes to NATS on topic 'payload.test'
[2022-04-06T14:23:59.573321000Z INFO  nats_repro] Successfully submitted message to subject 'payload.test' and received result
[2022-04-06T14:23:59.573379000Z INFO  nats_repro] Time taken to send request and receive reply was 2291 milliseconds
[2022-04-06T14:24:00.215526000Z INFO  nats_repro] Run 7: Sending payload to NATS
[2022-04-06T14:24:00.215596000Z INFO  nats_repro] Sending payload of size '1000000' bytes to NATS on topic 'payload.test'
[2022-04-06T14:24:02.469564000Z INFO  nats_repro] Successfully submitted message to subject 'payload.test' and received result
[2022-04-06T14:24:02.469649000Z INFO  nats_repro] Time taken to send request and receive reply was 2254 milliseconds

As you can see, this turns a 100ms send time to around 2s. Combined with the original problem, I think this makes this issue even more severe, unfortunately.

Another observation: The delays seem to coincide with the PING <-> PONG messages being sent on connection establishment.

choerst-ibm avatar Apr 06 '22 16:04 choerst-ibm

Hi @Jarema is there any progress on this? Nice job team!! :)

atilag avatar Apr 27 '22 14:04 atilag