Memory leak in NATS cluster

Open Steel551454 opened this issue 1 year ago • 45 comments

Observed behavior

We have a NATS cluster of three nodes (NATS version is 2.10.16).

host: 127.0.0.1
port: 4222

server_name: nats-02-cluster
accounts {
 $SYS { users = [ { user: "nats", pass: "PASS" } ] }
}

jetstream {
  store_dir=/var/lib/nats
  max_memory_store: 1024Mb
  max_file_store: 819200Mb
}

cluster {
  name: cluster
  listen: 127.0.0.1:6222
  routes: [
    nats-route://nats-00-cluster:6226
    nats-route://nats-01-cluster:6226
    nats-route://nats-02-cluster:6226
  ]
  compression: {
    mode: s2_auto
    rtt_thresholds: [10ms, 50ms, 100ms]
  }
}

http_port: 8222
max_connections: 64K
max_control_line: 4KB
max_payload: 8MB
max_pending: 64MB
max_subscriptions: 0
log_file: /var/log/nats/nats-server.log

Cluster interaction occurs via nginx:

upstream nats {
    server 127.0.0.1:4222;
}

server {
    listen        127.0.0.1:4224 so_keepalive=1m:5s:2;
    listen        192.168.1.2:4224 so_keepalive=1m:5s:2;

    access_log    off;
    tcp_nodelay   on;
    preread_buffer_size 64k;
    proxy_pass    nats;
}

upstream nats-cluster {
    server 127.0.0.1:6222;
}

server {
    listen        127.0.0.1:6226 so_keepalive=1m:5s:2;
    listen        192.168.1.2:6226 so_keepalive=1m:5s:2;

    access_log    off;
    tcp_nodelay   on;
    preread_buffer_size 64k;
    proxy_pass    nats-cluster;
}

Events are forwarded to NATS by the Vector service. The average throughput is 80k events per second (roughly 90 MB/s).

  nats:
    type: "nats"
    inputs:
      - "upstreams.other"
    url: "nats://127.0.0.1:4222"
    request:
      rate_limit_num: 70000
    buffer:
      type: memory
      max_events: 2000
    subject: "{{ type }}"
    acknowledgements:
      enabled: true
    encoding:
      codec: json

Memory usage increases continuously until it reaches the host limit (60 GB), at which point the OOM killer terminates the NATS service. NATS profiles are attached: profiles.tar.gz

Expected behavior

Service memory should not leak

Server and client version

nats-server: 2.10.16 nats: 0.1.4

Host environment

No response

Steps to reproduce

No response

Steel551454 avatar Jun 11 '24 14:06 Steel551454

Thanks for providing the memory profiles!

Can you please try disabling route compression by changing mode from s2_auto to off and see if there's an improvement?

neilalexander avatar Jun 11 '24 14:06 neilalexander
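
For reference, disabling route compression amounts to a change like the following in each node's cluster block (name, listen, and routes stay as before), applied to every node and followed by a restart:

cluster {
  # name, listen and routes unchanged
  compression: {
    mode: off
  }
}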

We did that, but there was no change.

Steel551454 avatar Jun 11 '24 15:06 Steel551454

I removed nginx and now the nodes communicate with each other directly. Memory continues to leak.

Steel551454 avatar Jun 12 '24 06:06 Steel551454

profiles.zip — these are the current profiles.

Steel551454 avatar Jun 12 '24 06:06 Steel551454

Your latest profile suggests there are still a lot of allocations in the route S2 writer, are you sure route compression was disabled properly? You may need to do a rolling restart of the cluster nodes to ensure it's taken effect.

neilalexander avatar Jun 12 '24 08:06 neilalexander

You are right: I forgot to turn off compression on one server

Steel551454 avatar Jun 12 '24 11:06 Steel551454

profiles.zip

After disabling nginx and turning off compression, memory continues to leak.

Steel551454 avatar Jun 12 '24 12:06 Steel551454

OK, this latest profile shows a different type of memory build-up than before (this one shows Raft append entries, which weren't evident last time).

Can you please post more details about your cluster? What spec of machines are the cluster nodes running on? Are all of the cluster nodes the same CPU/RAM/disk-wise? Do you see these build-ups on a single node or multiple?

neilalexander avatar Jun 12 '24 12:06 neilalexander

The NATS cluster is running on servers with the following specifications: 64 GB RAM, Intel(R) Xeon(R) E-2236 CPU @ 3.40GHz, 890 GB SSD. All servers are identical and use 10GbE network cards. The operating system is Arch Linux. Memory usage across the servers is uneven: the node with the most primary replicas consumes the most memory.

Steel551454 avatar Jun 12 '24 13:06 Steel551454

Do you async publish for JetStream?

derekcollison avatar Jun 12 '24 13:06 derekcollison

Honestly, I'm not sure how this is implemented in vector.dev. Here is a link to the module: https://vector.dev/docs/reference/configuration/sinks/nats/ https://github.com/vectordotdev/vector/tree/master/src/sinks/nats

Steel551454 avatar Jun 12 '24 13:06 Steel551454

I reviewed the source code of the NATS module and saw that it calls async_nats.

Steel551454 avatar Jun 13 '24 06:06 Steel551454

Maybe we can have @Jarema take a look, since it's using the Rust client.

derekcollison avatar Jun 13 '24 11:06 derekcollison

@derekcollison A quick glance shows that vector is using Core NATS publish, so not even JetStream async publish.

Jarema avatar Jun 13 '24 12:06 Jarema
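
To illustrate the difference being pointed out, here is a minimal Go sketch using nats.go; the URL, subject, and payload are placeholders, not taken from the setup above.

package main

import (
	"log"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect("nats://127.0.0.1:4222")
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Drain()

	// Core NATS publish: fire-and-forget. The client never waits for the
	// stream, so a fast producer can outrun JetStream's storage path.
	if err := nc.Publish("events.raw", []byte(`{"hello":"world"}`)); err != nil {
		log.Fatal(err)
	}

	// JetStream publish: blocks until the stream returns a PubAck, which
	// gives the producer natural backpressure.
	js, err := nc.JetStream()
	if err != nil {
		log.Fatal(err)
	}
	if _, err := js.Publish("events.raw", []byte(`{"hello":"world"}`)); err != nil {
		log.Fatal(err)
	}
}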

OK, it's very easy to overload the system in that case. This will balloon the internal append entries, since that pipeline needs to internally queue the messages and then write them to the store.

derekcollison avatar Jun 13 '24 12:06 derekcollison

Will you fix this? Or do we need to make changes on our end?

Steel551454 avatar Jun 13 '24 12:06 Steel551454

The issue needs to be rectified in Vector by switching from Core NATS publishes to JetStream publishes, as currently the Core NATS publishes can potentially send data into JetStream faster than it can be processed. This explains the build-up of append entries in memory that you are seeing.

It looks like there's already an issue tracking this on their repository: https://github.com/vectordotdev/vector/issues/10534

neilalexander avatar Jun 13 '24 12:06 neilalexander
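
Where throughput matters, JetStream async publishing with a bounded number of in-flight messages is a middle ground; a rough Go sketch follows (the pending limit, subject, and payload are illustrative).

package main

import (
	"log"
	"time"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect("nats://127.0.0.1:4222")
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Drain()

	// Cap the number of unacknowledged async publishes (256 is an
	// arbitrary example value). When the cap is reached, PublishAsync
	// blocks, so the producer cannot run unboundedly ahead of the stream.
	js, err := nc.JetStream(nats.PublishAsyncMaxPending(256))
	if err != nil {
		log.Fatal(err)
	}

	for i := 0; i < 100000; i++ {
		if _, err := js.PublishAsync("events.raw", []byte(`{"hello":"world"}`)); err != nil {
			log.Fatal(err)
		}
	}

	// Wait for the stream to acknowledge everything still in flight.
	select {
	case <-js.PublishAsyncComplete():
	case <-time.After(10 * time.Second):
		log.Println("timed out waiting for stream acks")
	}
}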

We will die faster than the task above will be completed :) (It was created in 2021)

Is there any chance you could add some kind of setting to limit the rate of JetStream forwarding?

Steel551454 avatar Jun 13 '24 12:06 Steel551454

We would not approach it that way; we should not slow down normal NATS Core publishers due to a misconfiguration.

We are considering a way to protect the server by dropping AppendEntry msgs from the NRG (raft) layer. That would avoid memory bloat but would cause the system to thrash a bit catching up the NRG followers when they detect gaps from the dropped messages.

derekcollison avatar Jun 13 '24 12:06 derekcollison

@Steel551454 I plan to contribute to the issue mentioned above sometime in Q3 and introduce JetStream support. The current sink does not actually support acks, despite what the docs say.

Jarema avatar Jun 13 '24 12:06 Jarema

Let's say we turn off JetStream. Where in the NATS settings can we specify where events are stored?

Steel551454 avatar Jun 13 '24 12:06 Steel551454

If you turn off JetStream, messages will not be stored anywhere; delivery becomes at-most-once, and you need a subscriber application that processes them.

In JetStream, you can define the store directory here:

jetstream {
  store_dir: /path
}

or by providing the -sd flag.

Jarema avatar Jun 13 '24 13:06 Jarema
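
For example (the path is illustrative):

nats-server -js -sd /var/lib/nats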

Do you happen to have a simple proxy written in Go that transforms Core NATS publishes into JetStream publishes?

Steel551454 avatar Jun 13 '24 13:06 Steel551454
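
A minimal version of such a bridge could look roughly like the Go sketch below; the subjects, queue group name, and error handling are illustrative assumptions, not an existing tool.

package main

import (
	"log"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect("nats://127.0.0.1:4222")
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Drain()

	js, err := nc.JetStream()
	if err != nil {
		log.Fatal(err)
	}

	// Listen on the plain Core NATS subjects the producer already uses
	// (a queue group lets several bridge instances share the load) and
	// republish each message into a stream-backed subject. js.Publish
	// waits for the stream's ack before the next message is handled.
	_, err = nc.QueueSubscribe("raw.>", "bridge", func(m *nats.Msg) {
		if _, err := js.Publish("events."+m.Subject, m.Data); err != nil {
			log.Printf("failed to store %s: %v", m.Subject, err)
		}
	})
	if err != nil {
		log.Fatal(err)
	}

	select {} // run until killed
}

Note that because js.Publish waits for the stream's ack inside the callback, a slow stream backs messages up in the bridge's own subscription, so this relocates the backpressure problem rather than removing it.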

Today we replaced the pipeline parser vector.dev with redpanda-connect, which has a plugin for working with NATS JetStream. The memory leak issue has not been resolved. Attached is an archive with profiles: profiles.zip

Steel551454 avatar Jun 14 '24 10:06 Steel551454

@derekcollison, I'm sorry to bother you, but switching our pipeline to use JetStream did not solve the memory leak issue. Maybe we should add some explicit limiter? The situation where a cluster node crashes due to OOM cannot be considered good.

Steel551454 avatar Jun 15 '24 10:06 Steel551454

Agreed, we could simply drop messages and not place them into the stream. The system will complain about high lag getting messages into the stream; those warnings should be in the log.

However, in this case I would imagine you want the system to store the messages. So you either need to slow down the publisher or speed up the storage mechanism, meaning running multiple parallel streams and having the NATS system transparently partition the subject space across them.

derekcollison avatar Jun 15 '24 13:06 derekcollison
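
As a rough sketch of what such partitioning could look like using server-side subject mapping (the subject names and the partition count of 3 are illustrative, not a recommendation):

mappings: {
  "events.*": "events.{{wildcard(1)}}.{{partition(3,1)}}"
}

Each of the three streams would then capture one partition, e.g. events.*.0, events.*.1 and events.*.2, ideally with their stream leaders placed on different servers.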

Do I understand correctly that in our case, for faster message storage in streams, we should run multiple instances (preferably on different servers) and distribute the streams among them?

Steel551454 avatar Jun 15 '24 13:06 Steel551454

Or do you have something else in mind?

Steel551454 avatar Jun 15 '24 13:06 Steel551454

And another question: would the memory leak situation change if we used NVMe disks instead of traditional SSDs to store the events?

Steel551454 avatar Jun 15 '24 13:06 Steel551454

Yes that is correct. @jnmoyne can help with how that gets put together.

The memory leak is not a leak: since the publishing layer does not wait and publishes as fast as Core NATS allows (Core NATS can do >10M msgs/s vs. around ~250k msgs/s for JetStream), the system is simply holding onto all of the staged messages waiting to be stored into the stream.

NVMe probably would not make a difference in this case IMO.

derekcollison avatar Jun 15 '24 14:06 derekcollison