Reported memory leak on nats-server
What version were you using?
Last report: v2.9.19
What environment was the server running in?
Cloud Foundry via nats-release
Is this defect reproducible?
We've seen memory usage increase in larger installations, and issues have been raised because of it. Below is our pprof output (collected from a TLS-enabled server) showing the high memory usage of nats-server.
(pprof) top
Showing nodes accounting for 435.93MB, 98.43% of 442.88MB total
Dropped 42 nodes (cum <= 2.21MB)
Showing top 10 nodes out of 20
flat flat% sum% cum cum%
399.93MB 90.30% 90.30% 409.43MB 92.45% github.com/nats-io/nats-server/v2/server.(*Server).createRoute
11.50MB 2.60% 92.90% 11.50MB 2.60% net/url.parse
10MB 2.26% 95.16% 10MB 2.26% encoding/json.(*decodeState).literalStore
4.50MB 1.02% 96.17% 4.50MB 1.02% fmt.Sprintf
4MB 0.9% 97.08% 9MB 2.03% github.com/nats-io/nats-server/v2/server.(*client).initClient
3.50MB 0.79% 97.87% 3.50MB 0.79% github.com/nats-io/nats-server/v2/server.(*client).removeRemoteSubs
2.50MB 0.56% 98.43% 2.50MB 0.56% sync.NewCond (inline)
0 0% 98.43% 10MB 2.26% encoding/json.(*decodeState).object
0 0% 98.43% 10MB 2.26% encoding/json.(*decodeState).unmarshal
0 0% 98.43% 10MB 2.26% encoding/json.(*decodeState).value
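For reference, here is a minimal sketch of how a heap profile like the one above can be pulled from a node. It assumes prof_port is set to a non-zero value and that the profiling port serves the standard Go pprof endpoints; the host, port, and output file name below are placeholders, not our actual values.

// heapdump.go: fetch a heap profile from a running nats-server.
// Assumes prof_port is non-zero and exposes the standard Go pprof paths.
package main

import (
	"io"
	"log"
	"net/http"
	"os"
)

func main() {
	// Placeholder address; substitute the node's host and prof_port.
	resp, err := http.Get("http://10.0.4.5:8221/debug/pprof/heap")
	if err != nil {
		log.Fatalf("fetching heap profile: %v", err)
	}
	defer resp.Body.Close()

	out, err := os.Create("heap.pprof")
	if err != nil {
		log.Fatalf("creating output file: %v", err)
	}
	defer out.Close()

	if _, err := io.Copy(out, resp.Body); err != nil {
		log.Fatalf("writing profile: %v", err)
	}
	log.Println("wrote heap.pprof; inspect it with: go tool pprof heap.pprof")
}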
Given the capability you are leveraging, describe your expectation?
We expect the total memory usage to go down.
Given the expectation, what is the defect you are observing?
Memory usage remains high until the VM is restarted, and then the pattern repeats.
Hi @winkingturtle-vmw, do you often see slow consumers of route type (rid in the logs) in the cluster? Could you point to the common nats-server configuration used by nats-release?
@wallyqs I have not seen any signs of slow consumers and didn't find any in the logs. Here is our nats config:
net: "10.0.4.5"
port: 4222
prof_port: 0
http: "0.0.0.0:0"
write_deadline: "2s"
debug: false
trace: false
logtime: true
authorization {
user: "nats"
password: "<PASSWORD>"
timeout: 15
}
cluster {
no_advertise: false
host: "10.0.4.5"
port: 4223
authorization {
user: "nats"
password: "<PASSWORD>"
timeout: 15
}
tls {
ca_file: "<PATH_TO_FILE>"
cert_file: "<PATH_TO_FILE>"
key_file: "<PATH_TO_FILE>"
cipher_suites: [
"TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384",
"TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256"
]
curve_preferences: [
"CurveP384"
]
timeout: 5 # seconds
verify: true
}
routes = [
<ROUTES>
]
}
no_sys_acc: true
Here's the pprof graph of the issue:
It seems there is one createRoute call site that is ballooning far larger than expected.
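Since the growth appears tied to route handling, we've been watching route churn alongside memory. Below is a rough sketch of the kind of poller we use; it assumes the monitoring port (the http setting in the config above) is enabled on a real port such as 8222, and that the /routez monitoring endpoint reports a num_routes field. The address is a placeholder.

// routewatch.go: periodically poll the NATS monitoring endpoint and log
// the number of active routes, to correlate route churn with memory growth.
// Assumes the monitoring (http) port is enabled and /routez returns num_routes.
package main

import (
	"encoding/json"
	"log"
	"net/http"
	"time"
)

type routez struct {
	NumRoutes int `json:"num_routes"`
}

func main() {
	const url = "http://10.0.4.5:8222/routez" // placeholder host and port
	for {
		resp, err := http.Get(url)
		if err != nil {
			log.Printf("polling %s: %v", url, err)
		} else {
			var rz routez
			if err := json.NewDecoder(resp.Body).Decode(&rz); err != nil {
				log.Printf("decoding routez: %v", err)
			} else {
				log.Printf("num_routes=%d", rz.NumRoutes)
			}
			resp.Body.Close()
		}
		time.Sleep(30 * time.Second)
	}
}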
Hello all. I'm from the same team as @MarcPaquette and @winkingturtle-vmw. We're pretty confident that the root cause was a duplicate IP address. Specifically, another machine on the network had the same IP as one of the NATS servers. Something about that setup caused the NATS clustering to balloon in memory.
To reproduce the issue, we deployed a VM in the same network as our NATS cluster. Then we SSHed onto the VM and ran ifconfig eth0 <IP ADDRESS OF ONE NATS NODE> netmask 255.255.255.192 up. We had to leave the setup running overnight, but by the next morning we were seeing the same kind of memory leak.
On our side of things, we were able to resolve the issue by eliminating the duplicate IP. This seems like an opportunity to make NATS more robust to networking errors, but I'd also understand if the NATS team decided that it's an edge case not worth handling. I'll leave it to y'all to decide whether to address this issue or close it.
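For anyone hitting something similar: one quick way we found to tell whether the machine answering at a node's IP is really the intended nats-server is to dial the client port and look for the INFO line the server sends on connect. Below is a rough sketch; the address is a placeholder, and it assumes the client port accepts plain TCP connections as in the config above (TLS is only configured for the cluster routes). A machine squatting on a duplicated IP typically won't answer with an INFO line at all.

// whoanswers.go: dial the NATS client port at a given address and read the
// initial INFO line that nats-server sends on connect, to confirm which
// server is actually answering at that IP.
package main

import (
	"bufio"
	"encoding/json"
	"log"
	"net"
	"strings"
	"time"
)

func main() {
	// Placeholder address; substitute the node's IP and client port.
	conn, err := net.DialTimeout("tcp", "10.0.4.5:4222", 5*time.Second)
	if err != nil {
		log.Fatalf("dialing: %v", err)
	}
	defer conn.Close()
	conn.SetReadDeadline(time.Now().Add(5 * time.Second))

	line, err := bufio.NewReader(conn).ReadString('\n')
	if err != nil {
		log.Fatalf("no INFO line received (is this really a nats-server?): %v", err)
	}
	if !strings.HasPrefix(line, "INFO ") {
		log.Fatalf("unexpected first line: %q", line)
	}

	var info struct {
		ServerID string `json:"server_id"`
		Version  string `json:"version"`
	}
	if err := json.Unmarshal([]byte(strings.TrimPrefix(line, "INFO ")), &info); err != nil {
		log.Fatalf("parsing INFO: %v", err)
	}
	log.Printf("server_id=%s version=%s", info.ServerID, info.Version)
}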