Reported memory leak on nats-server
What version were you using?
Last report: v2.9.19
What environment was the server running in?
Cloud Foundry via nats-release
Is this defect reproducible?
We've seen memory usage increase in larger installations, and issues have been raised because of it. Below is our pprof output (collected from a TLS-enabled server) showing the high memory usage of nats-server.
(pprof) top
Showing nodes accounting for 435.93MB, 98.43% of 442.88MB total
Dropped 42 nodes (cum <= 2.21MB)
Showing top 10 nodes out of 20
flat flat% sum% cum cum%
399.93MB 90.30% 90.30% 409.43MB 92.45% github.com/nats-io/nats-server/v2/server.(*Server).createRoute
11.50MB 2.60% 92.90% 11.50MB 2.60% net/url.parse
10MB 2.26% 95.16% 10MB 2.26% encoding/json.(*decodeState).literalStore
4.50MB 1.02% 96.17% 4.50MB 1.02% fmt.Sprintf
4MB 0.9% 97.08% 9MB 2.03% github.com/nats-io/nats-server/v2/server.(*client).initClient
3.50MB 0.79% 97.87% 3.50MB 0.79% github.com/nats-io/nats-server/v2/server.(*client).removeRemoteSubs
2.50MB 0.56% 98.43% 2.50MB 0.56% sync.NewCond (inline)
0 0% 98.43% 10MB 2.26% encoding/json.(*decodeState).object
0 0% 98.43% 10MB 2.26% encoding/json.(*decodeState).unmarshal
0 0% 98.43% 10MB 2.26% encoding/json.(*decodeState).value
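For reference, here is a minimal sketch of how a heap profile like the one above can be pulled from a node. It assumes prof_port is set to a non-zero value and that the profiling port serves the standard Go pprof endpoints; the host, port, and output file name below are placeholders, not our actual values.

// heapdump.go: fetch a heap profile from a running nats-server.
// Assumes prof_port is non-zero and exposes the standard Go pprof paths.
package main

import (
	"io"
	"log"
	"net/http"
	"os"
)

func main() {
	// Placeholder address; substitute the node's host and prof_port.
	resp, err := http.Get("http://10.0.4.5:8221/debug/pprof/heap")
	if err != nil {
		log.Fatalf("fetching heap profile: %v", err)
	}
	defer resp.Body.Close()

	out, err := os.Create("heap.pprof")
	if err != nil {
		log.Fatalf("creating output file: %v", err)
	}
	defer out.Close()

	if _, err := io.Copy(out, resp.Body); err != nil {
		log.Fatalf("writing profile: %v", err)
	}
	log.Println("wrote heap.pprof; inspect it with: go tool pprof heap.pprof")
}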
Given the capability you are leveraging, describe your expectation?
We expect the total memory usage to go down.
Given the expectation, what is the defect you are observing?
Memory usage remains high until the VM is restarted, and then the pattern repeats.
Hi @winkingturtle-vmw, do you often see slow consumers of route type (rid in the logs) in the cluster? Could you point to the common nats-server configuration used by nats-release?
@wallyqs I have not seen any signs of slow consumers and didn't find any in the logs. Here is our nats config:
net: "10.0.4.5"
port: 4222
prof_port: 0
http: "0.0.0.0:0"
write_deadline: "2s"
debug: false
trace: false
logtime: true
authorization {
user: "nats"
password: "<PASSWORD>"
timeout: 15
}
cluster {
no_advertise: false
host: "10.0.4.5"
port: 4223
authorization {
user: "nats"
password: "<PASSWORD>"
timeout: 15
}
tls {
ca_file: "<PATH_TO_FILE>"
cert_file: "<PATH_TO_FILE>"
key_file: "<PATH_TO_FILE>"
cipher_suites: [
"TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384",
"TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256"
]
curve_preferences: [
"CurveP384"
]
timeout: 5 # seconds
verify: true
}
routes = [
<ROUTES>
]
}
no_sys_acc: true
Here's the pprof graph of the issue:
It seems there is one createRoute call site that is ballooning far larger than expected.
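Since the growth appears tied to route handling, we've been watching route churn alongside memory. Below is a rough sketch of the kind of poller we use; it assumes the monitoring port (the http setting in the config above) is enabled on a real port such as 8222, and that the /routez monitoring endpoint reports a num_routes field. The address is a placeholder.

// routewatch.go: periodically poll the NATS monitoring endpoint and log
// the number of active routes, to correlate route churn with memory growth.
// Assumes the monitoring (http) port is enabled and /routez returns num_routes.
package main

import (
	"encoding/json"
	"log"
	"net/http"
	"time"
)

type routez struct {
	NumRoutes int `json:"num_routes"`
}

func main() {
	const url = "http://10.0.4.5:8222/routez" // placeholder host and port
	for {
		resp, err := http.Get(url)
		if err != nil {
			log.Printf("polling %s: %v", url, err)
		} else {
			var rz routez
			if err := json.NewDecoder(resp.Body).Decode(&rz); err != nil {
				log.Printf("decoding routez: %v", err)
			} else {
				log.Printf("num_routes=%d", rz.NumRoutes)
			}
			resp.Body.Close()
		}
		time.Sleep(30 * time.Second)
	}
}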
Hello all. I'm from the same team as @MarcPaquette and @winkingturtle-vmw. We're pretty confident that the root cause was a duplicate IP address. Specifically, another machine on the network had the same IP as one of the NATS servers. Something about that setup caused the NATS clustering to balloon in memory.
To reproduce the issue, we deployed a VM in the same network as our NATS cluster. Then we SSHed onto the VM and ran ifconfig eth0 <IP ADDRESS OF ONE NATS NODE> netmask 255.255.255.192 up. We had to leave the setup running overnight, but by the next morning we were seeing the same kind of memory leak.
On our side of things, we were able to resolve the issue by eliminating the duplicate IP. This seems like an opportunity to make NATS more robust to networking errors, but I'd also understand if the NATS team decided that it's an edge case not worth handling. I'll leave it to y'all to decide whether to address this issue or close it.
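For anyone hitting something similar: one quick way we found to tell whether the machine answering at a node's IP is really the intended nats-server is to dial the client port and look for the INFO line the server sends on connect. Below is a rough sketch; the address is a placeholder, and it assumes the client port accepts plain TCP connections as in the config above (TLS is only configured for the cluster routes). A machine squatting on a duplicated IP typically won't answer with an INFO line at all.

// whoanswers.go: dial the NATS client port at a given address and read the
// initial INFO line that nats-server sends on connect, to confirm which
// server is actually answering at that IP.
package main

import (
	"bufio"
	"encoding/json"
	"log"
	"net"
	"strings"
	"time"
)

func main() {
	// Placeholder address; substitute the node's IP and client port.
	conn, err := net.DialTimeout("tcp", "10.0.4.5:4222", 5*time.Second)
	if err != nil {
		log.Fatalf("dialing: %v", err)
	}
	defer conn.Close()
	conn.SetReadDeadline(time.Now().Add(5 * time.Second))

	line, err := bufio.NewReader(conn).ReadString('\n')
	if err != nil {
		log.Fatalf("no INFO line received (is this really a nats-server?): %v", err)
	}
	if !strings.HasPrefix(line, "INFO ") {
		log.Fatalf("unexpected first line: %q", line)
	}

	var info struct {
		ServerID string `json:"server_id"`
		Version  string `json:"version"`
	}
	if err := json.Unmarshal([]byte(strings.TrimPrefix(line, "INFO ")), &info); err != nil {
		log.Fatalf("parsing INFO: %v", err)
	}
	log.Printf("server_id=%s version=%s", info.ServerID, info.Version)
}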