nats-server
During rolling update in Kubernetes data in KV store is lost in some nodes
We are having a problem with our NATS cluster in a Kubernetes cluster, where the entries in KV stores differ between the NATS nodes after server restarts, leaving us with data loss in some cases. As the Kubernetes cluster gets updated every week, this issue has now been seen several times, although we cannot reproduce it in a stable manner. Our current workaround is that if we don't get the data on the first attempt, we repeat the same request up to 5 times; due to the server-side request routing we then normally get the wanted data.
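A minimal sketch of that retry workaround (bucket and key names here are only placeholders, not from our real setup):

```go
// Sketch of the retry workaround described above: retry the KV read a few
// times, since a different server may answer the repeated request.
// Bucket and key names are placeholders.
package main

import (
	"log"
	"time"

	"github.com/nats-io/nats.go"
)

func getWithRetry(kv nats.KeyValue, key string) (nats.KeyValueEntry, error) {
	var entry nats.KeyValueEntry
	var err error
	for attempt := 0; attempt < 5; attempt++ {
		entry, err = kv.Get(key)
		if err == nil {
			return entry, nil
		}
		time.Sleep(200 * time.Millisecond)
	}
	return nil, err
}

func main() {
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()

	js, err := nc.JetStream()
	if err != nil {
		log.Fatal(err)
	}
	kv, err := js.KeyValue("cfg") // placeholder bucket name
	if err != nil {
		log.Fatal(err)
	}
	entry, err := getWithRetry(kv, "some.key") // placeholder key
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("%s = %s", entry.Key(), entry.Value())
}
```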
Defect

- [x] Included `nats-server -DV` output
- [x] Included a [Minimal, Complete, and Verifiable example](https://stackoverflow.com/help/mcve)

Versions of `nats-server` and affected client libraries used:
```
[68] 2023/07/31 10:13:00.901015 [INF] Starting nats-server
[68] 2023/07/31 10:13:00.901069 [INF] Version: 2.9.20
[68] 2023/07/31 10:13:00.901073 [INF] Git: [not set]
[68] 2023/07/31 10:13:00.901079 [DBG] Go build: go1.20.6
[68] 2023/07/31 10:13:00.901091 [INF] Name: ND2HTZ5KWVBYGEJPIJBGHBVKSPXTUSWULKAOKXKSFLQNW23C6RYO7DTC
[68] 2023/07/31 10:13:00.901100 [INF] ID: ND2HTZ5KWVBYGEJPIJBGHBVKSPXTUSWULKAOKXKSFLQNW23C6RYO7DTC
[68] 2023/07/31 10:13:00.901175 [DBG] Created system account: "$SYS"
```
(We used a custom build of 2.9.20, where we included this specific fix: https://github.com/nats-io/nats-server/issues/4320. The problem also occurred with the official 2.9.19 nats-server container.)
All client libraries are affected.
OS/Container environment:
Kubernetes containerd
Steps or code to reproduce the issue:

Have the following cluster setup:

- 3 NATS nodes
- Some (~1000) KV stores with replicas set to 3 and file persistence
- Write random entries to the KV stores and keep writing/updating (a rough sketch follows after this list)
- Restart the nodes one by one (rolling update of the cluster/NATS)
- Check that all entries exist in all KV stores on all NATS servers/nodes, e.g. by requesting all entries of the same KV store multiple times with `nats kv history <keyvaluestore> --trace` (the trace option shows which server answered)
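Rough sketch of the write load in Go (simplified; bucket names and counts are placeholders, our real setup has ~1000 buckets):

```go
// Rough sketch of the write load: create R3, file-backed KV buckets and keep
// writing random entries to them while the rolling restart happens.
package main

import (
	"fmt"
	"log"
	"math/rand"
	"time"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()

	js, err := nc.JetStream()
	if err != nil {
		log.Fatal(err)
	}

	// Create a handful of buckets for the sketch.
	buckets := make([]nats.KeyValue, 0, 10)
	for i := 0; i < 10; i++ {
		kv, err := js.CreateKeyValue(&nats.KeyValueConfig{
			Bucket:   fmt.Sprintf("testbucket-%d", i),
			Replicas: 3,
			Storage:  nats.FileStorage,
		})
		if err != nil {
			log.Fatal(err)
		}
		buckets = append(buckets, kv)
	}

	// Keep writing/updating random keys in random buckets.
	for {
		kv := buckets[rand.Intn(len(buckets))]
		key := fmt.Sprintf("key-%d", rand.Intn(100))
		if _, err := kv.Put(key, []byte(time.Now().Format(time.RFC3339Nano))); err != nil {
			log.Printf("put failed: %v", err)
		}
		time.Sleep(10 * time.Millisecond)
	}
}
```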
Expected result:
Requesting all entries of the same KV store multiple times should always return the same list.
Actual result:
Sometimes after restarts the result differs between the nodes and stays different.
Additional details
This does not occur after every restart; it also seems to depend a bit on how long the node was down, and for some KVs it seems to heal itself after some time (sometimes minutes, sometimes hours, sometimes days).
Go code for the test script:
The following script lists the entries of each KV store multiple times and prints the result. As the NATS server routes the traffic itself, we don't need to switch servers.
```go
package main

import (
	"log"
	"os"
	"strings"

	"github.com/nats-io/nats.go"
)

// connect creates a NATS connection and returns its JetStream context.
// The server URL is taken from the NATS_URL environment variable.
func connect() nats.JetStreamContext {
	url := os.Getenv("NATS_URL")
	if url == "" {
		url = nats.DefaultURL
	}
	nc, err := nats.Connect(url)
	if err != nil {
		log.Fatal(err)
	}
	js, err := nc.JetStream()
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("connected %s", url)
	return js
}

// kvNames returns the names of all KV buckets, stripping the "KV_" prefix
// from the underlying stream names.
func kvNames() []string {
	js := connect()
	list := make([]string, 0, 1024)
	for r := range js.KeyValueStoreNames() {
		list = append(list, r[3:])
	}
	return list
}

func main() {
	log.Print("natscheck ...")
	list := kvNames()
	js := connect()
	for i, kvname := range list {
		log.Printf("%d/%d check bucket: %s", i, len(list), kvname)
		var lastList []string
		entries := -1
		for loop := 0; loop < 10; loop++ {
			kv, err := js.KeyValue(kvname)
			if err != nil {
				log.Fatalf("[%s]: %v", kvname, err)
			}
			currList, err := kv.Keys()
			if err != nil && !strings.Contains(err.Error(), "no keys found") {
				log.Fatal(err)
			}
			// kv.History()
			if loop > 0 {
				// Compare the current key list with the previous iteration in both directions.
				diff1 := difference(currList, lastList)
				if len(diff1) > 0 {
					log.Printf("Found %d/%d different keys in current, but not in last list: %s", len(diff1), len(currList), strings.Join(diff1, ", "))
					entries = -1
				}
				diff2 := difference(lastList, currList)
				if len(diff2) > 0 {
					log.Printf("Found %d/%d different keys in last, but not in current list: %s", len(diff2), len(lastList), strings.Join(diff2, ", "))
					entries = -1
				}
			} else {
				entries = len(currList)
			}
			lastList = currList
		}
		if entries > 0 {
			log.Printf("Found %d same entries", entries)
		}
	}
}

// difference returns the elements of a that are not present in b.
func difference(a, b []string) []string {
	mb := make(map[string]struct{}, len(b))
	for _, x := range b {
		mb[x] = struct{}{}
	}
	var diff []string
	for _, x := range a {
		if _, found := mb[x]; !found {
			diff = append(diff, x)
		}
	}
	return diff
}
```
When doing restarts, do you wait for /healthz to return ok before moving to the next server?
It would also be good to make sure we can reproduce this from the top of main, the 2.9.21 release candidate.
The readiness check waits for /healthz to return ok, and before that we don't route any traffic to that NATS server. However, I have the impression that the new NATS server registers itself with the other NATS servers before its /healthz returns ok, and that those ready NATS nodes then route requests to it. But that is just an impression from the CPU and memory stats and some logs I had; I can't say it reliably.
For direct GETs, which all stream peers can participate in: yes, once they think they are caught up they can participate.
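To illustrate the difference (a sketch, not from the thread; bucket and key names are placeholders): a KV `Get` may be answered via the direct-get path by any peer that believes it is caught up, whereas reading the same key through the stream-level message-get API should, as far as I understand, be answered through the JetStream API by the stream leader.

```go
// Sketch: compare a KV read with a stream-level read of the same key.
// A KV bucket "cfg" is backed by the stream "KV_cfg" with subjects "$KV.cfg.>".
// Bucket and key names are placeholders.
package main

import (
	"fmt"
	"log"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()

	js, err := nc.JetStream()
	if err != nil {
		log.Fatal(err)
	}

	// KV read: may be served via direct get by any peer that thinks it is caught up.
	kv, err := js.KeyValue("cfg")
	if err != nil {
		log.Fatal(err)
	}
	entry, err := kv.Get("some.key")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("kv get: rev=%d value=%q\n", entry.Revision(), entry.Value())

	// Stream-level read of the same key, going through the JetStream API
	// rather than the direct-get path.
	raw, err := js.GetLastMsg("KV_cfg", "$KV.cfg.some.key")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("stream get: seq=%d value=%q\n", raw.Sequence, raw.Data)
}
```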
One thing that you could try too is to bump the `initialDelaySeconds` to be longer during your rollouts:

```yaml
liveness:
  enabled: true
  initialDelaySeconds: 100
```

https://github.com/nats-io/k8s/blob/f87652b54cbcbee6c32f7a3392982f4f690f0dc2/helm/charts/nats/values.yaml#L51C7-L51C26
We will try the change to `initialDelaySeconds` for the next restart next week; we currently want to restart as few times as possible, as only the production system currently has the load where we can monitor this with good data. We are trying to set up another test instance where we can reproduce this behavior, but until that is finished we have to wait.
Just tested `liveness.initialDelaySeconds` set to 120, but it didn't help; the same applies to `readiness.initialDelaySeconds`.
One thing to note: the state before the test was good, meaning we didn't have any issues, as the NATS cluster seemed to have healed itself this morning when we made a rolling update to the application which uses NATS.
I've tried to reproduce the behavior with a small Go client, but didn't succeed. We can reproduce the behavior with our full Java application in a dev environment, but couldn't isolate the root cause yet.
We have now seen multiple times that the NATS cluster synchronized itself after our application was restarted. Is there anything (open connections, special commands) which can prevent the cluster from synchronizing? Especially in the Java library?
I've seen that in the Helm chart the service has `publishNotReadyAddresses: true` set. Afterwards I found your blog post about the new Helm chart release, which explicitly mentions this flag: https://nats.io/blog/nats-helm-chart-1.0.0-rc/
Could this be the reason for this behavior? Since we didn't create an additional service, our clients probably connected to a not-yet-ready NATS pod through this service, which would also render `readiness.initialDelaySeconds` useless.
We'll try to upgrade to the latest chart, but this will take some time as we need to thoroughly test the changes before going live.
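In the meantime, a rough idea for the client side (a sketch with placeholder URLs, not our actual configuration): list the individual server addresses explicitly and rely on the client's reconnect handling rather than on a single service address.

```go
// Sketch (placeholder URLs): list the individual server addresses explicitly
// and let the client retry/reconnect, so an initial connection attempt against
// a not-yet-ready pod does not strand the client.
package main

import (
	"log"
	"time"

	"github.com/nats-io/nats.go"
)

func main() {
	urls := "nats://nats-0.nats.default.svc:4222,nats://nats-1.nats.default.svc:4222,nats://nats-2.nats.default.svc:4222"
	nc, err := nats.Connect(urls,
		nats.RetryOnFailedConnect(true),   // keep trying if the first pick is not reachable yet
		nats.MaxReconnects(-1),            // reconnect forever
		nats.ReconnectWait(2*time.Second), // back off between attempts
	)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()
	log.Printf("connected to %s", nc.ConnectedUrl())
}
```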
Keep us posted, you should not be losing any messages of course.
We appear to have come across a very similar situation, in which an upgrade to NATS 2.10.9 has resulted in a loss of the complete history of messages on two R3 streams. Initial investigations suggest this is due to the server nodes being killed during recovery by a failing health check, but we're yet to determine a minimal reproduction (minus Nomad and CSI plugins for persistent disk storage in GCP) which demonstrates the deletion of the message block files in this case.
Any thoughts on additional avenues to explore or whether this sounds like a plausible explanation would be greatly appreciated.
@mpwalkerdine were you using interest streams?
Hi @wallyqs - no, permanent streams with durable consumers. A couple of potentially interesting things appeared in our logs around this time too: `Resetting WAL state` and `Stream state too short (0 bytes)` :thinking:
A quick update for any future readers - we've narrowed the issue down somewhat and it appears to be related to the NATS disks being pulled at inappropriate times by the CSI plugin, leading to corruption and to the log line `Detected orphaned stream '[ACC] > [NAME]' will cleanup` during the next startup, which ultimately causes that node to delete its message storage on startup.
@wallyqs FYI it looks like one or more of the fixes in 2.10.10 have resolved our issue - I can no longer reproduce the deletion-on-startup issue, there are no orphaned stream logs, and no propagation of that empty state to the other nodes in the cluster :crossed_fingers:
@mpwalkerdine thanks for the update!
Looks like I spoke too soon - more testing shows that one of the nodes has lost its state on disk, and if it becomes a cluster leader that stream reports 0 messages. The files on disk are indeed empty:
```
-rw-r----- 1 root root  0 Feb  7 09:33 1.blk
-rw------- 1 root root 34 Feb  7 11:53 index.db
```
```
Information for Stream REDACTED created 2024-02-06 17:41:58

  Description: REDACTED
  Subjects: REDACTED.>
  Replicas: 3
  Storage: File

Options:

  Retention: Limits
  Acknowledgements: true
  Discard Policy: Old
  Duplicate Window: 2h0m0s
  Allows Msg Delete: true
  Allows Purge: true
  Allows Rollups: false

Limits:

  Maximum Messages: unlimited
  Maximum Per Subject: unlimited
  Maximum Bytes: 9.3 GiB
  Maximum Age: unlimited
  Maximum Message Size: unlimited
  Maximum Consumers: unlimited

Cluster Information:

  Name: nats
  Leader: nats-node-2
  Replica: nats-node-0, current, seen 0.53s ago
  Replica: nats-node-1, current, seen 0.53s ago

State:

  Messages: 0
  Bytes: 0 B
  FirstSeq: 6,915
  LastSeq: 6,914
  Active Consumers: 362
```
For me, the issue seems to be solved with the latest version, 2.10.11. We are not able to reproduce it anymore, so it could be closed.
@mpwalkerdine Could you let us know if 2.10.11 fixed it for you too?
I'll retest against 2.10.11 and report back. Our issue seems to be caused by some pretty unreasonable behaviour in our CSI disk management, so it is perhaps not unexpected for NATS to take issue with that. I did expect the faulty node to recover from the rest of the cluster though.
@Jarema the behaviour has changed, although there still seems to be a discrepancy between nodes when they come back up:
Healthy node as leader:

```
State:

  Messages: 789
  Bytes: 5.0 MiB
  FirstSeq: 1 @ 2024-02-21T13:29:29 UTC
  LastSeq: 789 @ 2024-02-21T13:43:32 UTC
  Active Consumers: 321
  Number of Subjects: 95
```

Faulty node as leader:

```
State:

  Messages: 543
  Bytes: 2.4 MiB
  FirstSeq: 247 @ 2024-02-21T13:42:58 UTC
  LastSeq: 789 @ 2024-02-21T13:43:32 UTC
  Active Consumers: 321
  Number of Subjects: 3
```
So they're no longer losing all their state, just some of it.
Pertinent logs appear to be:

- `[warn] Detected orphaned stream '$G > REDACTED', will cleanup`
- `[warn] Advisory could not be sent for account "$G": system account not setup`
Another quick update - we've been running 2.10.11 in production for a week or so now, and although the state no longer appears to be lost completely, publishing still breaks when messages are sent to a node with bad state. Our workaround for this is to scale down to a healthy node and up again and to recreate all the consumers (we've got separate idempotency protection).
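For reference, a rough sketch of the consumer-recreation step (stream, consumer, and subject names are placeholders, not our actual setup):

```go
// Sketch: recreate a durable pull consumer after the stream is healthy again.
// AddConsumer succeeds if an identical consumer already exists, so this is
// safe to re-run. All names below are placeholders.
package main

import (
	"log"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()

	js, err := nc.JetStream()
	if err != nil {
		log.Fatal(err)
	}

	// Recreate the durable consumer on the (placeholder) stream.
	_, err = js.AddConsumer("REDACTED", &nats.ConsumerConfig{
		Durable:       "my-durable",
		FilterSubject: "REDACTED.events",
		AckPolicy:     nats.AckExplicitPolicy,
	})
	if err != nil {
		log.Fatal(err)
	}
	log.Print("consumer recreated")
}
```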