Can't keep JetStream cluster working during a rolling upgrade from 2.5.0 to 2.7.4
Versions of nats-server and affected client libraries used:
- nats-server 2.7.4
- nats helm chart 0.14.2
OS/Container environment:
- K8s
Steps or code to reproduce the issue:
- Existing nats 2.5.0 cluster (5 instances) with JetStream enabled
- Configure podManagementPolicy: OrderedReady (https://github.com/nats-io/k8s/tree/main/helm/charts/nats#breaking-change-log) for compatibility with the old helm chart deployment
- Execute a rolling upgrade to 2.7.4. After the ordinal 4 and ordinal 3 pods are upgraded, the ordinal 2, 1, and 0 pods (the 3 pods still on the old version 2.5.0) start panicking with the error logs below and the upgrade cannot continue. The NATS cluster cannot form a JetStream meta group because only two servers remain alive.
[1652] 2022/03/21 02:49:11.632347 [INF] Listening for route connections on 0.0.0.0:6222
[1652] 2022/03/21 02:49:11.633565 [INF] …:6222 - rid:225 - Route connection created
[1652] 2022/03/21 02:49:11.633658 [INF] …:6222 - rid:226 - Route connection created
[1652] 2022/03/21 02:49:11.633736 [INF] …:6222 - rid:227 - Route connection created
[1652] 2022/03/21 02:49:11.633760 [INF] …:6222 - rid:228 - Route connection created
panic: JetStream Cluster Unknown group entry op type! 12
goroutine 284 [running]:
github.com/nats-io/nats-server/server.(*jetStream).applyConsumerEntries(0xc0002d4000, 0xc000445c00, 0xc001ef6940, 0x0, 0x5, 0x1)
/home/travis/gopath/src/github.com/nats-io/nats-server/server/jetstream_cluster.go:2948 +0x5e7
github.com/nats-io/nats-server/server.(*jetStream).monitorConsumer(0xc0002d4000, 0xc000445c00, 0xc000cbb0e0)
/home/travis/gopath/src/github.com/nats-io/nats-server/server/jetstream_cluster.go:2874 +0x65f
github.com/nats-io/nats-server/server.(*jetStream).processClusterCreateConsumer.func1()
/home/travis/gopath/src/github.com/nats-io/nats-server/server/jetstream_cluster.go:2691 +0x3c
created by github.com/nats-io/nats-server/server.(*Server).startGoRoutine
/home/travis/gopath/src/github.com/nats-io/nats-server/server/server.go:2867 +0xc5
About the issue, please let me know whether I need to provide more information to clarify the root cause, thanks. We found that 2.7.4 has more entryOp types (https://github.com/nats-io/nats-server/blob/v2.7.4/server/jetstream_cluster.go#L87-L88) than 2.5.0 (https://github.com/nats-io/nats-server/blob/v2.5.0/server/jetstream_cluster.go#L81). Is this why the old and new versions of nats-server cannot work at the same time? Thanks.
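As an illustration of why extra entryOp values would crash an older server, here is a minimal sketch. The op names and values are hypothetical stand-ins, not the real constants from jetstream_cluster.go; only the panic message is taken from the logs above. A 2.5.0 apply loop that receives a replicated entry carrying an op appended in 2.7.4 hits the default branch of its switch and panics:

```go
package main

import "fmt"

// entryOp mirrors the idea of the server's group entry op type.
// The constants below are illustrative assumptions, not the real set.
type entryOp byte

const (
	streamMsgOp entryOp = iota // known to both 2.5.0 and 2.7.4
	deleteMsgOp                // known to both 2.5.0 and 2.7.4
	// Ops appended in later releases (e.g. the value 12 seen in the
	// panic above) are unknown to an old server's switch statement.
)

// applyEntryOld mimics an old server's apply loop: an unrecognized op
// falls through to a panic, which in the real server kills the process.
// Here we recover so the sketch can demonstrate the failure as an error.
func applyEntryOld(op entryOp) (err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("recovered: %v", r)
		}
	}()
	switch op {
	case streamMsgOp, deleteMsgOp:
		return nil // apply the entry normally
	default:
		panic(fmt.Sprintf("JetStream Cluster Unknown group entry op type! %d", op))
	}
}

func main() {
	fmt.Println(applyEntryOld(streamMsgOp)) // known op applies cleanly
	fmt.Println(applyEntryOld(entryOp(12))) // unknown op: the real server panics here
}
```

This is only a model of the symptom; whether the fix belongs in the sender (gating by peer version) or the receiver (tolerating unknown ops) is for the maintainers to say.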
@phho could you provide the values files given to helm install of 2.5.0 and helm upgrade of 2.7.4?
I like the way this issue is being reported.
Based on your description, it seems that the first pod upgraded to 2.7.4 kills the not-yet-upgraded pods with the newly added message type PendingRequests. Interestingly, according to the code, NATS carefully handles the situation where a cluster consists of nodes running different versions: it checks a peer's version before sending out messages that may be incompatible, assuming the peer will soon be upgraded to the new version and will then understand the new message.
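That kind of version gate could look roughly like the sketch below. All names and the cutoff version here are assumptions for illustration, not the server's actual API; the idea is simply that a newer node withholds entry types a peer's advertised version cannot decode yet:

```go
package main

import "fmt"

// peerInfo is a hypothetical stand-in for the version a peer advertises
// during route negotiation.
type peerInfo struct {
	name   string
	major  int
	minor  int
	update int
}

// versionAtLeast reports whether the peer's advertised version is at
// least the given release.
func versionAtLeast(p peerInfo, major, minor, update int) bool {
	if p.major != major {
		return p.major > major
	}
	if p.minor != minor {
		return p.minor > minor
	}
	return p.update >= update
}

func main() {
	old := peerInfo{"nats-0", 2, 5, 0}
	cur := peerInfo{"nats-4", 2, 7, 4}
	// Assume (for this sketch) the new op type shipped in 2.7.0:
	// only propose it to peers advertising >= 2.7.0.
	fmt.Println(versionAtLeast(old, 2, 7, 0)) // false: hold the new op back
	fmt.Println(versionAtLeast(cur, 2, 7, 0)) // true: safe to send
}
```

If such a gate is missed for one message type, the rolling upgrade fails in exactly the way reported: old nodes receive an entry they cannot decode and panic.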
Latest server release is 2.8.4.
I upgraded NATS from 2.8.2 to 2.8.4 and experienced 5 minutes of downtime. The NATS log is as follows:
[nats-1 nats] [7] 2022/06/15 16:03:05.149963 [ERR] JetStream failed to store a msg on stream 'ADKTKO2PZJ5J2XSDTEZXHLOF6DLQVWPRAAXA5SY3EP7ACRXDGK6KKMFG > HOGE': expected sequence does not match store
[nats-0 nats] [9] 2022/06/15 16:04:05.164590 [INF] JetStream cluster new stream leader for 'ADKTKO2PZJ5J2XSDTEZXHLOF6DLQVWPRAAXA5SY3EP7ACRXDGK6KKMFG > HOGE'
[nats-1 nats] [7] 2022/06/15 16:04:05.153333 [WRN] Error applying entries to 'ADKTKO2PZJ5J2XSDTEZXHLOF6DLQVWPRAAXA5SY3EP7ACRXDGK6KKMFG > HOGE': last sequence mismatch
[nats-1 nats] [7] 2022/06/15 16:04:05.188851 [WRN] Resetting stream cluster state for 'ADKTKO2PZJ5J2XSDTEZXHLOF6DLQVWPRAAXA5SY3EP7ACRXDGK6KKMFG > HOGE'
[nats-0 nats] [9] 2022/06/15 16:04:06.224973 [WRN] Error loading message for catchup 'ADKTKO2PZJ5J2XSDTEZXHLOF6DLQVWPRAAXA5SY3EP7ACRXDGK6KKMFG > HOGE': malformed or corrupt message
[nats-2 nats] [7] 2022/06/15 16:04:09.543781 [INF] JetStream cluster new consumer leader for 'ADKTKO2PZJ5J2XSDTEZXHLOF6DLQVWPRAAXA5SY3EP7ACRXDGK6KKMFG > HOGE > FUGA'
[nats-1 nats] [7] 2022/06/15 16:04:09.716224 [INF] JetStream cluster new consumer leader for 'ADKTKO2PZJ5J2XSDTEZXHLOF6DLQVWPRAAXA5SY3EP7ACRXDGK6KKMFG > HOGE > FOO'
[nats-1 nats] [7] 2022/06/15 16:04:16.167494 [WRN] Catchup for stream 'ADKTKO2PZJ5J2XSDTEZXHLOF6DLQVWPRAAXA5SY3EP7ACRXDGK6KKMFG > HOGE' stalled
[nats-0 nats] [9] 2022/06/15 16:04:16.167943 [WRN] Error loading message for catchup 'ADKTKO2PZJ5J2XSDTEZXHLOF6DLQVWPRAAXA5SY3EP7ACRXDGK6KKMFG > HOGE': stream store EOF
[nats-0 nats] [9] 2022/06/15 16:04:26.168326 [WRN] Error loading message for catchup 'ADKTKO2PZJ5J2XSDTEZXHLOF6DLQVWPRAAXA5SY3EP7ACRXDGK6KKMFG > HOGE': stream store EOF
[nats-1 nats] [7] 2022/06/15 16:04:26.167825 [WRN] Catchup for stream 'ADKTKO2PZJ5J2XSDTEZXHLOF6DLQVWPRAAXA5SY3EP7ACRXDGK6KKMFG > HOGE' stalled
[nats-1 nats] [7] 2022/06/15 16:04:36.168886 [WRN] Catchup for stream 'ADKTKO2PZJ5J2XSDTEZXHLOF6DLQVWPRAAXA5SY3EP7ACRXDGK6KKMFG > HOGE' stalled
[nats-0 nats] [9] 2022/06/15 16:04:36.170278 [WRN] Error loading message for catchup 'ADKTKO2PZJ5J2XSDTEZXHLOF6DLQVWPRAAXA5SY3EP7ACRXDGK6KKMFG > HOGE': stream store EOF
[nats-1 nats] [7] 2022/06/15 16:04:46.168983 [WRN] Catchup for stream 'ADKTKO2PZJ5J2XSDTEZXHLOF6DLQVWPRAAXA5SY3EP7ACRXDGK6KKMFG > HOGE' stalled
[nats-0 nats] [9] 2022/06/15 16:04:46.169485 [WRN] Error loading message for catchup 'ADKTKO2PZJ5J2XSDTEZXHLOF6DLQVWPRAAXA5SY3EP7ACRXDGK6KKMFG > HOGE': stream store EOF
[nats-1 nats] [7] 2022/06/15 16:04:56.169085 [WRN] Catchup for stream 'ADKTKO2PZJ5J2XSDTEZXHLOF6DLQVWPRAAXA5SY3EP7ACRXDGK6KKMFG > HOGE' stalled
[nats-0 nats] [9] 2022/06/15 16:04:56.169566 [WRN] Error loading message for catchup 'ADKTKO2PZJ5J2XSDTEZXHLOF6DLQVWPRAAXA5SY3EP7ACRXDGK6KKMFG > HOGE': stream store EOF
[nats-0 nats] [9] 2022/06/15 16:05:05.155487 [ERR] JetStream failed to store a msg on stream 'ADKTKO2PZJ5J2XSDTEZXHLOF6DLQVWPRAAXA5SY3EP7ACRXDGK6KKMFG > HOGE': expected sequence does not match store
[nats-0 nats] [9] 2022/06/15 16:05:06.169968 [WRN] Error loading message for catchup 'ADKTKO2PZJ5J2XSDTEZXHLOF6DLQVWPRAAXA5SY3EP7ACRXDGK6KKMFG > HOGE': stream store EOF
[nats-1 nats] [7] 2022/06/15 16:05:06.169579 [WRN] Catchup for stream 'ADKTKO2PZJ5J2XSDTEZXHLOF6DLQVWPRAAXA5SY3EP7ACRXDGK6KKMFG > HOGE' stalled
[nats-1 nats] [7] 2022/06/15 16:05:16.170553 [WRN] Catchup for stream 'ADKTKO2PZJ5J2XSDTEZXHLOF6DLQVWPRAAXA5SY3EP7ACRXDGK6KKMFG > HOGE' stalled
[nats-0 nats] [9] 2022/06/15 16:05:16.171853 [WRN] Error loading message for catchup 'ADKTKO2PZJ5J2XSDTEZXHLOF6DLQVWPRAAXA5SY3EP7ACRXDGK6KKMFG > HOGE': stream store EOF
[nats-1 nats] [7] 2022/06/15 16:05:26.172215 [WRN] Catchup for stream 'ADKTKO2PZJ5J2XSDTEZXHLOF6DLQVWPRAAXA5SY3EP7ACRXDGK6KKMFG > HOGE' stalled
[nats-0 nats] [9] 2022/06/15 16:05:26.174616 [WRN] Error loading message for catchup 'ADKTKO2PZJ5J2XSDTEZXHLOF6DLQVWPRAAXA5SY3EP7ACRXDGK6KKMFG > HOGE': stream store EOF
[nats-1 nats] [7] 2022/06/15 16:05:36.172942 [WRN] Catchup for stream 'ADKTKO2PZJ5J2XSDTEZXHLOF6DLQVWPRAAXA5SY3EP7ACRXDGK6KKMFG > HOGE' stalled
[nats-0 nats] [9] 2022/06/15 16:05:36.174129 [WRN] Error loading message for catchup 'ADKTKO2PZJ5J2XSDTEZXHLOF6DLQVWPRAAXA5SY3EP7ACRXDGK6KKMFG > HOGE': stream store EOF
[nats-0 nats] [9] 2022/06/15 16:05:46.174158 [WRN] Error loading message for catchup 'ADKTKO2PZJ5J2XSDTEZXHLOF6DLQVWPRAAXA5SY3EP7ACRXDGK6KKMFG > HOGE': stream store EOF
[nats-1 nats] [7] 2022/06/15 16:05:46.173607 [WRN] Catchup for stream 'ADKTKO2PZJ5J2XSDTEZXHLOF6DLQVWPRAAXA5SY3EP7ACRXDGK6KKMFG > HOGE' stalled
[nats-1 nats] [7] 2022/06/15 16:05:56.173945 [WRN] Catchup for stream 'ADKTKO2PZJ5J2XSDTEZXHLOF6DLQVWPRAAXA5SY3EP7ACRXDGK6KKMFG > HOGE' stalled
[nats-0 nats] [9] 2022/06/15 16:05:56.174407 [WRN] Error loading message for catchup 'ADKTKO2PZJ5J2XSDTEZXHLOF6DLQVWPRAAXA5SY3EP7ACRXDGK6KKMFG > HOGE': stream store EOF
[nats-0 nats] [9] 2022/06/15 16:06:05.158048 [WRN] Error applying entries to 'ADKTKO2PZJ5J2XSDTEZXHLOF6DLQVWPRAAXA5SY3EP7ACRXDGK6KKMFG > HOGE': last sequence mismatch
[nats-0 nats] [9] 2022/06/15 16:06:05.166286 [WRN] Resetting stream cluster state for 'ADKTKO2PZJ5J2XSDTEZXHLOF6DLQVWPRAAXA5SY3EP7ACRXDGK6KKMFG > HOGE'
[nats-1 nats] [7] 2022/06/15 16:06:06.174356 [WRN] Catchup for stream 'ADKTKO2PZJ5J2XSDTEZXHLOF6DLQVWPRAAXA5SY3EP7ACRXDGK6KKMFG > HOGE' stalled
[nats-1 nats] [7] 2022/06/15 16:06:16.174900 [WRN] Catchup for stream 'ADKTKO2PZJ5J2XSDTEZXHLOF6DLQVWPRAAXA5SY3EP7ACRXDGK6KKMFG > HOGE' stalled
[nats-2 nats] [7] 2022/06/15 16:06:17.044857 [INF] JetStream cluster new stream leader for 'ADKTKO2PZJ5J2XSDTEZXHLOF6DLQVWPRAAXA5SY3EP7ACRXDGK6KKMFG > HOGE'
[nats-2 nats] [7] 2022/06/15 16:06:17.048759 [INF] Catchup for stream 'ADKTKO2PZJ5J2XSDTEZXHLOF6DLQVWPRAAXA5SY3EP7ACRXDGK6KKMFG > HOGE' complete
I have NATS installed with helm and my values.yaml is as follows:
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - podAffinityTerm:
          labelSelector:
            matchLabels:
              app.kubernetes.io/instance: nats
              app.kubernetes.io/name: nats
          topologyKey: topology.kubernetes.io/zone
        weight: 100
      - podAffinityTerm:
          labelSelector:
            matchLabels:
              app.kubernetes.io/instance: nats
              app.kubernetes.io/name: nats
          topologyKey: kubernetes.io/hostname
        weight: 1
auth:
  enabled: true
  resolver:
    allowDelete: true
    interval: 1m
    operator: foo
    resolverPreload:
      AASU727EDXEETDBIV4GLBLAHR75RIDPGSTDO5WYKHBJRM6G2U2V3BWWR: bar
    store:
      dir: /accounts/jwt
      size: 1Gi
    systemAccount: fuga
    type: full
cluster:
  enabled: true
  noAdvertise: true
  replicas: 3
nats:
  image: nats:2.8.4-alpine
  jetstream:
    domain: hoge
    enabled: true
    encryption:
      key: hogehoge
    fileStorage:
      accessModes:
        - ReadWriteOnce
      enabled: true
      size: 10Gi
      storageDirectory: /data
    memStorage:
      enabled: false
podDisruptionBudget:
  maxUnavailable: 0
  minAvailable: 2
websocket:
  enabled: true
  noTLS: true
  port: 8080
We believe this has been greatly improved in the latest versions of the server, 2.9.10 / 2.9.11, and the newest operator.
Feel free to re-open however if needed.
@derekcollison I'm experiencing a similar issue on 2.9.11.
[1340] 2023/01/16 09:41:50.801208 [INF] Starting nats-server
[1340] 2023/01/16 09:41:50.801723 [INF] Version: 2.9.11
[1340] 2023/01/16 09:41:50.802288 [INF] Git: [23ffc16]
[1340] 2023/01/16 09:41:50.802812 [INF] Cluster: nats-cluster-dev
[1340] 2023/01/16 09:41:50.803319 [INF] Name: DKCPH-VLCDNAT1
[1340] 2023/01/16 09:41:50.803443 [INF] Node: MTG7nqV6
[1340] 2023/01/16 09:41:50.803971 [INF] ID: NC5HMVWQHIEEZIDQQ2YXRWC4BUGULQJ43A6R6DLY3QR5JFMGALO7NHAE
[1340] 2023/01/16 09:41:50.803971 [WRN] Plaintext passwords detected, use nkeys or bcrypt
[1340] 2023/01/16 09:41:50.804494 [INF] Using configuration file: d:\nats-server\config\server.conf
[1340] 2023/01/16 09:41:50.812024 [INF] Starting http monitor on DKCPH-VLCDNAT1:8222
[1340] 2023/01/16 09:41:50.812939 [INF] Starting JetStream
[1340] 2023/01/16 09:41:50.813466 [INF]     _ ___ _____ ___ _____ ___ ___ _ __ __
[1340] 2023/01/16 09:41:50.814620 [INF]  _ | | __|_ _/ __|_ _| _ \ __| /_\ | \/ |
[1340] 2023/01/16 09:41:50.815130 [INF] | || | _| | | \__ \ | | | / _| / _ \| |\/| |
[1340] 2023/01/16 09:41:50.815186 [INF]  \__/|___| |_| |___/ |_| |_|_\___/_/ \_\_| |_|
[1340] 2023/01/16 09:41:50.815716 [INF]
[1340] 2023/01/16 09:41:50.815716 [INF]          https://docs.nats.io/jetstream
[1340] 2023/01/16 09:41:50.816261 [INF]
[1340] 2023/01/16 09:41:50.816792 [INF] ---------------- JETSTREAM ----------------
[1340] 2023/01/16 09:41:50.816792 [INF]   Max Memory:      3.00 GB
[1340] 2023/01/16 09:41:50.817316 [INF]   Max Storage:     30.00 GB
[1340] 2023/01/16 09:41:50.817316 [INF]   Store Directory: "d:\nats-server\data\jetstream"
[1340] 2023/01/16 09:41:50.817912 [INF] -------------------------------------------
[1340] 2023/01/16 09:41:50.826315 [INF] Starting JetStream cluster
[1340] 2023/01/16 09:41:50.826830 [INF] Creating JetStream metadata controller
[1340] 2023/01/16 09:41:50.827541 [INF] JetStream cluster recovering state
[1340] 2023/01/16 09:41:50.828575 [INF] Listening for client connections on DKCPH-VLCDNAT1:4222
[1340] 2023/01/16 09:41:50.830395 [INF] Server is ready
[1340] 2023/01/16 09:41:50.830395 [INF] Cluster name is nats-cluster-dev
[1340] 2023/01/16 09:41:50.830995 [INF] Listening for route connections on DKCPH-VLCDNAT1:6222
[1340] 2023/01/16 09:41:50.834923 [INF] 10.110.53.112:6222 - rid:57 - Route connection created
[1340] 2023/01/16 09:41:50.837936 [INF] 10.110.53.113:6222 - rid:63 - Route connection created
[1340] 2023/01/16 09:41:50.945865 [WRN] Waiting for routing to be established...
[1340] 2023/01/16 09:41:51.083640 [INF] 10.110.53.112:50914 - rid:87 - Route connection created
[1340] 2023/01/16 09:41:51.086342 [INF] 10.110.53.112:50914 - rid:87 - Router connection closed: Duplicate Route
[1340] 2023/01/16 09:41:51.104360 [INF] 10.110.53.113:56504 - rid:88 - Route connection created
[1340] 2023/01/16 09:41:51.104360 [INF] 10.110.53.113:56504 - rid:88 - Router connection closed: Duplicate Route
[1340] 2023/01/16 09:41:51.349264 [WRN] Error applying entries to 'DEV32 > cache-updates': invalid character '\x16' looking for beginning of value
panic: JetStream Cluster Unknown group entry op type!
goroutine 73 [running]:
github.com/nats-io/nats-server/v2/server.(*jetStream).applyStreamEntries(0xc00030e000, 0xc00044e000, 0x0?, 0x0)
/home/travis/gopath/src/github.com/nats-io/nats-server/server/jetstream_cluster.go:2464 +0x1e1a
github.com/nats-io/nats-server/v2/server.(*jetStream).monitorStream(0xc00030e000, 0xc00044e000, 0xc0002a8f30, 0x0)
/home/travis/gopath/src/github.com/nats-io/nats-server/server/jetstream_cluster.go:1902 +0x26e5
github.com/nats-io/nats-server/v2/server.(*jetStream).processClusterCreateStream.func1()
/home/travis/gopath/src/github.com/nats-io/nats-server/server/jetstream_cluster.go:3132 +0x2a
created by github.com/nats-io/nats-server/v2/server.(*Server).startGoRoutine
/home/travis/gopath/src/github.com/nats-io/nats-server/server/server.go:3083 +0x85
Additional info: we have a 3-server cluster running on Windows with multiple accounts (multi-tenant), one per dev environment; the streams are memory streams. I haven't been able to figure out the root cause of the problem. It usually happens after doing something like this:
- Stop one of the nats-server processes (i.e. stopping a windows service)
- Restart the service so the nats-server node can rejoin the cluster
Sometimes after doing this there's a problem, sometimes not. Sometimes the problem goes away after several restart attempts, sometimes not. Is it a race condition, or is the local data corrupt? Did I do something wrong when stopping the service? Deleting the local state/data folder does not fix the problem, so basically I can't rejoin a server to the cluster after the problem occurs.
Fixing the problem usually entails removing the offending streams from one or more accounts (note that it's usually not all accounts). In the example below, I would delete the cache-updates stream from the dev22 account. After deleting the stream, a new account sometimes develops the problem, so I delete the stream on that one as well, and so forth, until the server is able to join again.
If you have any ideas what I can do to help you debug the problem, let me know.
Can you share the following?
As a system user, and using a recent version of the nats cli.
nats server ls
nats server report jetstream
And as a user for the DEV32 account.
nats stream ls -a
nats stream info cache-updates
Thanks.
Sure, this time it was dev24 that acted up.
# nats server ls
╭───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ Server Overview │
├────────────────┬──────────────────┬────────────────┬─────────┬─────┬───────┬──────┬────────┬─────┬─────────┬───────┬───────┬──────┬────────────┬──────┤
│ Name │ Cluster │ Host │ Version │ JS │ Conns │ Subs │ Routes │ GWs │ Mem │ CPU % │ Cores │ Slow │ Uptime │ RTT │
├────────────────┼──────────────────┼────────────────┼─────────┼─────┼───────┼──────┼────────┼─────┼─────────┼───────┼───────┼──────┼────────────┼──────┤
│ DKCPH-VLCDNAT3 │ nats-cluster-dev │ DKCPH-VLCDNAT3 │ 2.9.11 │ yes │ 7 │ 436 │ 1 │ 0 │ 54 MiB │ 10 │ 2 │ 0 │ 16h22m48s │ 8ms │
│ DKCPH-VLCDNAT2 │ nats-cluster-dev │ DKCPH-VLCDNAT2 │ 2.9.11 │ yes │ 7 │ 436 │ 1 │ 0 │ 83 MiB │ 4 │ 2 │ 0 │ 6d5h50m31s │ 12ms │
├────────────────┼──────────────────┼────────────────┼─────────┼─────┼───────┼──────┼────────┼─────┼─────────┼───────┼───────┼──────┼────────────┼──────┤
│ │ 1 │ 2 │ │ 2 │ 14 │ 872 │ │ │ 137 MiB │ │ │ 0 │ │ │
╰────────────────┴──────────────────┴────────────────┴─────────┴─────┴───────┴──────┴────────┴─────┴─────────┴───────┴───────┴──────┴────────────┴──────╯
╭─────────────────────────────────────────────────────────────────────────────────────╮
│ Cluster Overview │
├──────────────────┬────────────┬───────────────────┬───────────────────┬─────────────┤
│ Cluster │ Node Count │ Outgoing Gateways │ Incoming Gateways │ Connections │
├──────────────────┼────────────┼───────────────────┼───────────────────┼─────────────┤
│ nats-cluster-dev │ 2 │ 0 │ 0 │ 14 │
├──────────────────┼────────────┼───────────────────┼───────────────────┼─────────────┤
│ │ 2 │ 0 │ 0 │ 14 │
╰──────────────────┴────────────┴───────────────────┴───────────────────┴─────────────╯
# nats server report jetstream
╭────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ JetStream Summary │
├─────────────────┬──────────────────┬─────────┬───────────┬──────────┬─────────┬─────────┬──────┬─────────┬─────────┤
│ Server │ Cluster │ Streams │ Consumers │ Messages │ Bytes │ Memory │ File │ API Req │ API Err │
├─────────────────┼──────────────────┼─────────┼───────────┼──────────┼─────────┼─────────┼──────┼─────────┼─────────┤
│ DKCPH-VLCDNAT3 │ nats-cluster-dev │ 6 │ 6 │ 9,284 │ 3.2 MiB │ 3.2 MiB │ 0 B │ 22 │ 0 │
│ DKCPH-VLCDNAT2* │ nats-cluster-dev │ 6 │ 6 │ 9,284 │ 3.2 MiB │ 3.2 MiB │ 0 B │ 31,511 │ 31,321 │
├─────────────────┼──────────────────┼─────────┼───────────┼──────────┼─────────┼─────────┼──────┼─────────┼─────────┤
│ │ │ 12 │ 12 │ 18,568 │ 6.4 MiB │ 6.4 MiB │ 0 B │ 31,533 │ 31,321 │
╰─────────────────┴──────────────────┴─────────┴───────────┴──────────┴─────────┴─────────┴──────┴─────────┴─────────╯
╭──────────────────────────────────────────────────────────────────────╮
│ RAFT Meta Group Information │
├────────────────┬──────────┬────────┬─────────┬────────┬────────┬─────┤
│ Name │ ID │ Leader │ Current │ Online │ Active │ Lag │
├────────────────┼──────────┼────────┼─────────┼────────┼────────┼─────┤
│ DKCPH-VLCDNAT1 │ MTG7nqV6 │ │ false │ true │ 46.07s │ 0 │
│ DKCPH-VLCDNAT2 │ 9zdjoip0 │ yes │ true │ true │ 0.00s │ 0 │
│ DKCPH-VLCDNAT3 │ iNksFxp1 │ │ true │ true │ 0.07s │ 0 │
╰────────────────┴──────────┴────────┴─────────┴────────┴────────┴─────╯
# nats context select dev24
NATS Configuration Context "dev24"
Server URLs: nats://dkcph-vlcdnat1:4222,nats://dkcph-vlcdnat2:4222,nats://dkcph-vlcdnat3:4222
Username: dev24
Password: *********
Token: dev24
Path: C:\Users\jhans\.config\nats\context\dev24.json
# nats stream ls -a
╭───────────────────────────────────────────────────────────────────────────────────────╮
│ Streams │
├───────────────┬─────────────┬─────────────────────┬──────────┬─────────┬──────────────┤
│ Name │ Description │ Created │ Messages │ Size │ Last Message │
├───────────────┼─────────────┼─────────────────────┼──────────┼─────────┼──────────────┤
│ cache-updates │ │ 2023-01-16 11:30:48 │ 1,190 │ 417 KiB │ 2.99s │
╰───────────────┴─────────────┴─────────────────────┴──────────┴─────────┴──────────────╯
# nats stream info cache-updates
Information for Stream cache-updates created 2023-01-16 11:30:48
Subjects: cache.update.>
Replicas: 3
Storage: Memory
Options:
Retention: Limits
Acknowledgements: true
Discard Policy: Old
Duplicate Window: 2m0s
Allows Msg Delete: true
Allows Purge: true
Allows Rollups: false
Limits:
Maximum Messages: unlimited
Maximum Per Subject: unlimited
Maximum Bytes: unlimited
Maximum Age: 30m0s
Maximum Message Size: unlimited
Maximum Consumers: unlimited
Cluster Information:
Name: nats-cluster-dev
Leader: DKCPH-VLCDNAT3
Replica: DKCPH-VLCDNAT1, outdated, seen 2m28s ago, 99 operations behind
Replica: DKCPH-VLCDNAT2, current, seen 0.53s ago
State:
Messages: 1,190
Bytes: 417 KiB
FirstSeq: 17,516 @ 2023-01-16T18:13:39 UTC
LastSeq: 18,705 @ 2023-01-16T18:43:29 UTC
Active Consumers: 1
Number of Subjects: 2
Here's the output from the server startup:
[8156] 2023/01/16 19:51:52.679909 [INF] Starting nats-server
[8156] 2023/01/16 19:51:52.688240 [INF] Version: 2.9.11
[8156] 2023/01/16 19:51:52.688773 [INF] Git: [23ffc16]
[8156] 2023/01/16 19:51:52.690692 [INF] Cluster: nats-cluster-dev
[8156] 2023/01/16 19:51:52.692667 [INF] Name: DKCPH-VLCDNAT1
[8156] 2023/01/16 19:51:52.693224 [INF] Node: MTG7nqV6
[8156] 2023/01/16 19:51:52.693747 [INF] ID: NA5EQCP4QDSE6TRDDI33BBFW6DRSIM4HVVUF67WJ5KZLFGEXQTFUJ6H2
[8156] 2023/01/16 19:51:52.694268 [WRN] Plaintext passwords detected, use nkeys or bcrypt
[8156] 2023/01/16 19:51:52.694790 [INF] Using configuration file: d:\nats-server\config\server.conf
[8156] 2023/01/16 19:51:52.700736 [INF] Starting http monitor on DKCPH-VLCDNAT1:8222
[8156] 2023/01/16 19:51:52.701243 [INF] Starting JetStream
[8156] 2023/01/16 19:51:52.701911 [INF]     _ ___ _____ ___ _____ ___ ___ _ __ __
[8156] 2023/01/16 19:51:52.702301 [INF]  _ | | __|_ _/ __|_ _| _ \ __| /_\ | \/ |
[8156] 2023/01/16 19:51:52.702836 [INF] | || | _| | | \__ \ | | | / _| / _ \| |\/| |
[8156] 2023/01/16 19:51:52.703365 [INF]  \__/|___| |_| |___/ |_| |_|_\___/_/ \_\_| |_|
[8156] 2023/01/16 19:51:52.703893 [INF]
[8156] 2023/01/16 19:51:52.707820 [INF]          https://docs.nats.io/jetstream
[8156] 2023/01/16 19:51:52.711112 [INF]
[8156] 2023/01/16 19:51:52.711626 [INF] ---------------- JETSTREAM ----------------
[8156] 2023/01/16 19:51:52.712461 [INF]   Max Memory:      3.00 GB
[8156] 2023/01/16 19:51:52.712972 [INF]   Max Storage:     30.00 GB
[8156] 2023/01/16 19:51:52.713795 [INF]   Store Directory: "d:\nats-server\data\jetstream"
[8156] 2023/01/16 19:51:52.714305 [INF] -------------------------------------------
[8156] 2023/01/16 19:51:52.722670 [INF] Starting JetStream cluster
[8156] 2023/01/16 19:51:52.724497 [INF] Creating JetStream metadata controller
[8156] 2023/01/16 19:51:52.726441 [INF] JetStream cluster recovering state
[8156] 2023/01/16 19:51:52.728538 [INF] Listening for client connections on DKCPH-VLCDNAT1:4222
[8156] 2023/01/16 19:51:52.729895 [INF] Server is ready
[8156] 2023/01/16 19:51:52.730406 [INF] Cluster name is nats-cluster-dev
[8156] 2023/01/16 19:51:52.731279 [INF] Listening for route connections on DKCPH-VLCDNAT1:6222
[8156] 2023/01/16 19:51:52.738608 [INF] 10.110.53.112:6222 - rid:71 - Route connection created
[8156] 2023/01/16 19:51:52.747038 [INF] 10.110.53.113:6222 - rid:72 - Route connection created
[8156] 2023/01/16 19:51:52.846302 [WRN] Waiting for routing to be established...
[8156] 2023/01/16 19:51:53.025610 [INF] 10.110.53.113:49823 - rid:73 - Route connection created
[8156] 2023/01/16 19:51:53.031251 [INF] 10.110.53.113:49823 - rid:73 - Router connection closed: Duplicate Route
[8156] 2023/01/16 19:51:53.113164 [INF] 10.110.53.112:57498 - rid:74 - Route connection created
[8156] 2023/01/16 19:51:53.115043 [INF] 10.110.53.112:57498 - rid:74 - Router connection closed: Duplicate Route
[8156] 2023/01/16 19:51:53.282989 [WRN] Error applying entries to 'DEV24 > cache-updates': invalid character '\x16' looking for beginning of value
panic: JetStream Cluster Unknown group entry op type!
goroutine 93 [running]:
github.com/nats-io/nats-server/v2/server.(*jetStream).applyStreamEntries(0xc0002ac000, 0xc000079c00, 0x0?, 0x0)
/home/travis/gopath/src/github.com/nats-io/nats-server/server/jetstream_cluster.go:2464 +0x1e1a
github.com/nats-io/nats-server/v2/server.(*jetStream).monitorStream(0xc0002ac000, 0xc000079c00, 0xc0002f9290, 0x0)
/home/travis/gopath/src/github.com/nats-io/nats-server/server/jetstream_cluster.go:1902 +0x26e5
github.com/nats-io/nats-server/v2/server.(*jetStream).processClusterCreateStream.func1()
/home/travis/gopath/src/github.com/nats-io/nats-server/server/jetstream_cluster.go:3132 +0x2a
created by github.com/nats-io/nats-server/v2/server.(*Server).startGoRoutine
/home/travis/gopath/src/github.com/nats-io/nats-server/server/server.go:3083 +0x85
Here's the server info from before I took DKCPH-VLCDNAT1 down:
# nats server ls
╭────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ Server Overview │
├────────────────┬──────────────────┬────────────────┬─────────┬─────┬───────┬───────┬────────┬─────┬─────────┬───────┬───────┬──────┬────────────┬──────┤
│ Name │ Cluster │ Host │ Version │ JS │ Conns │ Subs │ Routes │ GWs │ Mem │ CPU % │ Cores │ Slow │ Uptime │ RTT │
├────────────────┼──────────────────┼────────────────┼─────────┼─────┼───────┼───────┼────────┼─────┼─────────┼───────┼───────┼──────┼────────────┼──────┤
│ DKCPH-VLCDNAT3 │ nats-cluster-dev │ DKCPH-VLCDNAT3 │ 2.9.11 │ yes │ 3 │ 554 │ 2 │ 0 │ 52 MiB │ 13 │ 2 │ 0 │ 16h14m45s │ 8ms │
│ DKCPH-VLCDNAT1 │ nats-cluster-dev │ DKCPH-VLCDNAT1 │ 2.9.11 │ yes │ 7 │ 558 │ 2 │ 0 │ 54 MiB │ 9 │ 2 │ 0 │ 9h4m31s │ 8ms │
│ DKCPH-VLCDNAT2 │ nats-cluster-dev │ DKCPH-VLCDNAT2 │ 2.9.11 │ yes │ 4 │ 554 │ 2 │ 0 │ 77 MiB │ 12 │ 2 │ 0 │ 6d5h42m29s │ 10ms │
├────────────────┼──────────────────┼────────────────┼─────────┼─────┼───────┼───────┼────────┼─────┼─────────┼───────┼───────┼──────┼────────────┼──────┤
│ │ 1 │ 3 │ │ 3 │ 14 │ 1,666 │ │ │ 182 MiB │ │ │ 0 │ │ │
╰────────────────┴──────────────────┴────────────────┴─────────┴─────┴───────┴───────┴────────┴─────┴─────────┴───────┴───────┴──────┴────────────┴──────╯
╭─────────────────────────────────────────────────────────────────────────────────────╮
│ Cluster Overview │
├──────────────────┬────────────┬───────────────────┬───────────────────┬─────────────┤
│ Cluster │ Node Count │ Outgoing Gateways │ Incoming Gateways │ Connections │
├──────────────────┼────────────┼───────────────────┼───────────────────┼─────────────┤
│ nats-cluster-dev │ 3 │ 0 │ 0 │ 14 │
├──────────────────┼────────────┼───────────────────┼───────────────────┼─────────────┤
│ │ 3 │ 0 │ 0 │ 14 │
╰──────────────────┴────────────┴───────────────────┴───────────────────┴─────────────╯
# nats server report jetstream
╭────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ JetStream Summary │
├─────────────────┬──────────────────┬─────────┬───────────┬──────────┬─────────┬─────────┬──────┬─────────┬─────────┤
│ Server │ Cluster │ Streams │ Consumers │ Messages │ Bytes │ Memory │ File │ API Req │ API Err │
├─────────────────┼──────────────────┼─────────┼───────────┼──────────┼─────────┼─────────┼──────┼─────────┼─────────┤
│ DKCPH-VLCDNAT3 │ nats-cluster-dev │ 6 │ 6 │ 9,336 │ 3.2 MiB │ 3.2 MiB │ 0 B │ 20 │ 0 │
│ DKCPH-VLCDNAT1 │ nats-cluster-dev │ 6 │ 6 │ 6,950 │ 2.3 MiB │ 2.3 MiB │ 0 B │ 22 │ 0 │
│ DKCPH-VLCDNAT2* │ nats-cluster-dev │ 6 │ 6 │ 9,336 │ 3.2 MiB │ 3.2 MiB │ 0 B │ 31,161 │ 30,973 │
├─────────────────┼──────────────────┼─────────┼───────────┼──────────┼─────────┼─────────┼──────┼─────────┼─────────┤
│ │ │ 18 │ 18 │ 25,622 │ 8.8 MiB │ 8.8 MiB │ 0 B │ 31,203 │ 30,973 │
╰─────────────────┴──────────────────┴─────────┴───────────┴──────────┴─────────┴─────────┴──────┴─────────┴─────────╯
╭──────────────────────────────────────────────────────────────────────╮
│ RAFT Meta Group Information │
├────────────────┬──────────┬────────┬─────────┬────────┬────────┬─────┤
│ Name │ ID │ Leader │ Current │ Online │ Active │ Lag │
├────────────────┼──────────┼────────┼─────────┼────────┼────────┼─────┤
│ DKCPH-VLCDNAT1 │ MTG7nqV6 │ │ true │ true │ 0.48s │ 0 │
│ DKCPH-VLCDNAT2 │ 9zdjoip0 │ yes │ true │ true │ 0.00s │ 0 │
│ DKCPH-VLCDNAT3 │ iNksFxp1 │ │ true │ true │ 0.48s │ 0 │
╰────────────────┴──────────┴────────┴─────────┴────────┴────────┴─────╯
I'm also sometimes getting this error on startup, mentioning corrupt state in a consumer snapshot:
[6736] 2023/01/16 19:50:07.100929 [INF] Starting nats-server
[6736] 2023/01/16 19:50:07.101451 [INF] Version: 2.9.11
[6736] 2023/01/16 19:50:07.102058 [INF] Git: [23ffc16]
[6736] 2023/01/16 19:50:07.102650 [INF] Cluster: nats-cluster-dev
[6736] 2023/01/16 19:50:07.103189 [INF] Name: DKCPH-VLCDNAT1
[6736] 2023/01/16 19:50:07.103724 [INF] Node: MTG7nqV6
[6736] 2023/01/16 19:50:07.104254 [INF] ID: NAVTCLEETMODB35IPM4TSBOVB5XF55ATNIZPJCC3IKM3CKNDYR2XVD4H
[6736] 2023/01/16 19:50:07.104777 [WRN] Plaintext passwords detected, use nkeys or bcrypt
[6736] 2023/01/16 19:50:07.105341 [INF] Using configuration file: d:\nats-server\config\server.conf
[6736] 2023/01/16 19:50:07.114807 [INF] Starting http monitor on DKCPH-VLCDNAT1:8222
[6736] 2023/01/16 19:50:07.115381 [INF] Starting JetStream
[6736] 2023/01/16 19:50:07.117983 [INF]     _ ___ _____ ___ _____ ___ ___ _ __ __
[6736] 2023/01/16 19:50:07.118494 [INF]  _ | | __|_ _/ __|_ _| _ \ __| /_\ | \/ |
[6736] 2023/01/16 19:50:07.120273 [INF] | || | _| | | \__ \ | | | / _| / _ \| |\/| |
[6736] 2023/01/16 19:50:07.121174 [INF]  \__/|___| |_| |___/ |_| |_|_\___/_/ \_\_| |_|
[6736] 2023/01/16 19:50:07.121687 [INF]
[6736] 2023/01/16 19:50:07.122539 [INF]          https://docs.nats.io/jetstream
[6736] 2023/01/16 19:50:07.123050 [INF]
[6736] 2023/01/16 19:50:07.123860 [INF] ---------------- JETSTREAM ----------------
[6736] 2023/01/16 19:50:07.124370 [INF]   Max Memory:      3.00 GB
[6736] 2023/01/16 19:50:07.125715 [INF]   Max Storage:     30.00 GB
[6736] 2023/01/16 19:50:07.128959 [INF]   Store Directory: "d:\nats-server\data\jetstream"
[6736] 2023/01/16 19:50:07.129473 [INF] -------------------------------------------
[6736] 2023/01/16 19:50:07.138807 [INF] Starting JetStream cluster
[6736] 2023/01/16 19:50:07.138807 [INF] Creating JetStream metadata controller
[6736] 2023/01/16 19:50:07.140170 [INF] JetStream cluster recovering state
[6736] 2023/01/16 19:50:07.140678 [INF] Listening for client connections on DKCPH-VLCDNAT1:4222
[6736] 2023/01/16 19:50:07.146230 [INF] Server is ready
[6736] 2023/01/16 19:50:07.150521 [INF] Cluster name is nats-cluster-dev
[6736] 2023/01/16 19:50:07.160689 [INF] Listening for route connections on DKCPH-VLCDNAT1:6222
[6736] 2023/01/16 19:50:07.166618 [INF] 10.110.53.113:6222 - rid:71 - Route connection created
[6736] 2023/01/16 19:50:07.167700 [INF] 10.110.53.112:6222 - rid:72 - Route connection created
[6736] 2023/01/16 19:50:07.264781 [WRN] Waiting for routing to be established...
[6736] 2023/01/16 19:50:07.301621 [INF] JetStream cluster new metadata leader: DKCPH-VLCDNAT2/nats-cluster-dev
[6736] 2023/01/16 19:50:07.304642 [ERR] JetStream cluster could not decode consumer snapshot for 'DEV24 > cache-updates > dkcph-vlcdapp24-velocityhost-consumer' [C-R3M-xRnu743F]
panic: corrupt state file
goroutine 50 [running]:
github.com/nats-io/nats-server/v2/server.(*jetStream).applyConsumerEntries(0xc0000fe000, 0xc0002fe480, 0x0?, 0x0)
/home/travis/gopath/src/github.com/nats-io/nats-server/server/jetstream_cluster.go:4110 +0x85a
github.com/nats-io/nats-server/v2/server.(*jetStream).monitorConsumer(0xc0000fe000, 0xc0002fe480, 0xc0002accf0)
/home/travis/gopath/src/github.com/nats-io/nats-server/server/jetstream_cluster.go:3989 +0xf76
github.com/nats-io/nats-server/v2/server.(*jetStream).processClusterCreateConsumer.func1()
/home/travis/gopath/src/github.com/nats-io/nats-server/server/jetstream_cluster.go:3699 +0x25
created by github.com/nats-io/nats-server/v2/server.(*Server).startGoRoutine
/home/travis/gopath/src/github.com/nats-io/nats-server/server/server.go:3083 +0x85
Note there were also a few other accounts that were malfunctioning, but not DEV19 this time.
All the accounts are set up exactly the same.
Why is server DKCPH-VLCDNAT1 offline?
It seems like the raft log has been corrupted.
Try scaling the cache-updates stream down to 1 replica and then back up, as the account user:
nats stream update cache-updates --replicas 1 -f
Then:
nats stream update cache-updates --replicas 3 -f
I took dkcph-vlcdnat1 offline intentionally (the same could happen during server patching or a rolling cluster update), and that is what provokes the problem. We initially found it during a rolling update, but it also occurs just by shutting down a node and bringing it back online.
I'm guessing this is a bug and not expected behavior? I mean, how does the raft log get corrupted during a normal shutdown/startup cycle of a node?
And another question: why does it suddenly work after retrying to restart the server some number of times?
I think upgrading to the latest 2.9.11 is the first order of business, which you have done; thank you.
Second, make sure you have a recent NATS CLI version and credentials for the system account with a system user.
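For reference, a system account can be declared directly in server.conf along these lines (a minimal sketch; the account name SYS and the user credentials are illustrative, and plaintext passwords should be replaced with bcrypt hashes or nkeys in practice):

```
# server.conf (sketch): define an account to act as the system account
# and point system_account at it. Names/credentials are illustrative.
accounts {
  SYS: {
    users: [
      { user: sys, password: "s3cret" }
    ]
  }
}
system_account: SYS
```

With that in place, commands such as nats server report jetstream can be run against the cluster using the system user's credentials.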
Once we are there we can engage a bit more.
I have the latest of both nats-server (2.9.11) and nats-cli (0.0.35).
The nats server report jetstream command needs a system account so I also have that in place. ;)
I guess it doesn't have anything to do with the cli client version, since the problem is showing during either nats-server shutdown or startup.
Here are some -DVV logs from nats-server: one where nats-server exits with "panic: corrupt state file" and one where it exits with "panic: JetStream Cluster Unknown group entry op type!" unknown-group-entry-op-type.log corrupt-state-file.log
After retrying to start the server on the same cluster multiple times, it finally succeeded. Here's the -DVV log from the successful startup, including a normal shutdown. succesful-startup.log
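The retry-until-success workaround described above could be scripted as a small wrapper (a sketch only; start_with_retries and the attempt count are illustrative and not part of any NATS tooling — this papers over the symptom, it does not fix the corruption):

```shell
# Hypothetical workaround (not part of NATS): retry a command up to N
# times, since startup sometimes only succeeds after several attempts.
start_with_retries() {
  attempts=$1
  shift
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      echo "started on attempt $i"
      return 0
    fi
    i=$((i + 1))
    sleep 1
  done
  echo "failed after $attempts attempts" >&2
  return 1
}

# Example usage (command and config path are illustrative):
#   start_with_retries 5 nats-server -c /etc/nats/server.conf
```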
Note that all of these logs were generated without changing anything in the cluster or locally. Sometimes nats-server fails to start up with various error messages, sometimes it succeeds.
I'll try to repro this on my local machine so it will be easier to debug. I suspect it has something to do with multi-tenancy, as it is not occurring as much on our other clusters with fewer tenants. But of course this is just speculation at this point.
The consumer snapshot seems to be corrupt.
Have you tried deleting that consumer and re-creating?
Yes, that doesn't work. The only thing that works is retrying to start the nats-server process; sometimes it succeeds.
Another thing I noticed when running nats server report jetstream after I close one of the nodes: the node is not listed as offline, just as not current. Why doesn't Raft realize that the server is offline?
Another thing I saw was that the other nodes in the cluster sometimes panic after I try to restart the node I stopped myself. That is very bad, as trying to recover a semi-functional cluster can bring it down entirely! :(
Questions I worry and wonder about:
- Why does it sometimes work to retry starting the process?
- Why does it only happen after a shutdown of a node in the cluster, and not while normal operation?
- If some state is indeed corrupt, how does that happen? Note that it happens consistently in our dev cluster even when I totally repave the cluster (i.e. stop all nodes, delete the data folders, restart all nodes). I'm unsure whether our accept and production clusters experience the same problems.
- It seems wrong that a consumer should be able to corrupt state in a way that prevents the cluster from recovering.
- It seems very bad that a startup of a node in a cluster is able to panic the other servers in said cluster.
Is it possible to pay for direct support from you guys? I really feel I'm in over my head. As our production system currently relies on our NATS cluster, I'm very worried that I will make things more unstable as I try to fix this issue. I realize it could be some stupid thing I did, or we somehow have a bad host, but I don't really know what to look for.
Yes, it doesn't work. The only thing that works is retrying to start the nats-server process; sometimes it succeeds.
Another thing I noticed when running nats server report jetstream after I close one of the nodes: the node is not listed as offline, just as not current. Why doesn't Raft realize that the server is offline?
Servers wait for some time before declaring another server down. This allows for temporary disruptions etc.
If the server is shut down gracefully, it will signal to the others that it is going offline, and that should be reported immediately.
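That graceful-shutdown signaling (lame duck mode) can be tuned in server.conf; a sketch with illustrative values (lame_duck_duration and lame_duck_grace_period are real nats-server options, with documented defaults of 2m and 10s):

```
# server.conf (sketch): lame duck settings used during graceful shutdown.
# Values here are illustrative.
lame_duck_duration: "2m"       # total time allowed to drain client connections
lame_duck_grace_period: "10s"  # delay before the server starts evicting clients
```

A server can be put into lame duck mode explicitly, e.g. via nats-server --signal ldm, rather than being killed outright.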
Another thing I saw was that the other nodes in the cluster sometimes panic after I try to restart the node I stopped myself. That is very bad, as trying to recover a semi-functional cluster can bring it down entirely! :(
Can you share what is printed out when the servers panic? And double-checking: you are running the current server, 2.9.11, yes?
Questions I worry and wonder about:
- Why does it sometimes work to retry starting the process?
Unclear at this time; we would need to get on a Zoom call or similar to do a formal triage.
- Why does it only happen after a shutdown of a node in the cluster, and not while normal operation?
Unclear, please share what is printed out when the others panic.
- If some state is indeed corrupt, how does that happen? Note that it happens consistently in our dev cluster even when I totally repave the cluster (i.e. stop all nodes, delete the data folders, restart all nodes). I'm unsure whether our accept and production clusters experience the same problems.
This is something we have not seen, so we would want to get a support call scheduled and do a Zoom and get access to more of the internals of your system to diagnose.
- It seems wrong that a consumer should be able to corrupt state in a way that prevents the cluster from recovering.
Agree.
- It seems very bad that a startup of a node in a cluster is able to panic the other servers in said cluster.
Again, agreed; we are on the same page, but we need much more information to help out. These are very complicated systems under the covers, even if they appear simple from the outside.
Is it possible to pay for direct support from you guys? I really feel I'm in over my head. As our production system currently relies on our NATS cluster, I'm very worried that I will make things more unstable as I try to fix this issue. I realize it could be some stupid thing I did, or we somehow have a bad host, but I don't really know what to look for.
Yes, send me an email and I will connect you with the right folks, but let's get a Zoom meeting scheduled regardless, ASAP. (derek at synadia dot com)
@derekcollison thanks for your help! Just to lighten the mood a bit: I think it's really great that you take the time to try and understand what is most likely something stupid we did on our end. I really like nats-server as a product, and again, thanks for your assistance so far. :) I agree that the product is complicated, and it doesn't help that the problem manifests itself in several different, very confusing ways. :)
Before I get back to you regarding a meeting, I just have to talk to some colleagues on Monday. I'll get back to you ASAP so we can try to set up a meeting. I'll also try one more time to make a local repro of the problem.
Unfortunately, I don't have the logs from the other servers when they panicked. It's hard to provoke that particular manifestation of the issue, and every time the dev cluster is down I disrupt the entire team of developers that are using the cluster for their test environments. I try to avoid those disruptions as much as I can. If I had a local repro of the problem it would be much faster/easier to debug it and give you the steps to repro the problem by yourself.
I'll try to contact you on Monday. Thanks again!