Can't keep JetStream cluster working when executing a rolling upgrade from 2.5.0 to 2.7.4

Open phho opened this issue 4 years ago • 4 comments

Versions of nats-server and affected client libraries used:

  • nats-server 2.7.4
  • nats helm chart 0.14.2

OS/Container environment:

  • K8s

Steps or code to reproduce the issue:

  1. Existing nats 2.5.0 cluster (5 instances) with Jetstream enabled
  2. Configure podManagementPolicy: OrderedReady ( https://github.com/nats-io/k8s/tree/main/helm/charts/nats#breaking-change-log ) for compatibility with the old Helm chart deployment
  3. Execute the rolling upgrade to 2.7.4. After the ordinal 4 and ordinal 3 pods are upgraded, the ordinal 2, 1, and 0 pods (the 3 pods still on the old version 2.5.0) start panicking with the following error logs and the upgrade cannot continue. The NATS cluster can't form a JetStream meta group because only two servers remain alive.
[1652] 2022/03/21 02:49:11.632347 [INF] Listening for route connections on 0.0.0.0:6222
[1652] 2022/03/21 02:49:11.633565 [INF] …:6222 - rid:225 - Route connection created
[1652] 2022/03/21 02:49:11.633658 [INF] …:6222 - rid:226 - Route connection created
[1652] 2022/03/21 02:49:11.633736 [INF] …:6222 - rid:227 - Route connection created
[1652] 2022/03/21 02:49:11.633760 [INF] …:6222 - rid:228 - Route connection created
panic: JetStream Cluster Unknown group entry op type! 12
goroutine 284 [running]:
github.com/nats-io/nats-server/server.(*jetStream).applyConsumerEntries(0xc0002d4000, 0xc000445c00, 0xc001ef6940, 0x0, 0x5, 0x1)
        /home/travis/gopath/src/github.com/nats-io/nats-server/server/jetstream_cluster.go:2948 +0x5e7
github.com/nats-io/nats-server/server.(*jetStream).monitorConsumer(0xc0002d4000, 0xc000445c00, 0xc000cbb0e0)
        /home/travis/gopath/src/github.com/nats-io/nats-server/server/jetstream_cluster.go:2874 +0x65f
github.com/nats-io/nats-server/server.(*jetStream).processClusterCreateConsumer.func1()
        /home/travis/gopath/src/github.com/nats-io/nats-server/server/jetstream_cluster.go:2691 +0x3c
created by github.com/nats-io/nats-server/server.(*Server).startGoRoutine
        /home/travis/gopath/src/github.com/nats-io/nats-server/server/server.go:2867 +0xc5

phho avatar Mar 21 '22 14:03 phho

About the issue, please let me know whether I need to provide more information to clarify the root cause, thanks. We found that 2.7.4 has more entryOp types (https://github.com/nats-io/nats-server/blob/v2.7.4/server/jetstream_cluster.go#L87-L88) than 2.5.0 (https://github.com/nats-io/nats-server/blob/v2.5.0/server/jetstream_cluster.go#L81). Is this why old and new versions of nats-server cannot run at the same time? Thanks.

phho avatar Mar 22 '22 16:03 phho

@phho could you provide the values files given to helm install for 2.5.0 and helm upgrade for 2.7.4, the same way this issue was reported?

Based on your description, it seems that the first pod upgraded to 2.7.4 kills the not-yet-upgraded pods with a newly added message type, PendingRequests. Interestingly, according to the code, NATS carefully handles the situation where a cluster consists of nodes running different versions: it checks a peer's version before sending out messages that may be incompatible, assuming the peer will soon be upgraded to the new version and then be able to understand the new message.

changtimwu avatar May 28 '22 06:05 changtimwu

Latest server release is 2.8.4.

derekcollison avatar May 28 '22 16:05 derekcollison

I upgraded NATS from 2.8.2 to 2.8.4 and experienced 5 minutes of downtime. The NATS log is as follows:

[nats-1 nats] [7] 2022/06/15 16:03:05.149963 [ERR] JetStream failed to store a msg on stream 'ADKTKO2PZJ5J2XSDTEZXHLOF6DLQVWPRAAXA5SY3EP7ACRXDGK6KKMFG > HOGE': expected sequence does not match store 
[nats-0 nats] [9] 2022/06/15 16:04:05.164590 [INF] JetStream cluster new stream leader for 'ADKTKO2PZJ5J2XSDTEZXHLOF6DLQVWPRAAXA5SY3EP7ACRXDGK6KKMFG > HOGE' 
[nats-1 nats] [7] 2022/06/15 16:04:05.153333 [WRN] Error applying entries to 'ADKTKO2PZJ5J2XSDTEZXHLOF6DLQVWPRAAXA5SY3EP7ACRXDGK6KKMFG > HOGE': last sequence mismatch 
[nats-1 nats] [7] 2022/06/15 16:04:05.188851 [WRN] Resetting stream cluster state for 'ADKTKO2PZJ5J2XSDTEZXHLOF6DLQVWPRAAXA5SY3EP7ACRXDGK6KKMFG > HOGE' 
[nats-0 nats] [9] 2022/06/15 16:04:06.224973 [WRN] Error loading message for catchup 'ADKTKO2PZJ5J2XSDTEZXHLOF6DLQVWPRAAXA5SY3EP7ACRXDGK6KKMFG > HOGE': malformed or corrupt message 
[nats-2 nats] [7] 2022/06/15 16:04:09.543781 [INF] JetStream cluster new consumer leader for 'ADKTKO2PZJ5J2XSDTEZXHLOF6DLQVWPRAAXA5SY3EP7ACRXDGK6KKMFG > HOGE > FUGA' 
[nats-1 nats] [7] 2022/06/15 16:04:09.716224 [INF] JetStream cluster new consumer leader for 'ADKTKO2PZJ5J2XSDTEZXHLOF6DLQVWPRAAXA5SY3EP7ACRXDGK6KKMFG > HOGE > FOO' 
[nats-1 nats] [7] 2022/06/15 16:04:16.167494 [WRN] Catchup for stream 'ADKTKO2PZJ5J2XSDTEZXHLOF6DLQVWPRAAXA5SY3EP7ACRXDGK6KKMFG > HOGE' stalled 
[nats-0 nats] [9] 2022/06/15 16:04:16.167943 [WRN] Error loading message for catchup 'ADKTKO2PZJ5J2XSDTEZXHLOF6DLQVWPRAAXA5SY3EP7ACRXDGK6KKMFG > HOGE': stream store EOF 
[nats-0 nats] [9] 2022/06/15 16:04:26.168326 [WRN] Error loading message for catchup 'ADKTKO2PZJ5J2XSDTEZXHLOF6DLQVWPRAAXA5SY3EP7ACRXDGK6KKMFG > HOGE': stream store EOF 
[nats-1 nats] [7] 2022/06/15 16:04:26.167825 [WRN] Catchup for stream 'ADKTKO2PZJ5J2XSDTEZXHLOF6DLQVWPRAAXA5SY3EP7ACRXDGK6KKMFG > HOGE' stalled 
[nats-1 nats] [7] 2022/06/15 16:04:36.168886 [WRN] Catchup for stream 'ADKTKO2PZJ5J2XSDTEZXHLOF6DLQVWPRAAXA5SY3EP7ACRXDGK6KKMFG > HOGE' stalled 
[nats-0 nats] [9] 2022/06/15 16:04:36.170278 [WRN] Error loading message for catchup 'ADKTKO2PZJ5J2XSDTEZXHLOF6DLQVWPRAAXA5SY3EP7ACRXDGK6KKMFG > HOGE': stream store EOF 
[nats-1 nats] [7] 2022/06/15 16:04:46.168983 [WRN] Catchup for stream 'ADKTKO2PZJ5J2XSDTEZXHLOF6DLQVWPRAAXA5SY3EP7ACRXDGK6KKMFG > HOGE' stalled 
[nats-0 nats] [9] 2022/06/15 16:04:46.169485 [WRN] Error loading message for catchup 'ADKTKO2PZJ5J2XSDTEZXHLOF6DLQVWPRAAXA5SY3EP7ACRXDGK6KKMFG > HOGE': stream store EOF 
[nats-1 nats] [7] 2022/06/15 16:04:56.169085 [WRN] Catchup for stream 'ADKTKO2PZJ5J2XSDTEZXHLOF6DLQVWPRAAXA5SY3EP7ACRXDGK6KKMFG > HOGE' stalled 
[nats-0 nats] [9] 2022/06/15 16:04:56.169566 [WRN] Error loading message for catchup 'ADKTKO2PZJ5J2XSDTEZXHLOF6DLQVWPRAAXA5SY3EP7ACRXDGK6KKMFG > HOGE': stream store EOF 
[nats-0 nats] [9] 2022/06/15 16:05:05.155487 [ERR] JetStream failed to store a msg on stream 'ADKTKO2PZJ5J2XSDTEZXHLOF6DLQVWPRAAXA5SY3EP7ACRXDGK6KKMFG > HOGE': expected sequence does not match store 
[nats-0 nats] [9] 2022/06/15 16:05:06.169968 [WRN] Error loading message for catchup 'ADKTKO2PZJ5J2XSDTEZXHLOF6DLQVWPRAAXA5SY3EP7ACRXDGK6KKMFG > HOGE': stream store EOF 
[nats-1 nats] [7] 2022/06/15 16:05:06.169579 [WRN] Catchup for stream 'ADKTKO2PZJ5J2XSDTEZXHLOF6DLQVWPRAAXA5SY3EP7ACRXDGK6KKMFG > HOGE' stalled 
[nats-1 nats] [7] 2022/06/15 16:05:16.170553 [WRN] Catchup for stream 'ADKTKO2PZJ5J2XSDTEZXHLOF6DLQVWPRAAXA5SY3EP7ACRXDGK6KKMFG > HOGE' stalled 
[nats-0 nats] [9] 2022/06/15 16:05:16.171853 [WRN] Error loading message for catchup 'ADKTKO2PZJ5J2XSDTEZXHLOF6DLQVWPRAAXA5SY3EP7ACRXDGK6KKMFG > HOGE': stream store EOF 
[nats-1 nats] [7] 2022/06/15 16:05:26.172215 [WRN] Catchup for stream 'ADKTKO2PZJ5J2XSDTEZXHLOF6DLQVWPRAAXA5SY3EP7ACRXDGK6KKMFG > HOGE' stalled 
[nats-0 nats] [9] 2022/06/15 16:05:26.174616 [WRN] Error loading message for catchup 'ADKTKO2PZJ5J2XSDTEZXHLOF6DLQVWPRAAXA5SY3EP7ACRXDGK6KKMFG > HOGE': stream store EOF 
[nats-1 nats] [7] 2022/06/15 16:05:36.172942 [WRN] Catchup for stream 'ADKTKO2PZJ5J2XSDTEZXHLOF6DLQVWPRAAXA5SY3EP7ACRXDGK6KKMFG > HOGE' stalled 
[nats-0 nats] [9] 2022/06/15 16:05:36.174129 [WRN] Error loading message for catchup 'ADKTKO2PZJ5J2XSDTEZXHLOF6DLQVWPRAAXA5SY3EP7ACRXDGK6KKMFG > HOGE': stream store EOF 
[nats-0 nats] [9] 2022/06/15 16:05:46.174158 [WRN] Error loading message for catchup 'ADKTKO2PZJ5J2XSDTEZXHLOF6DLQVWPRAAXA5SY3EP7ACRXDGK6KKMFG > HOGE': stream store EOF 
[nats-1 nats] [7] 2022/06/15 16:05:46.173607 [WRN] Catchup for stream 'ADKTKO2PZJ5J2XSDTEZXHLOF6DLQVWPRAAXA5SY3EP7ACRXDGK6KKMFG > HOGE' stalled 
[nats-1 nats] [7] 2022/06/15 16:05:56.173945 [WRN] Catchup for stream 'ADKTKO2PZJ5J2XSDTEZXHLOF6DLQVWPRAAXA5SY3EP7ACRXDGK6KKMFG > HOGE' stalled 
[nats-0 nats] [9] 2022/06/15 16:05:56.174407 [WRN] Error loading message for catchup 'ADKTKO2PZJ5J2XSDTEZXHLOF6DLQVWPRAAXA5SY3EP7ACRXDGK6KKMFG > HOGE': stream store EOF 
[nats-0 nats] [9] 2022/06/15 16:06:05.158048 [WRN] Error applying entries to 'ADKTKO2PZJ5J2XSDTEZXHLOF6DLQVWPRAAXA5SY3EP7ACRXDGK6KKMFG > HOGE': last sequence mismatch 
[nats-0 nats] [9] 2022/06/15 16:06:05.166286 [WRN] Resetting stream cluster state for 'ADKTKO2PZJ5J2XSDTEZXHLOF6DLQVWPRAAXA5SY3EP7ACRXDGK6KKMFG > HOGE' 
[nats-1 nats] [7] 2022/06/15 16:06:06.174356 [WRN] Catchup for stream 'ADKTKO2PZJ5J2XSDTEZXHLOF6DLQVWPRAAXA5SY3EP7ACRXDGK6KKMFG > HOGE' stalled 
[nats-1 nats] [7] 2022/06/15 16:06:16.174900 [WRN] Catchup for stream 'ADKTKO2PZJ5J2XSDTEZXHLOF6DLQVWPRAAXA5SY3EP7ACRXDGK6KKMFG > HOGE' stalled 
[nats-2 nats] [7] 2022/06/15 16:06:17.044857 [INF] JetStream cluster new stream leader for 'ADKTKO2PZJ5J2XSDTEZXHLOF6DLQVWPRAAXA5SY3EP7ACRXDGK6KKMFG > HOGE' 
[nats-2 nats] [7] 2022/06/15 16:06:17.048759 [INF] Catchup for stream 'ADKTKO2PZJ5J2XSDTEZXHLOF6DLQVWPRAAXA5SY3EP7ACRXDGK6KKMFG > HOGE' complete

I have NATS installed with helm and my values.yaml is as follows:

affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - podAffinityTerm:
        labelSelector:
          matchLabels:
            app.kubernetes.io/instance: nats
            app.kubernetes.io/name: nats
        topologyKey: topology.kubernetes.io/zone
      weight: 100
    - podAffinityTerm:
        labelSelector:
          matchLabels:
            app.kubernetes.io/instance: nats
            app.kubernetes.io/name: nats
        topologyKey: kubernetes.io/hostname
      weight: 1
auth:
  enabled: true
  resolver:
    allowDelete: true
    interval: 1m
    operator: foo
    resolverPreload:
      AASU727EDXEETDBIV4GLBLAHR75RIDPGSTDO5WYKHBJRM6G2U2V3BWWR: bar
    store:
      dir: /accounts/jwt
      size: 1Gi
    systemAccount: fuga
    type: full
cluster:
  enabled: true
  noAdvertise: true
  replicas: 3
nats:
  image: nats:2.8.4-alpine
  jetstream:
    domain: hoge
    enabled: true
    encryption:
      key: hogehoge
    fileStorage:
      accessModes:
      - ReadWriteOnce
      enabled: true
      size: 10Gi
      storageDirectory: /data
    memStorage:
      enabled: false
podDisruptionBudget:
  maxUnavailable: 0
  minAvailable: 2
websocket:
  enabled: true
  noTLS: true
  port: 8080

mugioka avatar Jun 15 '22 16:06 mugioka

We believe this has been greatly improved in the latest versions of the server (2.9.10 / 2.9.11) and the newest operator.

Feel free to re-open however if needed.

derekcollison avatar Jan 06 '23 17:01 derekcollison

@derekcollison I'm experiencing a similar issue on 2.9.11.

[1340] 2023/01/16 09:41:50.801208 [INF] Starting nats-server
[1340] 2023/01/16 09:41:50.801723 [INF]   Version:  2.9.11
[1340] 2023/01/16 09:41:50.802288 [INF]   Git:      [23ffc16]
[1340] 2023/01/16 09:41:50.802812 [INF]   Cluster:  nats-cluster-dev
[1340] 2023/01/16 09:41:50.803319 [INF]   Name:     DKCPH-VLCDNAT1
[1340] 2023/01/16 09:41:50.803443 [INF]   Node:     MTG7nqV6
[1340] 2023/01/16 09:41:50.803971 [INF]   ID:       NC5HMVWQHIEEZIDQQ2YXRWC4BUGULQJ43A6R6DLY3QR5JFMGALO7NHAE
[1340] 2023/01/16 09:41:50.803971 [WRN] Plaintext passwords detected, use nkeys or bcrypt
[1340] 2023/01/16 09:41:50.804494 [INF] Using configuration file: d:\nats-server\config\server.conf
[1340] 2023/01/16 09:41:50.812024 [INF] Starting http monitor on DKCPH-VLCDNAT1:8222
[1340] 2023/01/16 09:41:50.812939 [INF] Starting JetStream
[1340] 2023/01/16 09:41:50.813466 [INF]     _ ___ _____ ___ _____ ___ ___   _   __  __
[1340] 2023/01/16 09:41:50.814620 [INF]  _ | | __|_   _/ __|_   _| _ \ __| /_\ |  \/  |
[1340] 2023/01/16 09:41:50.815130 [INF] | || | _|  | | \__ \ | | |   / _| / _ \| |\/| |
[1340] 2023/01/16 09:41:50.815186 [INF]  \__/|___| |_| |___/ |_| |_|_\___/_/ \_\_|  |_|
[1340] 2023/01/16 09:41:50.815716 [INF]
[1340] 2023/01/16 09:41:50.815716 [INF]          https://docs.nats.io/jetstream
[1340] 2023/01/16 09:41:50.816261 [INF]
[1340] 2023/01/16 09:41:50.816792 [INF] ---------------- JETSTREAM ----------------
[1340] 2023/01/16 09:41:50.816792 [INF]   Max Memory:      3.00 GB
[1340] 2023/01/16 09:41:50.817316 [INF]   Max Storage:     30.00 GB
[1340] 2023/01/16 09:41:50.817316 [INF]   Store Directory: "d:\nats-server\data\jetstream"
[1340] 2023/01/16 09:41:50.817912 [INF] -------------------------------------------
[1340] 2023/01/16 09:41:50.826315 [INF] Starting JetStream cluster
[1340] 2023/01/16 09:41:50.826830 [INF] Creating JetStream metadata controller
[1340] 2023/01/16 09:41:50.827541 [INF] JetStream cluster recovering state
[1340] 2023/01/16 09:41:50.828575 [INF] Listening for client connections on DKCPH-VLCDNAT1:4222
[1340] 2023/01/16 09:41:50.830395 [INF] Server is ready
[1340] 2023/01/16 09:41:50.830395 [INF] Cluster name is nats-cluster-dev
[1340] 2023/01/16 09:41:50.830995 [INF] Listening for route connections on DKCPH-VLCDNAT1:6222
[1340] 2023/01/16 09:41:50.834923 [INF] 10.110.53.112:6222 - rid:57 - Route connection created
[1340] 2023/01/16 09:41:50.837936 [INF] 10.110.53.113:6222 - rid:63 - Route connection created
[1340] 2023/01/16 09:41:50.945865 [WRN] Waiting for routing to be established...
[1340] 2023/01/16 09:41:51.083640 [INF] 10.110.53.112:50914 - rid:87 - Route connection created
[1340] 2023/01/16 09:41:51.086342 [INF] 10.110.53.112:50914 - rid:87 - Router connection closed: Duplicate Route
[1340] 2023/01/16 09:41:51.104360 [INF] 10.110.53.113:56504 - rid:88 - Route connection created
[1340] 2023/01/16 09:41:51.104360 [INF] 10.110.53.113:56504 - rid:88 - Router connection closed: Duplicate Route
[1340] 2023/01/16 09:41:51.349264 [WRN] Error applying entries to 'DEV32 > cache-updates': invalid character '\x16' looking for beginning of value
panic: JetStream Cluster Unknown group entry op type!

goroutine 73 [running]:
github.com/nats-io/nats-server/v2/server.(*jetStream).applyStreamEntries(0xc00030e000, 0xc00044e000, 0x0?, 0x0)
        /home/travis/gopath/src/github.com/nats-io/nats-server/server/jetstream_cluster.go:2464 +0x1e1a
github.com/nats-io/nats-server/v2/server.(*jetStream).monitorStream(0xc00030e000, 0xc00044e000, 0xc0002a8f30, 0x0)
        /home/travis/gopath/src/github.com/nats-io/nats-server/server/jetstream_cluster.go:1902 +0x26e5
github.com/nats-io/nats-server/v2/server.(*jetStream).processClusterCreateStream.func1()
        /home/travis/gopath/src/github.com/nats-io/nats-server/server/jetstream_cluster.go:3132 +0x2a
created by github.com/nats-io/nats-server/v2/server.(*Server).startGoRoutine
        /home/travis/gopath/src/github.com/nats-io/nats-server/server/server.go:3083 +0x85

Additional info: we have a 3-server cluster running on Windows with multiple accounts (multi-tenant), one per dev environment; the streams are memory streams. I haven't been able to figure out the root cause of the problem. It usually happens after doing something like this:

  1. Stop one of the nats-server processes (i.e. stopping a windows service)
  2. Restart the service so the nats-server node can rejoin the cluster

Sometimes after doing this there's a problem; sometimes not. Sometimes the problem goes away after several restart attempts; sometimes not. Is it a race condition, or is the local data corrupt? Did I do something wrong when stopping the service? Deleting the local state/data folder does not fix the problem, so basically I can't join a new server to the cluster after the problem occurs.

Fixing the problem usually entails removing the offending streams from one or more accounts (note that it's usually not all accounts). For example, in the case below I would delete the cache-updates stream from the dev22 account. After deleting the stream, sometimes another account has the problem, and I delete the stream there as well, and so forth, until the server is able to join again.

If you have any ideas what I can do to help you debug the problem, let me know.

JohannesEH avatar Jan 16 '23 09:01 JohannesEH

Can you share the following?

As a system user, and using a recent version of the nats cli.

nats server ls
nats server report jetstream

And as a user for the DEV32 account.

nats stream ls -a
nats stream info cache-updates

Thanks.

derekcollison avatar Jan 16 '23 14:01 derekcollison

Sure, this time it was dev24 that acted up.

# nats server ls
╭───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│                                                                    Server Overview                                                                    │
├────────────────┬──────────────────┬────────────────┬─────────┬─────┬───────┬──────┬────────┬─────┬─────────┬───────┬───────┬──────┬────────────┬──────┤
│ Name           │ Cluster          │ Host           │ Version │ JS  │ Conns │ Subs │ Routes │ GWs │ Mem     │ CPU % │ Cores │ Slow │ Uptime     │ RTT  │
├────────────────┼──────────────────┼────────────────┼─────────┼─────┼───────┼──────┼────────┼─────┼─────────┼───────┼───────┼──────┼────────────┼──────┤
│ DKCPH-VLCDNAT3 │ nats-cluster-dev │ DKCPH-VLCDNAT3 │ 2.9.11  │ yes │ 7     │ 436  │ 1      │ 0   │ 54 MiB  │ 10    │ 2     │ 0    │ 16h22m48s  │ 8ms  │
│ DKCPH-VLCDNAT2 │ nats-cluster-dev │ DKCPH-VLCDNAT2 │ 2.9.11  │ yes │ 7     │ 436  │ 1      │ 0   │ 83 MiB  │ 4     │ 2     │ 0    │ 6d5h50m31s │ 12ms │
├────────────────┼──────────────────┼────────────────┼─────────┼─────┼───────┼──────┼────────┼─────┼─────────┼───────┼───────┼──────┼────────────┼──────┤
│                │ 1                │ 2              │         │ 2   │ 14    │ 872  │        │     │ 137 MiB │       │       │ 0    │            │      │
╰────────────────┴──────────────────┴────────────────┴─────────┴─────┴───────┴──────┴────────┴─────┴─────────┴───────┴───────┴──────┴────────────┴──────╯

╭─────────────────────────────────────────────────────────────────────────────────────╮
│                                  Cluster Overview                                   │
├──────────────────┬────────────┬───────────────────┬───────────────────┬─────────────┤
│ Cluster          │ Node Count │ Outgoing Gateways │ Incoming Gateways │ Connections │
├──────────────────┼────────────┼───────────────────┼───────────────────┼─────────────┤
│ nats-cluster-dev │ 2          │ 0                 │ 0                 │ 14          │
├──────────────────┼────────────┼───────────────────┼───────────────────┼─────────────┤
│                  │ 2          │ 0                 │ 0                 │ 14          │
╰──────────────────┴────────────┴───────────────────┴───────────────────┴─────────────╯

# nats server report jetstream
╭────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│                                                 JetStream Summary                                                  │
├─────────────────┬──────────────────┬─────────┬───────────┬──────────┬─────────┬─────────┬──────┬─────────┬─────────┤
│ Server          │ Cluster          │ Streams │ Consumers │ Messages │ Bytes   │ Memory  │ File │ API Req │ API Err │
├─────────────────┼──────────────────┼─────────┼───────────┼──────────┼─────────┼─────────┼──────┼─────────┼─────────┤
│ DKCPH-VLCDNAT3  │ nats-cluster-dev │ 6       │ 6         │ 9,284    │ 3.2 MiB │ 3.2 MiB │ 0 B  │ 22      │ 0       │
│ DKCPH-VLCDNAT2* │ nats-cluster-dev │ 6       │ 6         │ 9,284    │ 3.2 MiB │ 3.2 MiB │ 0 B  │ 31,511  │ 31,321  │
├─────────────────┼──────────────────┼─────────┼───────────┼──────────┼─────────┼─────────┼──────┼─────────┼─────────┤
│                 │                  │ 12      │ 12        │ 18,568   │ 6.4 MiB │ 6.4 MiB │ 0 B  │ 31,533  │ 31,321  │
╰─────────────────┴──────────────────┴─────────┴───────────┴──────────┴─────────┴─────────┴──────┴─────────┴─────────╯

╭──────────────────────────────────────────────────────────────────────╮
│                     RAFT Meta Group Information                      │
├────────────────┬──────────┬────────┬─────────┬────────┬────────┬─────┤
│ Name           │ ID       │ Leader │ Current │ Online │ Active │ Lag │
├────────────────┼──────────┼────────┼─────────┼────────┼────────┼─────┤
│ DKCPH-VLCDNAT1 │ MTG7nqV6 │        │ false   │ true   │ 46.07s │ 0   │
│ DKCPH-VLCDNAT2 │ 9zdjoip0 │ yes    │ true    │ true   │ 0.00s  │ 0   │
│ DKCPH-VLCDNAT3 │ iNksFxp1 │        │ true    │ true   │ 0.07s  │ 0   │
╰────────────────┴──────────┴────────┴─────────┴────────┴────────┴─────╯

# nats context select dev24
NATS Configuration Context "dev24"

      Server URLs: nats://dkcph-vlcdnat1:4222,nats://dkcph-vlcdnat2:4222,nats://dkcph-vlcdnat3:4222
         Username: dev24
         Password: *********
            Token: dev24
             Path: C:\Users\jhans\.config\nats\context\dev24.json

# nats stream ls -a
╭───────────────────────────────────────────────────────────────────────────────────────╮
│                                        Streams                                        │
├───────────────┬─────────────┬─────────────────────┬──────────┬─────────┬──────────────┤
│ Name          │ Description │ Created             │ Messages │ Size    │ Last Message │
├───────────────┼─────────────┼─────────────────────┼──────────┼─────────┼──────────────┤
│ cache-updates │             │ 2023-01-16 11:30:48 │ 1,190    │ 417 KiB │ 2.99s        │
╰───────────────┴─────────────┴─────────────────────┴──────────┴─────────┴──────────────╯

# nats stream info cache-updates
Information for Stream cache-updates created 2023-01-16 11:30:48

             Subjects: cache.update.>
             Replicas: 3
              Storage: Memory

Options:

            Retention: Limits
     Acknowledgements: true
       Discard Policy: Old
     Duplicate Window: 2m0s
    Allows Msg Delete: true
         Allows Purge: true
       Allows Rollups: false

Limits:

     Maximum Messages: unlimited
  Maximum Per Subject: unlimited
        Maximum Bytes: unlimited
          Maximum Age: 30m0s
 Maximum Message Size: unlimited
    Maximum Consumers: unlimited


Cluster Information:

                 Name: nats-cluster-dev
               Leader: DKCPH-VLCDNAT3
              Replica: DKCPH-VLCDNAT1, outdated, seen 2m28s ago, 99 operations behind
              Replica: DKCPH-VLCDNAT2, current, seen 0.53s ago

State:

             Messages: 1,190
                Bytes: 417 KiB
             FirstSeq: 17,516 @ 2023-01-16T18:13:39 UTC
              LastSeq: 18,705 @ 2023-01-16T18:43:29 UTC
     Active Consumers: 1
   Number of Subjects: 2

Here's the output from the server startup:

[8156] 2023/01/16 19:51:52.679909 [INF] Starting nats-server
[8156] 2023/01/16 19:51:52.688240 [INF]   Version:  2.9.11
[8156] 2023/01/16 19:51:52.688773 [INF]   Git:      [23ffc16]
[8156] 2023/01/16 19:51:52.690692 [INF]   Cluster:  nats-cluster-dev
[8156] 2023/01/16 19:51:52.692667 [INF]   Name:     DKCPH-VLCDNAT1
[8156] 2023/01/16 19:51:52.693224 [INF]   Node:     MTG7nqV6
[8156] 2023/01/16 19:51:52.693747 [INF]   ID:       NA5EQCP4QDSE6TRDDI33BBFW6DRSIM4HVVUF67WJ5KZLFGEXQTFUJ6H2
[8156] 2023/01/16 19:51:52.694268 [WRN] Plaintext passwords detected, use nkeys or bcrypt
[8156] 2023/01/16 19:51:52.694790 [INF] Using configuration file: d:\nats-server\config\server.conf
[8156] 2023/01/16 19:51:52.700736 [INF] Starting http monitor on DKCPH-VLCDNAT1:8222
[8156] 2023/01/16 19:51:52.701243 [INF] Starting JetStream
[8156] 2023/01/16 19:51:52.701911 [INF]     _ ___ _____ ___ _____ ___ ___   _   __  __
[8156] 2023/01/16 19:51:52.702301 [INF]  _ | | __|_   _/ __|_   _| _ \ __| /_\ |  \/  |
[8156] 2023/01/16 19:51:52.702836 [INF] | || | _|  | | \__ \ | | |   / _| / _ \| |\/| |
[8156] 2023/01/16 19:51:52.703365 [INF]  \__/|___| |_| |___/ |_| |_|_\___/_/ \_\_|  |_|
[8156] 2023/01/16 19:51:52.703893 [INF]
[8156] 2023/01/16 19:51:52.707820 [INF]          https://docs.nats.io/jetstream
[8156] 2023/01/16 19:51:52.711112 [INF]
[8156] 2023/01/16 19:51:52.711626 [INF] ---------------- JETSTREAM ----------------
[8156] 2023/01/16 19:51:52.712461 [INF]   Max Memory:      3.00 GB
[8156] 2023/01/16 19:51:52.712972 [INF]   Max Storage:     30.00 GB
[8156] 2023/01/16 19:51:52.713795 [INF]   Store Directory: "d:\nats-server\data\jetstream"
[8156] 2023/01/16 19:51:52.714305 [INF] -------------------------------------------
[8156] 2023/01/16 19:51:52.722670 [INF] Starting JetStream cluster
[8156] 2023/01/16 19:51:52.724497 [INF] Creating JetStream metadata controller
[8156] 2023/01/16 19:51:52.726441 [INF] JetStream cluster recovering state
[8156] 2023/01/16 19:51:52.728538 [INF] Listening for client connections on DKCPH-VLCDNAT1:4222
[8156] 2023/01/16 19:51:52.729895 [INF] Server is ready
[8156] 2023/01/16 19:51:52.730406 [INF] Cluster name is nats-cluster-dev
[8156] 2023/01/16 19:51:52.731279 [INF] Listening for route connections on DKCPH-VLCDNAT1:6222
[8156] 2023/01/16 19:51:52.738608 [INF] 10.110.53.112:6222 - rid:71 - Route connection created
[8156] 2023/01/16 19:51:52.747038 [INF] 10.110.53.113:6222 - rid:72 - Route connection created
[8156] 2023/01/16 19:51:52.846302 [WRN] Waiting for routing to be established...
[8156] 2023/01/16 19:51:53.025610 [INF] 10.110.53.113:49823 - rid:73 - Route connection created
[8156] 2023/01/16 19:51:53.031251 [INF] 10.110.53.113:49823 - rid:73 - Router connection closed: Duplicate Route
[8156] 2023/01/16 19:51:53.113164 [INF] 10.110.53.112:57498 - rid:74 - Route connection created
[8156] 2023/01/16 19:51:53.115043 [INF] 10.110.53.112:57498 - rid:74 - Router connection closed: Duplicate Route
[8156] 2023/01/16 19:51:53.282989 [WRN] Error applying entries to 'DEV24 > cache-updates': invalid character '\x16' looking for beginning of value
panic: JetStream Cluster Unknown group entry op type!

goroutine 93 [running]:
github.com/nats-io/nats-server/v2/server.(*jetStream).applyStreamEntries(0xc0002ac000, 0xc000079c00, 0x0?, 0x0)
        /home/travis/gopath/src/github.com/nats-io/nats-server/server/jetstream_cluster.go:2464 +0x1e1a
github.com/nats-io/nats-server/v2/server.(*jetStream).monitorStream(0xc0002ac000, 0xc000079c00, 0xc0002f9290, 0x0)
        /home/travis/gopath/src/github.com/nats-io/nats-server/server/jetstream_cluster.go:1902 +0x26e5
github.com/nats-io/nats-server/v2/server.(*jetStream).processClusterCreateStream.func1()
        /home/travis/gopath/src/github.com/nats-io/nats-server/server/jetstream_cluster.go:3132 +0x2a
created by github.com/nats-io/nats-server/v2/server.(*Server).startGoRoutine
        /home/travis/gopath/src/github.com/nats-io/nats-server/server/server.go:3083 +0x85

Here's the server info from before I took DKCPH-VLCDNAT1 down:

# nats server ls
╭────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│                                                                    Server Overview                                                                     │
├────────────────┬──────────────────┬────────────────┬─────────┬─────┬───────┬───────┬────────┬─────┬─────────┬───────┬───────┬──────┬────────────┬──────┤
│ Name           │ Cluster          │ Host           │ Version │ JS  │ Conns │ Subs  │ Routes │ GWs │ Mem     │ CPU % │ Cores │ Slow │ Uptime     │ RTT  │
├────────────────┼──────────────────┼────────────────┼─────────┼─────┼───────┼───────┼────────┼─────┼─────────┼───────┼───────┼──────┼────────────┼──────┤
│ DKCPH-VLCDNAT3 │ nats-cluster-dev │ DKCPH-VLCDNAT3 │ 2.9.11  │ yes │ 3     │ 554   │ 2      │ 0   │ 52 MiB  │ 13    │ 2     │ 0    │ 16h14m45s  │ 8ms  │
│ DKCPH-VLCDNAT1 │ nats-cluster-dev │ DKCPH-VLCDNAT1 │ 2.9.11  │ yes │ 7     │ 558   │ 2      │ 0   │ 54 MiB  │ 9     │ 2     │ 0    │ 9h4m31s    │ 8ms  │
│ DKCPH-VLCDNAT2 │ nats-cluster-dev │ DKCPH-VLCDNAT2 │ 2.9.11  │ yes │ 4     │ 554   │ 2      │ 0   │ 77 MiB  │ 12    │ 2     │ 0    │ 6d5h42m29s │ 10ms │
├────────────────┼──────────────────┼────────────────┼─────────┼─────┼───────┼───────┼────────┼─────┼─────────┼───────┼───────┼──────┼────────────┼──────┤
│                │ 1                │ 3              │         │ 3   │ 14    │ 1,666 │        │     │ 182 MiB │       │       │ 0    │            │      │
╰────────────────┴──────────────────┴────────────────┴─────────┴─────┴───────┴───────┴────────┴─────┴─────────┴───────┴───────┴──────┴────────────┴──────╯

╭─────────────────────────────────────────────────────────────────────────────────────╮
│                                  Cluster Overview                                   │
├──────────────────┬────────────┬───────────────────┬───────────────────┬─────────────┤
│ Cluster          │ Node Count │ Outgoing Gateways │ Incoming Gateways │ Connections │
├──────────────────┼────────────┼───────────────────┼───────────────────┼─────────────┤
│ nats-cluster-dev │ 3          │ 0                 │ 0                 │ 14          │
├──────────────────┼────────────┼───────────────────┼───────────────────┼─────────────┤
│                  │ 3          │ 0                 │ 0                 │ 14          │
╰──────────────────┴────────────┴───────────────────┴───────────────────┴─────────────╯

# nats server report jetstream
╭────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│                                                 JetStream Summary                                                  │
├─────────────────┬──────────────────┬─────────┬───────────┬──────────┬─────────┬─────────┬──────┬─────────┬─────────┤
│ Server          │ Cluster          │ Streams │ Consumers │ Messages │ Bytes   │ Memory  │ File │ API Req │ API Err │
├─────────────────┼──────────────────┼─────────┼───────────┼──────────┼─────────┼─────────┼──────┼─────────┼─────────┤
│ DKCPH-VLCDNAT3  │ nats-cluster-dev │ 6       │ 6         │ 9,336    │ 3.2 MiB │ 3.2 MiB │ 0 B  │ 20      │ 0       │
│ DKCPH-VLCDNAT1  │ nats-cluster-dev │ 6       │ 6         │ 6,950    │ 2.3 MiB │ 2.3 MiB │ 0 B  │ 22      │ 0       │
│ DKCPH-VLCDNAT2* │ nats-cluster-dev │ 6       │ 6         │ 9,336    │ 3.2 MiB │ 3.2 MiB │ 0 B  │ 31,161  │ 30,973  │
├─────────────────┼──────────────────┼─────────┼───────────┼──────────┼─────────┼─────────┼──────┼─────────┼─────────┤
│                 │                  │ 18      │ 18        │ 25,622   │ 8.8 MiB │ 8.8 MiB │ 0 B  │ 31,203  │ 30,973  │
╰─────────────────┴──────────────────┴─────────┴───────────┴──────────┴─────────┴─────────┴──────┴─────────┴─────────╯

╭──────────────────────────────────────────────────────────────────────╮
│                     RAFT Meta Group Information                      │
├────────────────┬──────────┬────────┬─────────┬────────┬────────┬─────┤
│ Name           │ ID       │ Leader │ Current │ Online │ Active │ Lag │
├────────────────┼──────────┼────────┼─────────┼────────┼────────┼─────┤
│ DKCPH-VLCDNAT1 │ MTG7nqV6 │        │ true    │ true   │ 0.48s  │ 0   │
│ DKCPH-VLCDNAT2 │ 9zdjoip0 │ yes    │ true    │ true   │ 0.00s  │ 0   │
│ DKCPH-VLCDNAT3 │ iNksFxp1 │        │ true    │ true   │ 0.48s  │ 0   │
╰────────────────┴──────────┴────────┴─────────┴────────┴────────┴─────╯

I'm also sometimes getting this error on startup, mentioning corrupt state in a consumer snapshot:

[6736] 2023/01/16 19:50:07.100929 [INF] Starting nats-server
[6736] 2023/01/16 19:50:07.101451 [INF]   Version:  2.9.11
[6736] 2023/01/16 19:50:07.102058 [INF]   Git:      [23ffc16]
[6736] 2023/01/16 19:50:07.102650 [INF]   Cluster:  nats-cluster-dev
[6736] 2023/01/16 19:50:07.103189 [INF]   Name:     DKCPH-VLCDNAT1
[6736] 2023/01/16 19:50:07.103724 [INF]   Node:     MTG7nqV6
[6736] 2023/01/16 19:50:07.104254 [INF]   ID:       NAVTCLEETMODB35IPM4TSBOVB5XF55ATNIZPJCC3IKM3CKNDYR2XVD4H
[6736] 2023/01/16 19:50:07.104777 [WRN] Plaintext passwords detected, use nkeys or bcrypt
[6736] 2023/01/16 19:50:07.105341 [INF] Using configuration file: d:\nats-server\config\server.conf
[6736] 2023/01/16 19:50:07.114807 [INF] Starting http monitor on DKCPH-VLCDNAT1:8222
[6736] 2023/01/16 19:50:07.115381 [INF] Starting JetStream
[6736] 2023/01/16 19:50:07.117983 [INF]     _ ___ _____ ___ _____ ___ ___   _   __  __
[6736] 2023/01/16 19:50:07.118494 [INF]  _ | | __|_   _/ __|_   _| _ \ __| /_\ |  \/  |
[6736] 2023/01/16 19:50:07.120273 [INF] | || | _|  | | \__ \ | | |   / _| / _ \| |\/| |
[6736] 2023/01/16 19:50:07.121174 [INF]  \__/|___| |_| |___/ |_| |_|_\___/_/ \_\_|  |_|
[6736] 2023/01/16 19:50:07.121687 [INF]
[6736] 2023/01/16 19:50:07.122539 [INF]          https://docs.nats.io/jetstream
[6736] 2023/01/16 19:50:07.123050 [INF]
[6736] 2023/01/16 19:50:07.123860 [INF] ---------------- JETSTREAM ----------------
[6736] 2023/01/16 19:50:07.124370 [INF]   Max Memory:      3.00 GB
[6736] 2023/01/16 19:50:07.125715 [INF]   Max Storage:     30.00 GB
[6736] 2023/01/16 19:50:07.128959 [INF]   Store Directory: "d:\nats-server\data\jetstream"
[6736] 2023/01/16 19:50:07.129473 [INF] -------------------------------------------
[6736] 2023/01/16 19:50:07.138807 [INF] Starting JetStream cluster
[6736] 2023/01/16 19:50:07.138807 [INF] Creating JetStream metadata controller
[6736] 2023/01/16 19:50:07.140170 [INF] JetStream cluster recovering state
[6736] 2023/01/16 19:50:07.140678 [INF] Listening for client connections on DKCPH-VLCDNAT1:4222
[6736] 2023/01/16 19:50:07.146230 [INF] Server is ready
[6736] 2023/01/16 19:50:07.150521 [INF] Cluster name is nats-cluster-dev
[6736] 2023/01/16 19:50:07.160689 [INF] Listening for route connections on DKCPH-VLCDNAT1:6222
[6736] 2023/01/16 19:50:07.166618 [INF] 10.110.53.113:6222 - rid:71 - Route connection created
[6736] 2023/01/16 19:50:07.167700 [INF] 10.110.53.112:6222 - rid:72 - Route connection created
[6736] 2023/01/16 19:50:07.264781 [WRN] Waiting for routing to be established...
[6736] 2023/01/16 19:50:07.301621 [INF] JetStream cluster new metadata leader: DKCPH-VLCDNAT2/nats-cluster-dev
[6736] 2023/01/16 19:50:07.304642 [ERR] JetStream cluster could not decode consumer snapshot for 'DEV24 > cache-updates > dkcph-vlcdapp24-velocityhost-consumer' [C-R3M-xRnu743F]
panic: corrupt state file

goroutine 50 [running]:
github.com/nats-io/nats-server/v2/server.(*jetStream).applyConsumerEntries(0xc0000fe000, 0xc0002fe480, 0x0?, 0x0)
        /home/travis/gopath/src/github.com/nats-io/nats-server/server/jetstream_cluster.go:4110 +0x85a
github.com/nats-io/nats-server/v2/server.(*jetStream).monitorConsumer(0xc0000fe000, 0xc0002fe480, 0xc0002accf0)
        /home/travis/gopath/src/github.com/nats-io/nats-server/server/jetstream_cluster.go:3989 +0xf76
github.com/nats-io/nats-server/v2/server.(*jetStream).processClusterCreateConsumer.func1()
        /home/travis/gopath/src/github.com/nats-io/nats-server/server/jetstream_cluster.go:3699 +0x25
created by github.com/nats-io/nats-server/v2/server.(*Server).startGoRoutine
        /home/travis/gopath/src/github.com/nats-io/nats-server/server/server.go:3083 +0x85

JohannesEH avatar Jan 16 '23 18:01 JohannesEH

Note there were also a few other accounts that were malfunctioning, but not DEV19 this time.

JohannesEH avatar Jan 16 '23 18:01 JohannesEH

All the accounts are set up exactly the same.

JohannesEH avatar Jan 16 '23 18:01 JohannesEH

Why is server DKCPH-VLCDNAT1 offline?

derekcollison avatar Jan 16 '23 19:01 derekcollison

It seems like the raft log has been corrupted.

Try scaling cache-updates down to 1 replica and then back up.

As the account user, run:

nats stream update cache-updates --replicas 1 -f

Then:

nats stream update cache-updates --replicas 3 -f

derekcollison avatar Jan 16 '23 19:01 derekcollison
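For reference, the two updates above can be combined into one small recovery script. This is only a sketch based on this thread: the stream name cache-updates comes from the error message above, and the pause between the two steps is an assumption added to let the meta leader apply the first change before expanding again. Run it as the account user, not the system user.

```shell
#!/bin/sh
# Sketch: rebuild the stream/consumer raft groups by collapsing the
# stream to one replica and then expanding it back to three.
set -e

STREAM=cache-updates   # stream name taken from the error in this thread

# Collapse to R1; this drops the follower replicas (and their raft logs).
nats stream update "$STREAM" --replicas 1 -f

# Assumption: give the meta leader a moment to apply the change.
sleep 5

# Expand back to R3; new peers re-sync from the remaining healthy replica.
nats stream update "$STREAM" --replicas 3 -f
```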

I took dkcph-vlcdnat1 offline intentionally (this could just as well be server patching or a rolling cluster update), and this is what provokes the problem. We initially found it during a rolling update, but it also occurs just by shutting down a node and bringing it back online.

I'm guessing this is a bug and not expected behavior? I mean, how does the raft log get corrupted during a normal shutdown/startup cycle of a node?

JohannesEH avatar Jan 17 '23 07:01 JohannesEH

And another question: why does it suddenly work after retrying to start the server some number of times?

JohannesEH avatar Jan 17 '23 07:01 JohannesEH

I think upgrading to the latest 2.9.11 is the first order of business. Which you have done, thank you.

Second is make sure you have a recent NATS cli version running and that you have credentials for the system account with a system user.

Once we are there we can engage a bit more.

derekcollison avatar Jan 17 '23 22:01 derekcollison

I have the latest of both nats-server (2.9.11) and nats-cli (0.0.35).

JohannesEH avatar Jan 18 '23 06:01 JohannesEH

The nats server report jetstream command needs a system account so I also have that in place. ;)

JohannesEH avatar Jan 18 '23 06:01 JohannesEH

I guess it doesn't have anything to do with the CLI client version, since the problem shows up during either nats-server shutdown or startup.

JohannesEH avatar Jan 18 '23 07:01 JohannesEH

Here are some -DVV logs from nats-server: one where nats-server exits with "panic: corrupt state file" and one where it exits with "panic: JetStream Cluster Unknown group entry op type!": unknown-group-entry-op-type.log, corrupt-state-file.log

JohannesEH avatar Jan 18 '23 07:01 JohannesEH

After retrying to start the server on the same cluster multiple times, it finally succeeded. Here's the -DVV log from the successful startup, including a normal shutdown: succesful-startup.log

JohannesEH avatar Jan 18 '23 07:01 JohannesEH

Note that all of these logs were generated without changing anything in the cluster or locally. Sometimes nats-server fails to start up with various error messages, sometimes it succeeds.

JohannesEH avatar Jan 18 '23 07:01 JohannesEH

I'll try to repro this on my local machine, so it will be easier to debug. I suspect it has something to do with multi-tenancy, as it is not occurring as much on our other clusters with fewer tenants. But of course this is just speculation at this point.

JohannesEH avatar Jan 18 '23 07:01 JohannesEH

The consumer snapshot seems to be corrupt.

Have you tried deleting that consumer and re-creating?

derekcollison avatar Jan 18 '23 22:01 derekcollison

Yes, it doesn't work. The only thing that works is retrying to start the nats-server process. Sometimes it works.

Another thing I noticed when running nats server report jetstream after I shut down one of the nodes: the node is not listed as being offline, just as not being current. Why doesn't raft realize that the server is offline?

Another thing I saw was that the other nodes in the cluster sometimes panic after I try to restart the node I stopped myself. That is very bad, as trying to recover a semi-functional cluster can bring it down entirely! :(

Questions I worry and wonder about:

  • Why does it sometimes work to retry starting the process?
  • Why does it only happen after a shutdown of a node in the cluster, and not while normal operation?
  • If some state is indeed corrupt, how does that happen? Note it happens consistently in our dev cluster even when I totally repave the cluster (i.e. stop all nodes, delete data folders, restart all nodes). I'm unsure whether our accept and production clusters experience the same problems.
    • It seems wrong that a consumer should be able to corrupt state in a way that prevents the cluster from recovering.
    • It seems very bad that a startup of a node in a cluster is able to panic the other servers in said cluster.

Is it possible to pay for direct support from you guys? I really feel I'm in over my head. As our production system currently relies on our NATS cluster, I'm very worried that I will make things more unstable as I try to fix this issue. I realize it could be some stupid thing I did, or we could have a bad host somehow, but I don't really know what to look for.

JohannesEH avatar Jan 20 '23 08:01 JohannesEH

Yes, it doesn't work. The only thing that works is retrying to start the nats-server process. Sometimes it works.

Another thing I noticed when running nats server report jetstream after I shut down one of the nodes: the node is not listed as being offline, just as not being current. Why doesn't raft realize that the server is offline?

Servers wait for some time to declare a server down. This allows for temporary disruptions etc.

If the server is shutdown gracefully it will signal to the others that it is going offline and that should be reported immediately.

Another thing I saw was that the other nodes in the cluster sometimes panic after I try to restart the node I stopped myself. That is very bad, as trying to recover a semi-functional cluster can bring it down entirely! :(

Can you share what is printed out when the servers panic? And double-checking: you are running the current server, 2.9.11, yes?

Questions I worry and wonder about:

  • Why does it sometimes work to retry starting the process?

Unclear at this time; we would need a formal triage over Zoom or something similar to diagnose this properly.

  • Why does it only happen after a shutdown of a node in the cluster, and not while normal operation?

Unclear, please share what is printed out when the others panic.

  • If some state is indeed corrupt, how does that happen? Note it happens consistently in our dev cluster even when I totally repave the cluster (i.e. stop all nodes, delete data folders, restart all nodes). I'm unsure whether our accept and production clusters experience the same problems.

This is something we have not seen, so we would want to get a support call scheduled and do a Zoom and get access to more of the internals of your system to diagnose.

  • It seems wrong that a consumer should be able to corrupt state in a way that prevents the cluster from recovering.

Agree.

  • It seems very bad that a startup of a node in a cluster is able to panic the other servers in said cluster.

Again agree. On same page, but we need much more information to help out. These are very complicated systems under the covers, even if they appear simple from the outside.

Is it possible to pay for direct support from you guys? I really feel I'm in over my head. As our production system is currently relying on our nats cluster I'm very worried, that I will make things more unstable as I try to fix this issue. I realize it could be some stupid thing I did, or we have a bad host somehow, but I don't really know what to look for.

Yes send me an email and I will connect you with the right folks, but let's get a Zoom calendar meeting scheduled regardless asap. (derek at synadia dot com)

derekcollison avatar Jan 20 '23 19:01 derekcollison

@derekcollison thanks for your help! Just to lighten the mood a bit, I think it's really great that you take the time to try to understand what is most likely something stupid we did on our end. I really like nats-server as a product, and again, thanks for your assistance so far. :) I agree that the product is complicated, and it doesn't help that the problem manifests itself in several different and very confusing ways. :)

Before I get back to you regarding a meeting I just have to talk to some colleagues on Monday. I'll get back to you ASAP, so we can try and set up a meeting. I'll also try one more time to make a local repro of the problem.

Unfortunately, I don't have the logs from the other servers when they panicked. It's hard to provoke that particular manifestation of the issue, and every time the dev cluster is down I disrupt the entire team of developers using it for their test environments, so I try to avoid those disruptions as much as I can. If I had a local repro of the problem, it would be much faster and easier to debug and to give you the steps to reproduce it yourself.

I'll try to contact you on Monday. Thanks again!

JohannesEH avatar Jan 20 '23 20:01 JohannesEH