
JetStream does not survive after "stale NFS file handle" error

Open dpotapov opened this issue 2 years ago • 27 comments

Defect

I'm running NATS cluster on AWS Fargate. For ~2 months it was running ok (NATS v2.6.6), but 2 days ago it suffered storage issue with EFS backend. JetStream became unavailable. I've tried to upgrade NATS to a recent version (v2.7.0) in a hope it would automatically heal itself, but it did not.

Versions of nats-server and affected client libraries used:

  • NATS 2.6.6 or NATS 2.7.0

OS/Container environment:

  • AWS Fargate with efs-volume as storage backend
  • For the reproduction steps below I'm using a CentOS 7 VM

nats-server -DV:

[20215] 2022/01/16 19:44:28.561325 [INF] Starting nats-server
[20215] 2022/01/16 19:44:28.561508 [INF]   Version:  2.7.0
[20215] 2022/01/16 19:44:28.561512 [INF]   Git:      [not set]
[20215] 2022/01/16 19:44:28.561514 [DBG]   Go build: go1.17.6
[20215] 2022/01/16 19:44:28.561518 [INF]   Name:     NBME2FKEOX72T5L6VUKYUEXKFIWIRUBXAQHDUTJHH3XOBQJUOIQ4CP6Q
[20215] 2022/01/16 19:44:28.561523 [INF]   ID:       NBME2FKEOX72T5L6VUKYUEXKFIWIRUBXAQHDUTJHH3XOBQJUOIQ4CP6Q
[20215] 2022/01/16 19:44:28.561547 [DBG] Created system account: "$SYS"
[20215] 2022/01/16 19:44:28.561910 [INF] Listening for client connections on 0.0.0.0:4222
[20215] 2022/01/16 19:44:28.561917 [DBG] Get non local IPs for "0.0.0.0"
[20215] 2022/01/16 19:44:28.562071 [DBG]   ip=198.61.70.28
[20215] 2022/01/16 19:44:28.562097 [DBG]   ip=172.17.0.1
[20215] 2022/01/16 19:44:28.562108 [INF] Server is ready

Steps or code to reproduce the issue:

  1. Install NFS server (here and after a CentOS 7 VM is used): yum install nfs-utils

  2. Configure /etc/exports and apply the changes with systemctl restart nfs:

    /tmp/nats localhost(rw,no_root_squash)
    
  3. mount NFS:

    mount -t nfs -o hard localhost:/tmp/nats /mnt/nfsnats
    
  4. Start JetStream cluster (run each command in a new shell)

    nats-server -p 4222 -cluster nats://localhost:4248 --cluster_name test-cluster -routes nats://localhost:4248,nats://localhost:5248,nats://localhost:6248 -js -sd /mnt/nfsnats/natsA --name natsA
    nats-server -p 5222 -cluster nats://localhost:5248 --cluster_name test-cluster -routes nats://localhost:4248,nats://localhost:5248,nats://localhost:6248 -js -sd /mnt/nfsnats/natsB --name natsB
    nats-server -p 6222 -cluster nats://localhost:6248 --cluster_name test-cluster -routes nats://localhost:4248,nats://localhost:5248,nats://localhost:6248 -js -sd /mnt/nfsnats/natsC --name natsC
    
  5. Create a stream and start ingesting data:

    nats str add --replicas 3 --subjects data --storage file DATA
    # all other options with default values
    i=0; while true; do ((i++)); nats pub data $i; sleep 0.3; done
    
  6. Simulate an NFS file handle error by "breaking" the mount point of the leader server for the DATA stream:

    cp -r /tmp/nats /tmp/nats.backup
    rm -rf /tmp/nats
    mv /tmp/nats.backup /tmp/nats
    
  7. Restart NFS:

    systemctl restart nfs

  8. JetStream shuts down:
[6229] 2022/01/16 20:19:39.416387 [ERR] JetStream failed to store a msg on stream '$G > DATA': write /mnt/nfs/nats/natsB/jetstream/$G/streams/DATA/msgs/1.blk: bad file descriptor
[6229] 2022/01/16 20:19:39.737394 [ERR] JetStream failed to store a msg on stream '$G > DATA': error opening msg block file ["/mnt/nfs/nats/natsB/jetstream/$G/streams/DATA/msgs/1.blk"]: open /mnt/nfs/nats/natsB/jetstream/$G/streams/DATA/msgs/1.blk: stale NFS file handle
[6229] 2022/01/16 20:19:40.068785 [ERR] JetStream failed to store a msg on stream '$G > DATA': error opening msg block file ["/mnt/nfs/nats/natsB/jetstream/$G/streams/DATA/msgs/1.blk"]: open /mnt/nfs/nats/natsB/jetstream/$G/streams/DATA/msgs/1.blk: stale NFS file handle
[6229] 2022/01/16 20:19:40.401472 [ERR] JetStream failed to store a msg on stream '$G > DATA': error opening msg block file ["/mnt/nfs/nats/natsB/jetstream/$G/streams/DATA/msgs/1.blk"]: open /mnt/nfs/nats/natsB/jetstream/$G/streams/DATA/msgs/1.blk: stale NFS file handle
[6229] 2022/01/16 20:19:44.316535 [ERR] RAFT [wURfHZ9N - _meta_] Critical write error: open /mnt/nfs/nats/natsB/jetstream/$SYS/_js_/_meta_/tav.idx: stale NFS file handle
[6229] 2022/01/16 20:19:44.316564 [WRN] RAFT [wURfHZ9N - _meta_] Error writing term and vote file for "_meta_": open /mnt/nfs/nats/natsB/jetstream/$SYS/_js_/_meta_/tav.idx: stale NFS file handle
[6229] 2022/01/16 20:19:44.316846 [ERR] JetStream out of resources, will be DISABLED
[6229] 2022/01/16 20:19:44.317655 [ERR] RAFT [wURfHZ9N - _meta_] Critical write error: open /mnt/nfs/nats/natsB/jetstream/$SYS/_js_/_meta_/tav.idx: stale NFS file handle
[6229] 2022/01/16 20:19:44.317667 [WRN] RAFT [wURfHZ9N - _meta_] Error writing term and vote file for "_meta_": open /mnt/nfs/nats/natsB/jetstream/$SYS/_js_/_meta_/tav.idx: stale NFS file handle
[6229] 2022/01/16 20:19:45.325688 [INF] Initiating JetStream Shutdown...
[6229] 2022/01/16 20:19:45.325943 [INF] JetStream Shutdown

Any JetStream operation fails with "JetStream system temporarily unavailable"

  9. Fix the mount point:

    umount -f -l /mnt/nfsnats
    mount -t nfs -o hard localhost:/tmp/nats /mnt/nfsnats
    
  10. Restarting NATS doesn't recover JetStream:

$ nats str ls
No Streams defined

All data appears to be lost. But it is still there, in /mnt/nfsnats/...!

Expected result:

JetStream data should be recovered as much as possible.

Actual result:

All JetStream data is no longer available.

dpotapov avatar Jan 16 '22 20:01 dpotapov

UPD: after restoring from backups, the prod cluster worked for a while and then ran into "has NO quorum, stalled".

dpotapov avatar Jan 17 '22 15:01 dpotapov

In general we do not recommend persisting JetStream data via a shared filesystem like EFS. Direct or at least attached block storage is preferred.

I would back up the important streams and recover them on a new system as described above.
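
For reference, a rough sketch of that backup/restore flow with the nats CLI, using the DATA stream from the repro above (exact syntax may differ between CLI versions, so check nats stream backup --help):

    # back up the stream's messages and config to a local directory
    nats stream backup DATA ./DATA-backup

    # later, against the new or recovered cluster, restore it
    nats stream restore ./DATA-backup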

derekcollison avatar Jan 17 '22 16:01 derekcollison

It seems AWS ECS Fargate supports only EFS volumes. My project likely won't justify using EC2 instances + EBS...

Regardless of the storage layer, it feels like the clustering mechanism still requires some attention.

dpotapov avatar Jan 17 '22 17:01 dpotapov

We have a global service that runs in all the major cloud providers and geos that might help.

What aspects of clustering do you think need attention? If the state we store is corrupted or removed, we do try to detect that and respond as best we can. You also have full control of adding and removing peers from the system and from individual assets like streams.
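
For example, a hedged sketch of that peer management with the nats CLI, using the stream/server names from this issue (check nats stream cluster --help for your CLI version):

    # force the DATA stream's RAFT leader to step down and trigger a new election
    nats stream cluster step-down DATA

    # remove a dead or invalid peer from the DATA stream's RAFT group;
    # the metadata leader should then place a replacement replica
    nats stream cluster peer-remove DATA natsB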

derekcollison avatar Jan 17 '22 23:01 derekcollison

What aspects of clustering do you think need attention?

I'd like to see that NATS restores itself to a working state (assuming there will be a partial data loss).

dpotapov avatar Jan 18 '22 20:01 dpotapov

Understood, but if you have an R2 stream and you lose one of its replicas, you are stuck. So I would recommend at least an R3 in that scenario to tolerate a single server loss and be able to recover.

derekcollison avatar Jan 19 '22 00:01 derekcollison

You can drop and add a new peer for R2, but during that process the stream will not function.

derekcollison avatar Jan 19 '22 00:01 derekcollison

Could you please clarify what you mean by R2/R3? Is it the number of replicas per stream or the cluster node size?

My cluster has 3 nodes and all streams have 3 replicas.

dpotapov avatar Jan 19 '22 20:01 dpotapov

I think I've found a relevant doc article:

A stream's replication factor (R, often referred to as the number of 'Replicas') determines how many places it is stored, allowing you to tune the balance between risk, resource usage, and performance. A stream that is easily rebuilt or temporary might be memory based with R=1, and a stream that can tolerate some downtime might be file based with R=1. Typical usage, to operate through typical outages while balancing performance, would be a file-based stream with R=3. A highly resilient, but less performant and more expensive configuration is R=5, the replication factor limit. Rather than defaulting to the maximum, we suggest selecting the best option based on the use case behind the stream. This optimizes resource usage to create a more resilient system at scale.

  • Replicas=1 - Cannot operate during an outage of the server servicing the stream. Highly performant.
  • Replicas=2 - No significant benefit at this time. We recommend using Replicas=3 instead.
  • Replicas=3 - Can tolerate loss of one server servicing the stream. An ideal balance between risk and performance.
  • Replicas=4 - No significant benefit over Replicas=3 except marginally in a 5-node cluster.
  • Replicas=5 - Can tolerate simultaneous loss of two servers servicing the stream. Mitigates risk at the expense of performance.

Since my streams have R=3, they should survive 1 node being down.

So in the case of 2 or more nodes down, it is fair that there are no guarantees. But, assuming that happens, is it correct to expect that all data will be lost (unless you have backups)?

dpotapov avatar Jan 19 '22 21:01 dpotapov

Data is only really lost if all copies are lost or corrupted.

derekcollison avatar Jan 20 '22 15:01 derekcollison

That's not what I'm observing. I just did a few more experiments.

  1. Start a new cluster with 3 nodes
  2. Create a stream with 3 replicas
  3. Put random data in it
  4. Remove the data for 2 nodes (so at least 1 copy is not corrupted)
  5. Wait until JetStream is shut down
  6. Shut down the remaining node

Now at this point, it all depends on which node comes back first:

  1. if you bring up a node where the data wasn't removed - the cluster will operate just fine and the other nodes will pick up the data
  2. if you bring up a node where the data was removed - the cluster will obviously start as brand-new with no data

For the 2nd scenario, if you then run the remaining 2 nodes (where one of them has the original data), nats str report will show "No Streams defined". So the original stream won't be picked up automatically. But if you add a stream with the same name, strange things happen: there will be a collision between the original stream on one of the nodes and the new empty stream.

The nats stream report command gives different results on each execution:

# nats str report
Obtaining Stream stats

╭────────────────────────────────────────────────────────────────────────────────────────────╮
│                                       Stream Report                                        │
├────────┬─────────┬───────────┬──────────┬─────────┬──────┬─────────┬───────────────────────┤
│ Stream │ Storage │ Consumers │ Messages │ Bytes   │ Lost │ Deleted │ Replicas              │
├────────┼─────────┼───────────┼──────────┼─────────┼──────┼─────────┼───────────────────────┤
│ DATA   │ File    │ 0         │ 61       │ 2.3 KiB │ 0    │ 0       │ nats1!, nats2, nats3* │
╰────────┴─────────┴───────────┴──────────┴─────────┴──────┴─────────┴───────────────────────╯

# nats str report
Obtaining Stream stats

╭────────────────────────────────────────────────────────────────────────────────────────────╮
│                                       Stream Report                                        │
├────────┬─────────┬───────────┬──────────┬────────┬──────┬─────────┬────────────────────────┤
│ Stream │ Storage │ Consumers │ Messages │ Bytes  │ Lost │ Deleted │ Replicas               │
├────────┼─────────┼───────────┼──────────┼────────┼──────┼─────────┼────────────────────────┤
│ DATA   │ File    │ 0         │ 1,854    │ 69 KiB │ 0    │ 0       │ nats1!, nats2!, nats3! │
╰────────┴─────────┴───────────┴──────────┴────────┴──────┴─────────┴────────────────────────╯

# nats str report
Obtaining Stream stats

╭────────────────────────────────────────────────────────────────────────────────────────────╮
│                                       Stream Report                                        │
├────────┬─────────┬───────────┬──────────┬─────────┬──────┬─────────┬───────────────────────┤
│ Stream │ Storage │ Consumers │ Messages │ Bytes   │ Lost │ Deleted │ Replicas              │
├────────┼─────────┼───────────┼──────────┼─────────┼──────┼─────────┼───────────────────────┤
│ DATA   │ File    │ 0         │ 104      │ 3.9 KiB │ 0    │ 0       │ nats1!, nats2, nats3* │
╰────────┴─────────┴───────────┴──────────┴─────────┴──────┴─────────┴───────────────────────╯

# nats str report
Obtaining Stream stats

╭────────────────────────────────────────────────────────────────────────────────────────────╮
│                                       Stream Report                                        │
├────────┬─────────┬───────────┬──────────┬────────┬──────┬─────────┬────────────────────────┤
│ Stream │ Storage │ Consumers │ Messages │ Bytes  │ Lost │ Deleted │ Replicas               │
├────────┼─────────┼───────────┼──────────┼────────┼──────┼─────────┼────────────────────────┤
│ DATA   │ File    │ 0         │ 1,854    │ 69 KiB │ 0    │ 0       │ nats1!, nats2!, nats3! │
╰────────┴─────────┴───────────┴──────────┴────────┴──────┴─────────┴────────────────────────╯

The metadata leader reports the following in the logs:

[4155] 2022/01/20 17:10:36.139323 [WRN] JetStream cluster stream '$G > DATA' has NO quorum, stalled.
[4155] 2022/01/20 17:10:57.388699 [WRN] JetStream cluster stream '$G > DATA' has NO quorum, stalled.
[4155] 2022/01/20 17:11:17.921937 [WRN] JetStream cluster stream '$G > DATA' has NO quorum, stalled.
[4155] 2022/01/20 17:11:38.481643 [WRN] JetStream cluster stream '$G > DATA' has NO quorum, stalled.
[4155] 2022/01/20 17:12:00.688205 [WRN] JetStream cluster stream '$G > DATA' has NO quorum, stalled.
[4155] 2022/01/20 17:12:22.405500 [WRN] JetStream cluster stream '$G > DATA' has NO quorum, stalled.

dpotapov avatar Jan 20 '22 17:01 dpotapov

If you know data has been lost from a peer, it can be caught up by the current leader and the system will understand that. If you shut down the complete cluster and certain peers are no longer valid or authoritative, you need to remove them from the system, add new peers, and have the system re-assign the new peers to those assets.
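
A rough sketch of that peer removal with the nats CLI (this assumes a system-account context, and command names may vary slightly by CLI version):

    # see which peers the cluster currently knows about and how far behind they are
    nats server report jetstream

    # permanently remove a server that is gone for good from the JetStream meta group
    nats server raft peer-remove natsB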

derekcollison avatar Jan 24 '22 16:01 derekcollison

Hi. I think there is a serious bug in HA mode. Our cluster went into a broken state while:

  1. Deploying a new nats-server version. We are using the official helm (k8s) chart and the latest sources.
  2. It also just randomly stopped working a few times. We only noticed has NO quorum, stalled errors. This happened with multiple nats-server versions.

In all cases only a fresh deployment helped, and all data was lost.

@derekcollison I understand that JetStream is still not battle tested and is under heavy development. I think it would help if the documentation stated that JetStream is not yet production ready when high availability without data loss is required. https://github.com/nats-io/nats-streaming-server states that new deployments should use JetStream, but until HA mode is fixed it can't replace the old nats-streaming-server.

anjmao avatar Jan 25 '22 11:01 anjmao

@anjmao While I agree JetStream is in active development, we have a bunch of folks running it in production.

We have noticed a bunch of folks struggling with k8s and NATS. NATS is not meant to really run inside something like k8s which is very opinionated about networking stack etc.

That being said, we are committed to doing all we can for folks to have a good experience within that ecosystem, and we actually have a new PR that might help k8s understand when JetStream itself is in a good spot during server upgrades, etc.

For this specific case we need a bunch more info to be helpful.

What server version? What k8s version? What helm chart were you using? Is NATS running on the node port, or do clients need to go through an ingress controller or another type of layer 7 router like a service mesh? How are you doing persistence? Direct-attached disks, block storage, or network mounts?

derekcollison avatar Jan 25 '22 15:01 derekcollison

might be related https://github.com/nats-io/nack/issues/64

joriaty-ben avatar Mar 21 '22 08:03 joriaty-ben

@derekcollison

We have noticed a bunch of folks struggling with k8s and NATS. NATS is not meant to really run inside something like k8s which is very opinionated about networking stack etc

Sorry, this totally sounds like NATS JetStream is definitely not production ready for k8s. This is funny because we are actually using it in production and facing issues similar to those described above. Is there a hint in the official documentation that NATS JetStream is not really meant to run inside k8s? I didn't find any. Would you then rather suggest Kafka or something else for production? Serious question!

joriaty-ben avatar Mar 21 '22 09:03 joriaty-ben

NATS JetStream is ready for production, and we are working hard with our customers to have a better experience when running inside of K8S.

The areas we want to watch out for are the following.

  1. Ingress controllers and overriding NATS client and server connectivity. We suggest using NodePort.
  2. K8S DNS, if you rely on it for multiple A records, etc.
  3. Memory limits and the OOM killer. This is not specific to NATS, but more of an issue for Golang programs inside K8S containers with limits. We were experimenting with a team the other day tweaking GOGC (see the sketch after this list), and also did some work this weekend on reducing NATS w/ JetStream's memory usage to hopefully avoid the OOM altogether.
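
As a hedged illustration of the GOGC tweak mentioned in point 3: GOGC is the standard Go runtime knob, not a NATS-specific setting, and the right value depends on your workload.

    # run the garbage collector more aggressively (the Go default is GOGC=100),
    # trading some CPU for a lower peak heap inside a memory-limited container
    # (the config path is just an example)
    GOGC=50 nats-server -c /etc/nats/nats.conf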

We have also put more work into our helm charts, specifically utilizing the /healthz endpoint for a server when upgrading, before moving on to the next server.
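
For example, a simple health check against the monitoring endpoint (this assumes the monitoring port is enabled, e.g. with -m 8222 or http_port 8222 in the config):

    # typically returns {"status":"ok"} once the server - and JetStream, if enabled - is healthy
    curl -s http://localhost:8222/healthz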

derekcollison avatar Mar 21 '22 14:03 derekcollison

I keep using NATS v2.7.0 on AWS Fargate containers with an EFS backend (I know, I know, the NATS team doesn't recommend it) and a few days ago I ran into a clustering issue again. It started with RAFT warnings and ended up with the "no quorum" issue. To recover from this state I'm wiping out all JetStream data and starting the cluster from scratch. I don't bother restoring from backups, because it didn't work for me a few times in the past (either replica count issues or just "JetStream not available") - maybe those issues are fixed already.

2022/03/13 10:08:17.568835 [WRN] RAFT [IW1Cq6kK - C-R3F-SyVXYBkl] Error storing entry to WAL: raft: could not storeentry to WAL
2022/03/13 10:08:17.569718 [WRN] RAFT [y9YYE3xf - C-R3F-SyVXYBkl] Error storing entry to WAL: raft: could not storeentry to WAL
2022/03/13 10:08:17.721094 [WRN] RAFT [IW1Cq6kK - C-R3F-SyVXYBkl] Error storing entry to WAL: raft: could not storeentry to WAL
2022/03/13 10:08:17.721111 [WRN] RAFT [IW1Cq6kK - C-R3F-SyVXYBkl] Expected first catchup entry to be a snapshot and peerstate, will retry
2022/03/13 10:08:17.721184 [WRN] RAFT [y9YYE3xf - C-R3F-SyVXYBkl] Error storing entry to WAL: raft: could not storeentry to WAL
2022/03/13 10:08:17.721202 [WRN] RAFT [y9YYE3xf - C-R3F-SyVXYBkl] Expected first catchup entry to be a snapshot and peerstate, will retry

An issue like the above is not tied to AWS Fargate or EFS. Per my experiments, it may occur whenever all NATS server instances suffer an outage at the same time. Imagine a cluster deployed in a single datacenter (k8s, docker containers, or standalone binaries on dedicated physical machines - it doesn't matter): a single power outage can potentially make a production cluster unusable.

I'd really like to see efforts improving the cluster recovery (with some partial data loss if necessary).

dpotapov avatar Mar 21 '22 19:03 dpotapov

If you can, please upgrade to at least 2.7.4. JetStream is rapidly improving and it is best to stay as current as possible.

The upcoming 2.7.5 also has some improvements, so if you are kicking the tires or doing testing you can use our nightly-built docker image at synadia/nats-server:nightly.
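
For local testing, something roughly along these lines (a sketch, not a production setup):

    docker pull synadia/nats-server:nightly
    # run a single throwaway server with JetStream enabled
    docker run --rm -p 4222:4222 synadia/nats-server:nightly -js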

derekcollison avatar Mar 21 '22 19:03 derekcollison

We are experiencing a similar issue. Here are our logs:

[7] 2022/04/12 04:25:16.853868 [WRN] RAFT [Qe10nf2S - C-R3F-xWj36Zjd] Expected first catchup entry to be a snapshot and peerstate, will retry
[7] 2022/04/12 04:25:16.854512 [WRN] RAFT [Qe10nf2S - C-R3F-xWj36Zjd] Error storing entry to WAL: raft: could not storeentry to WAL
[7] 2022/04/12 04:25:16.854692 [WRN] RAFT [Qe10nf2S - C-R3F-xWj36Zjd] Expected first catchup entry to be a snapshot and peerstate, will retry
[7] 2022/04/12 04:25:16.855359 [WRN] RAFT [Qe10nf2S - C-R3F-xWj36Zjd] Error storing entry to WAL: raft: could not storeentry to WAL
[7] 2022/04/12 04:25:16.855384 [WRN] RAFT [Qe10nf2S - C-R3F-xWj36Zjd] Expected first catchup entry to be a snapshot and peerstate, will retry

Is there a way to recover from this? We tried restarting the pod. That did not help. Only one of the 3 pods is showing logs like this.

Update: We updated nats to 2.7.4-alpine. Still the same error.

Update 2: After deleting PVC for the offending pod and restarting the pod, things seem to be stable.

tamalsaha avatar Apr 12 '22 04:04 tamalsaha

I also think that JetStream's storage high availability has some drawbacks. If replication is not done at the IO level, nodes cannot ensure data consistency; but if replication is done at the IO level, it is impossible to ensure high performance. When data consistency cannot be ensured, the new leader should not attempt to recover from the dead leader's state - data chaos is worse than data loss. JetStream's storage solution is advanced, and the introduction of object storage extends even more functionality, but the storage high availability solution needs more research.

guojianyu avatar Apr 27 '22 09:04 guojianyu

On version 2.8.1 and experiencing the log flooding as well

[535] 2022/05/02 15:37:09.351378 [WRN] RAFT [cnrtt3eg - C-R3F-C4IFG7Z8] Expected first catchup entry to be a snapshot and peerstate, will retry
[535] 2022/05/02 15:37:09.351751 [WRN] RAFT [cnrtt3eg - C-R3F-C4IFG7Z8] Error storing entry to WAL: raft: could not store entry to WAL
[535] 2022/05/02 15:37:09.351763 [WRN] RAFT [cnrtt3eg - C-R3F-C4IFG7Z8] Expected first catchup entry to be a snapshot and peerstate, will retry
[535] 2022/05/02 15:37:09.352074 [WRN] RAFT [cnrtt3eg - C-R3F-C4IFG7Z8] Error storing entry to WAL: raft: could not store entry to WAL
[535] 2022/05/02 15:37:09.352089 [WRN] RAFT [cnrtt3eg - C-R3F-C4IFG7Z8] Expected first catchup entry to be a snapshot and peerstate, will retry
[535] 2022/05/02 15:37:09.352511 [WRN] RAFT [cnrtt3eg - C-R3F-C4IFG7Z8] Error storing entry to WAL: raft: could not store entry to WAL

Is there a way to reduce the volume of these log warnings? I just blew through 50 GB of logs in less than 24 hours.

cchatfield avatar May 02 '22 16:05 cchatfield

Sometimes moving the leader will help but in your instance I suggest shutting down the server and removing the directory in question.

<storage_dir>/jetstream/$SYS/_js_/C-R3F-C4IFG7Z8

Then restart the server. This will resolve.
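
A sketch of that cleanup, assuming the storage directory is /data (adjust to whatever your -sd / store_dir points at). Note the single quotes: $SYS is a literal directory name and must not be expanded by the shell.

    # stop the affected nats-server first
    rm -rf '/data/jetstream/$SYS/_js_/C-R3F-C4IFG7Z8'
    # then restart the server; the RAFT group state is rebuilt from the remaining peers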

Thanks for your patience.

derekcollison avatar May 02 '22 16:05 derekcollison

Thank you for the explanation. I had deleted the pvc and restarted the pod and the error went away. I will try your suggestion the next time.

@derekcollison is there a way to limit the warning entries in the log? There are multiple per ms and it quickly goes over the logging budget available.

cchatfield avatar May 02 '22 17:05 cchatfield

We have plans to place it in the 1s periodic log interval for 2.8.2, but more importantly we need to figure out what triggers it and how to have it self-resolve.

Any more information you could share? Was the consumer active? Did this get triggered by a server update or restart? Anything would be helpful. Thanks.

derekcollison avatar May 02 '22 18:05 derekcollison

[Caveat] I just started using NATS, so I am definitely not an expert at running it. I also know the setup below is not optimal, but it is a dev env and I am saving money with spot VMs. Prod will not be on spot instances but will still be on k8s.

I have a node pool with 2 VMs in our dev environment. I am running 2.8.1 from a helm chart with 3 replicas in a stateful set. 10G standard (pd-standard) disks are mounted as PVCs.

Streams are managed via nack.

The VMs are spot instances. I had one drop with two stateful set pods (nats-0, nats-1) on it, and the cluster came back up fine.

I deleted all the streams (some were memory) and recreated with nack to be file.

I updated the cluster from 2.8.0 -> 2.8.1.

nats-2 came up first as 2.8.1 and went into a container restart loop with errors (no quorum, stalled). It restarted 10 times with errors like stream and account is not current and then started erroring with could not store entry to WAL.

It then got the error RAFT [cnrtt3eg - C-R3F-BS1Fy02I] Corrupt WAL, will truncate.

nats-2 Self is new JetStream cluster metadata leader

nats-0, nats-1 come up as 2.8.1

nats-2 continues with WAL errors

Deleted the nats-2 PVC, restarted, and the errors went away.
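
Roughly what that looked like, with a hypothetical PVC name - check kubectl get pvc for the actual claim bound to the nats-2 pod:

    # the PVC stays in Terminating until the pod releases it
    kubectl delete pvc nats-js-pvc-nats-2
    # deleting the pod lets the PVC go; the StatefulSet recreates both with a fresh volume
    kubectl delete pod nats-2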

cchatfield avatar May 02 '22 21:05 cchatfield

Thanks, that is helpful.

derekcollison avatar May 02 '22 21:05 derekcollison

Facing a similar issue:

RAFT [ZVxVRk09 - _meta_] Error storing entry to WAL: raft: could not store entry to WAL
RAFT [ZVxVRk09 - _meta_] Expected first catchup entry to be a snapshot and peerstate, will retry

@derekcollison is there a way to limit the entries to the WAL for _meta_ in 2.8.4? And is the above resolved? Can you point to the fix?

vishaltripathi24 avatar Jan 10 '23 18:01 vishaltripathi24

Please upgrade to 2.9.11

derekcollison avatar Jan 10 '23 19:01 derekcollison

2.8.x is no longer supported.

derekcollison avatar Jan 10 '23 19:01 derekcollison