
Inconsistent reads for R3 KV

cbrewster opened this issue 8 months ago · 12 comments

Observed behavior

We have an R3 KV bucket and recently upgraded from NATS 2.9 to 2.10.3. We often use CAS operations to update keys in the bucket: we read a key, make modifications, then update the key with the revision that was originally read. This worked great until the upgrade to 2.10, when we started to see errors like nats: wrong last sequence: 835882, as well as other inconsistent behavior on systems that read from the KV.
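
For reference, the CAS pattern described above looks roughly like the sketch below (Go with the nats.go client; the modify helper is a hypothetical stand-in for our application-specific edit, and error handling is abbreviated):

package main

import (
	"log"

	"github.com/nats-io/nats.go"
)

// modify stands in for our application-specific edit (hypothetical helper).
func modify(v []byte) []byte { return append([]byte{}, v...) }

func main() {
	// Connect and open the JetStream / KV context.
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Drain()

	js, err := nc.JetStream()
	if err != nil {
		log.Fatal(err)
	}
	kv, err := js.KeyValue("deployments")
	if err != nil {
		log.Fatal(err)
	}

	key := "6fa09715-c135-4cb0-8b7b-b7e5e49075b1"

	// Read the key and remember the revision it was read at.
	entry, err := kv.Get(key)
	if err != nil {
		log.Fatal(err)
	}

	newValue := modify(entry.Value())

	// Compare-and-set: the write only succeeds if the key is still at the
	// revision we read; otherwise the server replies
	// "nats: wrong last sequence: <seq>".
	if _, err := kv.Update(key, newValue, entry.Revision()); err != nil {
		log.Printf("CAS failed, retry with a fresh read: %v", err)
	}
}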

When investigating further, we saw inconsistent reads from the same NATS server for the same KV key. Here's an example of what I was observing:

$ nats req '$JS.API.DIRECT.GET.KV_deployments.$KV.deployments.6fa09715-c135-4cb0-8b7b-b7e5e49075b1' ''
19:58:23 Sending request on "$JS.API.DIRECT.GET.KV_deployments.$KV.deployments.6fa09715-c135-4cb0-8b7b-b7e5e49075b1"
19:58:24 Received with rtt 43.167285ms
19:58:24 Nats-Stream: KV_deployments
19:58:24 Nats-Subject: $KV.deployments.6fa09715-c135-4cb0-8b7b-b7e5e49075b1
19:58:24 Nats-Sequence: 835937
19:58:24 Nats-Time-Stamp: 2023-10-25T21:06:45.163089148Z
19:58:24 Nats-Expected-Last-Subject-Sequence: 832643

<message redacted>

$ nats req '$JS.API.DIRECT.GET.KV_deployments.$KV.deployments.6fa09715-c135-4cb0-8b7b-b7e5e49075b1' ''
19:58:27 Sending request on "$JS.API.DIRECT.GET.KV_deployments.$KV.deployments.6fa09715-c135-4cb0-8b7b-b7e5e49075b1"
19:58:27 Received with rtt 41.753095ms
19:58:27 Nats-Time-Stamp: 2023-10-25T23:17:08.937511603Z
19:58:27 Nats-Expected-Last-Subject-Sequence: 836168
19:58:27 Nats-Stream: KV_deployments
19:58:27 Nats-Subject: $KV.deployments.6fa09715-c135-4cb0-8b7b-b7e5e49075b1
19:58:27 Nats-Sequence: 836176

<message redacted>
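
The same lookup can also be made from a client. Roughly, reusing the JetStream context js from the sketch above (this assumes a nats.go version with direct get support via the nats.DirectGet option):

// Direct get of the latest message on the key's subject; with direct get
// enabled on the stream, any replica may answer, not just the leader.
msg, err := js.GetLastMsg("KV_deployments",
	"$KV.deployments.6fa09715-c135-4cb0-8b7b-b7e5e49075b1",
	nats.DirectGet())
if err != nil {
	log.Fatal(err)
}
log.Printf("seq=%d time=%s", msg.Sequence, msg.Time)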

While this view doesn't show which server answered, we have another command that provides a UI on top of NATS for our use case, and it demonstrated the same inconsistency from a single server:

$ deployctl deployment get 6fa09715-c135-4cb0-8b7b-b7e5e49075b1
server: nats-0
revision: 835937 2023-10-25 16:06:45.163089148 -0500 CDT
...

$ deployctl deployment get 6fa09715-c135-4cb0-8b7b-b7e5e49075b1
server: nats-0
revision: 836176 2023-10-25 18:17:08.937511603 -0500 CDT
...

This only started after we upgraded to 2.10.x earlier this week, so we think it is likely a regression.

We tried to roll back to 2.9.23, but it seems there are data format incompatibilities that prevented us from doing so:

[18] 2023/10/26 01:11:01.715960 [WRN] Error applying entries to 'A... > KV_deployments': invalid character '*' looking for beginning of value
[18] 2023/10/26 01:11:01.716092 [WRN] Error applying entries to 'A... > KV_deployments': invalid character '*' looking for beginning of value

This has been causing a lot of strange errors for us, since the data can be stale for many hours. Restarting a NATS server helps temporarily, but it doesn't prevent the issue from happening again.

KV Config:

Information for Stream KV_deployments created 2023-03-30 09:38:22

             Subjects: $KV.deployments.>
             Replicas: 3
              Storage: File

Options:

            Retention: Limits
     Acknowledgements: true
       Discard Policy: New
     Duplicate Window: 2m0s
           Direct Get: true
    Allows Msg Delete: false
         Allows Purge: true
       Allows Rollups: true

Limits:

     Maximum Messages: unlimited
  Maximum Per Subject: 1
        Maximum Bytes: unlimited
          Maximum Age: unlimited
 Maximum Message Size: unlimited
    Maximum Consumers: unlimited


Cluster Information:

                 Name: deployments-us-central1
               Leader: nats-2
              Replica: nats-0, current, seen 0.21s ago
              Replica: nats-1, current, seen 0.21s ago

State:

             Messages: 10,040
                Bytes: 2.4 MiB
             FirstSeq: 367 @ 2023-03-30T15:06:59 UTC
              LastSeq: 836,276 @ 2023-10-26T02:18:41 UTC
     Deleted Messages: <redacted>
     Active Consumers: 4
   Number of Subjects: <redacted>
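
For completeness, a bucket with roughly this configuration can be created from a client as in the sketch below (Go with nats.go, reusing js from the earlier sketch; this shows the equivalent settings, not the exact command we originally used):

// Roughly equivalent KV configuration: R3, file storage, history of 1
// per key (Maximum Per Subject: 1), no TTL or size limits.
kv, err := js.CreateKeyValue(&nats.KeyValueConfig{
	Bucket:   "deployments",
	History:  1,
	Storage:  nats.FileStorage,
	Replicas: 3,
})
if err != nil {
	log.Fatal(err)
}
_ = kv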

Expected behavior

We'd expect the same NATS server to return consistent values after a key is written, not serve stale values minutes or hours after the update.

Server and client version

Server: 2.10.3

Host environment

We're running NATS on GKE using the 0.1.x Helm chart.

Steps to reproduce

Unfortunately, we do not have exact steps to reproduce at the moment, but we will attempt to find a reproduction soon. Help before then would be appreciated, either to fix forward on 2.10.x or to find a safe way to roll back to 2.9.x until a fix is in place.

cbrewster · Oct 26 '23 02:10