JetStream KV corruption on Windows with reboot after KV creation [v2.10.24]

Open ckasabula opened this issue 10 months ago • 6 comments

Observed behavior

[6836] 2025/01/21 13:08:26.450317 [WRN] Filestore [KV_foo] indexCacheBuf corrupt record state: dlen 1166566284 slen 54266 index 0 rl 1166566306 lbuf 508
[6836] 2025/01/21 13:08:55.031990 [WRN] Filestore [KV_foo] indexCacheBuf corrupt state: mb.first 10 mb.last 0
[6836] 2025/01/21 13:08:55.031990 [WRN] Filestore [KV_foo] indexCacheBuf corrupt state: mb.first 10 mb.last 0
[6836] 2025/01/21 13:08:55.032560 [WRN] Filestore [KV_foo] indexCacheBuf corrupt state: mb.first 10 mb.last 0
[6836] 2025/01/21 13:08:55.032560 [WRN] Filestore [KV_foo] indexCacheBuf corrupt state: mb.first 10 mb.last 0
[6836] 2025/01/21 13:08:55.033180 [WRN] Filestore [KV_foo] indexCacheBuf corrupt state: mb.first 10 mb.last 0
[6836] 2025/01/21 13:08:55.033180 [WRN] Filestore [KV_foo] indexCacheBuf corrupt state: mb.first 10 mb.last 0
[6836] 2025/01/21 13:08:55.033751 [WRN] Filestore [KV_foo] indexCacheBuf corrupt state: mb.first 10 mb.last 0
[6836] 2025/01/21 13:08:55.033751 [WRN] Filestore [KV_foo] indexCacheBuf corrupt state: mb.first 10 mb.last 0
[6836] 2025/01/21 13:08:55.039885 [ERR] JetStream failed to store a msg on stream '$G > KV_foo': corrupt state file

Corruption is corrected on next startup but with data loss. Restore loses values.

Expected behavior

No corruption

Server and client version

nats-server: v2.10.24
nats-cli: 0.1.6

Host environment

Windows Server 2019 Standard, Windows Server 2022 Standard

Steps to reproduce

nats-server config:

host: 0.0.0.0
port: 4222
debug: true

jetstream {
  store_dir: "C:\\nats_test"
  cipher: "chachapoly",
  key: "ScxCcMDcemw8COVUtPVsdfLMRLG1PGpj"
}
  • nats-server configured as a windows service or running as an app
  • jetstream folder not created (first run)
  • start nats-server
  • nats kv add foo
  • reboot (if service), or press Ctrl+C to stop nats-server and then reboot (if app)
  • after the reboot, with nats-server running again, run the following:
> nats kv put foo test 1
1
> nats kv put foo test 2
2
> nats kv put foo test 3
3
> nats kv put foo test 4
4
> nats kv put foo test 5
5
> nats kv put foo test 6
6
> nats kv put foo test 7
7
> nats kv put foo test 8
8
> nats kv put foo test 9
9
> nats kv put foo test 10
10
> nats kv del foo test
? Delete key foo > test? Yes
> nats kv put foo test 10
nats: error: nats: corrupt state file
>

NOTE: If you write to the KV bucket before the reboot, the corruption doesn't happen; this is our workaround. The issue is repeatable.

ckasabula avatar Jan 21 '25 21:01 ckasabula

This is also an issue for JetStream work queues.

ckasabula avatar Jan 22 '25 18:01 ckasabula

There is a setting, sync: always, that could potentially help here. This issue also looks similar: https://github.com/nats-io/nats-server/issues/5412

wallyqs avatar Jan 22 '25 18:01 wallyqs

Is there a function to force a JetStream filesystem flush?

ckasabula avatar Jan 27 '25 13:01 ckasabula

> There is a setting sync: always that could potentially help here. This issue also looks similar #5412

sync: always had no effect. Same corruption.

ckasabula avatar Jan 29 '25 21:01 ckasabula

Looking at the configuration documentation, I think you want sync_interval: always, not sync: always.

I would expect sync: always to do nothing/have no effect since it is not a valid configuration key (or at least not documented).

iAnomaly avatar May 09 '25 17:05 iAnomaly

Unfortunately, the server has a lot of aliases and multiple names for the same config item, and not all of them are shown in the docs or examples. In this case, that means both sync and sync_interval are valid.

ripienaar avatar May 09 '25 17:05 ripienaar
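
For reference, a sketch of where the setting discussed above would sit in the reproduction config. It assumes the placement given in the configuration documentation for sync_interval (inside the jetstream block); per the comment above, sync is accepted as another name for the same option. The reporter saw the same corruption with sync: always, so this only illustrates the syntax and is not a confirmed fix.

```
jetstream {
  store_dir: "C:\\nats_test"
  cipher: "chachapoly",
  key: "ScxCcMDcemw8COVUtPVsdfLMRLG1PGpj"

  # Ask the filestore to sync to disk on every write instead of on a timer.
  # Expect lower throughput with this enabled.
  sync_interval: always
}
```

With this in place, the reproduction steps from the original report stay the same; only the server config changes.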