Abnormal NATS write load associated with a specific JetStream [v2.9.16, v2.10.16]
Observed behavior
Hello.
We have detected high disk write load (up to 100 MB/s) on some servers of the NATS cluster.
Restarting the NATS server reduces the disk load. We had no idea what could be causing it, so we analyzed the affected process with strace:
407112 08:02:01 write(90</data/jetstream/MYSTREAM/streams/my-stream_prod/obs/my-stream_user_rating_fetch_did_request_v1_prod/o.dat>, "\26\2\345\255\241 \244\260\206\234\1\360\213\363\177\344\323\253\226\4\242\211\303\1\364\351\230\344\f\335\300\326"..., 58764653 <unfinished ...>
407112 08:02:01 <... write resumed> ) = 58764653
407112 08:02:02 write(43</data/jetstream/MYSTREAM/streams/my-stream_prod/obs/my-stream_user_rating_fetch_did_request_v1_prod/o.dat>, "\26\2\345\255\241 \244\260\206\234\1\363\213\363\177\361\323\253\226\4\243\211\303\1\364\351\230\344\f\253\264\275"..., 58764663 <unfinished ...>
407112 08:02:02 <... write resumed> ) = 58764663
407112 08:02:03 write(43</data/jetstream/MYSTREAM/streams/my-stream_prod/obs/my-stream_user_rating_fetch_did_request_v1_prod/o.dat>, "\26\2\345\255\241 \244\260\206\234\1\364\213\363\177\363\323\253\226\4\241\211\303\1\366\351\230\344\f\316\307\242"..., 58764643 <unfinished ...>
407112 08:02:03 <... write resumed> ) = 58764643
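Every write in the trace goes to the same consumer's o.dat state file, roughly one full ~58 MB rewrite per second. One way to correlate this with consumer state is to poll the per-consumer state on the stream while watching the disk writes. Below is a minimal sketch using the nats.go client (legacy JetStreamContext API); the connection URL and the 5-second interval are assumptions, the stream name is the one above, and this is not the reporter's code.
// poll_consumers.go: a sketch that periodically prints per-consumer state for
// the stream, to correlate with the o.dat writes seen in strace.
package main

import (
	"fmt"
	"log"
	"time"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect(nats.DefaultURL) // URL is an assumption
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()

	js, err := nc.JetStream()
	if err != nil {
		log.Fatal(err)
	}

	for {
		// List every consumer on the stream and print the state fields that
		// typically grow a consumer's on-disk state (pending acks, redeliveries).
		for ci := range js.Consumers("my-stream_prod") {
			fmt.Printf("%-50s ackPending=%d redelivered=%d pending=%d waiting=%d\n",
				ci.Name, ci.NumAckPending, ci.NumRedelivered, ci.NumPending, ci.NumWaiting)
		}
		time.Sleep(5 * time.Second)
	}
}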
nats stream info
? Select a Stream my-stream_prod
Information for Stream my-stream_prod created 2023-04-27 14:25:50
Subjects: my-stream.*.*.prod
Replicas: 3
Storage: File
Options:
Retention: Limits
Acknowledgements: true
Discard Policy: Old
Duplicate Window: 2m0s
Direct Get: true
Allows Msg Delete: true
Allows Purge: true
Allows Rollups: false
Limits:
Maximum Messages: unlimited
Maximum Per Subject: unlimited
Maximum Bytes: unlimited
Maximum Age: 15m0s
Maximum Message Size: unlimited
Maximum Consumers: unlimited
Cluster Information:
Name: nats
Leader: nats-cluster-4
Replica: nats-cluster-1, current, seen 0.00s ago
Replica: nats-cluster-2, outdated, seen 0.00s ago, 1 operation behind
State:
Messages: 111,685
Bytes: 21 MiB
FirstSeq: 4,136,545,200 @ 2024-05-15T10:51:20 UTC
LastSeq: 4,136,656,884 @ 2024-05-15T11:06:20 UTC
Active Consumers: 10
Number of Subjects: 9
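For reference, a rough nats.go equivalent of the stream configuration above (a sketch only, not taken from the reporter's code; the connection URL is an assumption, and each field maps to an option shown by nats stream info):
// create_stream.go: a sketch of a StreamConfig matching the reported options.
package main

import (
	"log"
	"time"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect(nats.DefaultURL) // URL is an assumption
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()

	js, err := nc.JetStream()
	if err != nil {
		log.Fatal(err)
	}

	_, err = js.AddStream(&nats.StreamConfig{
		Name:        "my-stream_prod",
		Subjects:    []string{"my-stream.*.*.prod"},
		Retention:   nats.LimitsPolicy, // Retention: Limits
		Discard:     nats.DiscardOld,   // Discard Policy: Old
		Storage:     nats.FileStorage,  // Storage: File
		Replicas:    3,                 // Replicas: 3
		MaxAge:      15 * time.Minute,  // Maximum Age: 15m0s
		Duplicates:  2 * time.Minute,   // Duplicate Window: 2m0s
		AllowDirect: true,              // Direct Get: true
	})
	if err != nil {
		log.Fatal(err)
	}
}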
nats consumer info
? Select a Stream my-stream_prod
? Select a Consumer [Use arrows to move, type to filter]
>
...
my-stream_user_rating_fetch_did_request_v1_prod
...
Expected behavior
Disk write throughput stays at ~1 MB/s, as it does after a restart.
Server and client version
nats-server:2.9.16
github.com/nats-io/nats.go v1.31.0
Host environment
Kubernetes v1.26.12-eks
Installed from helm chart:
apiVersion: v2
appVersion: 2.8.4
description: A Helm chart for the NATS.io High Speed Cloud Native Distributed Communications
Technology.
home: http://github.com/nats-io/k8s
icon: https://nats.io/img/nats-icon-color.png
keywords:
- nats
- messaging
- cncf
maintainers:
- email: [email protected]
name: Waldemar Quevedo
url: https://github.com/wallyqs
- email: [email protected]
name: Colin Sullivan
url: https://github.com/ColinSullivan1
- email: [email protected]
name: Jaime Piña
url: https://github.com/nsurfer
- email: [email protected]
name: Caleb Lloyd
url: https://github.com/caleblloyd
name: nats
version: 0.17.5
Steps to reproduce
No response
This is a pretty old NATS server version. I would start with upgrading to the latest 2.10.14 to see if that solves the problem. We did put a lot of effort into JetStream between those two versions.
Thank you, I'll try.
When we updated the version from 2.8.4 to 2.9.16, performance became much better.
The behavior still seems strange, though: we have many other streams and there are no problems with any of them.
Hi! I'm part of the team. We upgraded the cluster to version 2.10.16 but are still having problems.
Can you please provide some context? At a minimum: updated stream info and consumer info, updated statistics/graphs, and any other relevant configuration changes.
We have detected high disk write load on some servers of the NATS cluster. Restarting helps resolve the issue.
The cluster has 3 nodes running v2.10.16. It has 8 JetStream streams, and all of them have the same options:
stream info
Replicas: 3
Storage: File
Options:
Retention: Limits
Acknowledgments: true
Discard Policy: Old
Duplicate Window: 2m0s
Direct Get: true
Allows Msg Delete: true
Allows Purge: true
Allows Rollups: false
Limits:
Maximum Messages: unlimited
Maximum Per Subject: unlimited
Maximum Bytes: unlimited
Maximum Age: 15m0s
Maximum Message Size: unlimited
Maximum Consumers: unlimited
Cluster Information:
Name: nats-v2-10
Leader: nats-v2-10-1
Replica: nats-v2-10-0, current, seen 57ms ago
Replica: nats-v2-10-2, current, seen 56ms ago
State:
Messages: 23,913
Bytes: 11 MiB
First Sequence: 239,457,256 @ 2024-10-18 09:37:24 UTC
Last Sequence: 239,481,168 @ 2024-10-18 09:52:24 UTC
Active Consumers: 11
Number of Subjects: 10
consumer info
Configuration:
Name: bid_did_active_v1_ru_prod
Pull Mode: true
Filter Subject: bids.bid_did_active.v1.ru.prod
Deliver Policy: All
Ack Policy: Explicit
Ack Wait: 6.00s
Replay Policy: Instant
Maximum Deliveries: 5
Max Waiting Pulls: 512
Cluster Information:
Name: nats-v2-10
Leader: nats-v2-10-1
Replica: nats-v2-10-0, current, seen 23ms ago
Replica: nats-v2-10-2, current, seen 23ms ago
State:
Last Delivered Message: Consumer sequence: 44,196,707 Stream sequence: 239,491,181 Last delivery: 38ms ago
Acknowledgment Floor: Consumer sequence: 44,196,707 Stream sequence: 239,491,181 Last Ack: 24ms ago
Outstanding Acks: 0
Redelivered Messages: 0
Unprocessed Messages: 0
Waiting Pulls: 2 of maximum 512
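For context, a rough nats.go sketch of a durable pull consumer with this configuration plus a simple fetch/ack loop. The consumer name and filter subject are the ones shown above; the stream name ("bids_prod"), URL, and batch size are assumptions, and this is not the reporter's actual code.
// pull_consumer.go: a sketch of a pull consumer matching the reported
// configuration, with a basic fetch/ack loop.
package main

import (
	"errors"
	"log"
	"time"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect(nats.DefaultURL) // URL is an assumption
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()

	js, err := nc.JetStream()
	if err != nil {
		log.Fatal(err)
	}

	// "bids_prod" is an assumed stream name; the options mirror the consumer info above.
	_, err = js.AddConsumer("bids_prod", &nats.ConsumerConfig{
		Durable:       "bid_did_active_v1_ru_prod",
		FilterSubject: "bids.bid_did_active.v1.ru.prod",
		DeliverPolicy: nats.DeliverAllPolicy,    // Deliver Policy: All
		AckPolicy:     nats.AckExplicitPolicy,   // Ack Policy: Explicit
		AckWait:       6 * time.Second,          // Ack Wait: 6.00s
		ReplayPolicy:  nats.ReplayInstantPolicy, // Replay Policy: Instant
		MaxDeliver:    5,                        // Maximum Deliveries: 5
		MaxWaiting:    512,                      // Max Waiting Pulls: 512
	})
	if err != nil {
		log.Fatal(err)
	}

	// Bind a pull subscription to the durable and fetch in batches.
	sub, err := js.PullSubscribe("bids.bid_did_active.v1.ru.prod", "bid_did_active_v1_ru_prod")
	if err != nil {
		log.Fatal(err)
	}

	for {
		msgs, err := sub.Fetch(10, nats.MaxWait(2*time.Second))
		if err != nil && !errors.Is(err, nats.ErrTimeout) {
			log.Fatal(err)
		}
		for _, m := range msgs {
			// ... process the message, then ack explicitly (Ack Policy: Explicit).
			_ = m.Ack()
		}
	}
}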
Metrics
Disk writes: before restart > 100 MB/s, after restart < 2 MB/s
Memory
The load profile doesn't change
After 12 days, we see a slow increase in disk writes per day
Load profile
@neilalexander do you need more information?