
JetStream: high disk IO, CPU usage, timeouts (Internal subscription took too long)

tierpod opened this issue 1 year ago

Observed behavior

We are seeing strange NATS JetStream behavior: it periodically produces a huge amount of disk IO (reads and writes) even without any clients connected (0 producers, 0 consumers). As a result, CPU usage is high during these reads and writes, and timeouts appear in the log:

Internal subscription on "$JS.API.CONSUMER.CREATE.XXX" took too long: 26.464577809

Stream configuration:

     Acknowledgements: true
            Retention: File - WorkQueue
             Replicas: 1
       Discard Policy: Old
     Duplicate Window: 2m0s
    Allows Msg Delete: true
         Allows Purge: true
       Allows Rollups: false
     Maximum Messages: unlimited
        Maximum Bytes: 30 GiB
          Maximum Age: 30d0h0m0s
 Maximum Message Size: unlimited
    Maximum Consumers: unlimited
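
For reference, a stream with roughly this shape could be defined with the nats CLI along the following lines (a sketch only; the stream name "JOBS" and subject "jobs.>" are placeholders, not the real names from our setup):

    nats stream add JOBS \
      --subjects "jobs.>" \
      --storage file \
      --retention work \
      --replicas 1 \
      --discard old \
      --dupe-window 2m \
      --max-bytes 30GB \
      --max-age 30d \
      --ack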

State:

             Messages: 37,886,565
                Bytes: 19 GiB
             FirstSeq: 19,941,809 @ 2024-06-25T07:57:13 UTC
              LastSeq: 65,677,737 @ 2024-06-25T10:28:36 UTC
     Deleted Messages: 7849364
     Active Consumers: 10

One service produced all of these messages; another service consumes them in batches (more slowly than the producer).

I'm not sure how to reproduce it, but I have a backup of this "broken" stream locally and ran some experiments with different versions.

[screenshot 1]

NATS 2.10.14 (1), 2.10.12 (2): after restoring this stream from the backup (first spike), I see only one such IO/CPU spike after 3-5 minutes, and afterwards NATS works fine.

NATS 2.10.16 (3): after restoring this stream from the backup (first spike), I see repeating spikes every N minutes. N changes when I change the jetstream.sync_interval configuration:

[screenshot 2]
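
For reference, that setting lives in the jetstream block of the server configuration; a minimal sketch (the store_dir path is illustrative, and 2m is just the documented default, not a recommendation):

    jetstream {
        store_dir: "/data/jetstream"
        # How often message blocks are fsync'd to disk (default 2m).
        # Setting it to "always" syncs after every write.
        sync_interval: "2m"
    }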

I created a pprof dump and see these functions at the top: syncBlocks, compact, loadMsgsWithLock:

[screenshot 3]
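
In case someone wants to reproduce the profile: this kind of CPU profile can be captured by enabling the server's profiling port and pulling a profile with the Go tooling (port number and duration below are illustrative):

    # nats-server.conf: expose the built-in pprof endpoints
    prof_port: 65432

    # capture ~30s of CPU profile while a spike is happening
    go tool pprof http://localhost:65432/debug/pprof/profile?seconds=30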

After most of the messages are consumed, these spikes become less visible, so the effect seems to depend on stream size (data size or number of messages).

Expected behavior

We currently run NATS 2.8.4 in production, and it does not show this issue. I know the storage format was changed in 2.10, but we did a fresh installation of NATS without migrating the current production storage.

Server and client version

nats-server: v2.10.16; messages were pushed to the stream by Nats.net 1.1.5.0

Host environment

OracleLinux 9.4, Linux 5.15.0-205.149.5.4.el9uek.x86_64 #2 SMP Wed May 8 15:31:38 PDT 2024 x86_64 x86_64 x86_64 GNU/Linux, nats without containers.

Steps to reproduce

No response

tierpod, Jun 26 '24 11:06

Could you share the backup of the stream with us?

[email protected]

derekcollison, Jun 27 '24 02:06

Sorry, but I'm not allowed to share the backup. I'll try to reproduce it using fake data/workload and let you know.

tierpod, Jun 27 '24 05:06

Deleted Messages: 7849364 <-- Maybe the nats-server is busy deleting messages.

yuzhou-nj, Jul 18 '24 06:07

This is our main assumption as well. It looks like we have quite a busy queue: the producer pushes a lot of messages into it, while several consumers pull messages in batches and ack each message individually. WorkQueue retention was chosen because we don't need to keep messages after they are acked, so we wanted to save filesystem storage space. But a large number of deletes (which cause IO operations and latency) eventually makes NATS compact the filesystem storage (which causes IO operations and latency too).
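
For context, the consume side follows the usual batch-pull-and-ack pattern; a minimal sketch with the Go client (classic JetStream API; the stream name "JOBS", subject "jobs.>", durable "worker" and batch size are placeholders, our real consumer is written with Nats.net):

    package main

    import (
        "log"
        "time"

        "github.com/nats-io/nats.go"
    )

    func main() {
        nc, err := nats.Connect(nats.DefaultURL)
        if err != nil {
            log.Fatal(err)
        }
        defer nc.Close()

        js, err := nc.JetStream()
        if err != nil {
            log.Fatal(err)
        }

        // Durable pull consumer bound to the work-queue stream.
        sub, err := js.PullSubscribe("jobs.>", "worker", nats.BindStream("JOBS"))
        if err != nil {
            log.Fatal(err)
        }

        for {
            // Fetch a batch; a timeout here just means the queue is empty.
            msgs, err := sub.Fetch(100, nats.MaxWait(5*time.Second))
            if err != nil {
                continue
            }
            for _, m := range msgs {
                // ... process m.Data ...
                // Acking a message on a WorkQueue stream also deletes it
                // from the store, which is where the delete load comes from.
                _ = m.Ack()
            }
        }
    }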

tierpod, Jul 18 '24 07:07