
Internal subscription took too long

Open simonhoss opened this issue 1 year ago • 26 comments

Observed behavior

We're using NATS for event sourcing. One thing we discovered is that when we read data from subjects at high frequency we get an "Internal subscription took too long" message. High frequency means we're reading data very often, on the order of milliseconds.

We're also seeing cid:17 - Readloop processing time: 3.5326692s

We're reading messages in the same way as defined here: https://github.com/bruth/rita/blob/1c52b4b6c9cde205b62349217c70008fd7f5fcaf/eventstore.go#L260-L327

I know this is not the best description; is there anything I can monitor to get better insight? Is LastMsg + OrderedConsumer a good way to read the current messages in a stream? In the past we were using FetchNoWait, which produced the same issue.
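
For context, this is roughly the pattern, as a minimal Go sketch with the nats.go jetstream package (not the rita code itself; stream and subject names are placeholders):

    package main

    import (
        "context"
        "fmt"
        "time"

        "github.com/nats-io/nats.go"
        "github.com/nats-io/nats.go/jetstream"
    )

    func main() {
        nc, err := nats.Connect(nats.DefaultURL)
        if err != nil {
            panic(err)
        }
        defer nc.Close()

        js, err := jetstream.New(nc)
        if err != nil {
            panic(err)
        }

        ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
        defer cancel()

        stream, err := js.Stream(ctx, "EVENTS") // placeholder stream name
        if err != nil {
            panic(err)
        }

        // Direct get of the last message on the subject to find the current end of the log.
        last, err := stream.GetLastMsgForSubject(ctx, "orders.123") // placeholder subject
        if err != nil {
            panic(err) // jetstream.ErrMsgNotFound if the subject has no messages
        }
        endSeq := last.Sequence

        // Ordered consumer from a known start sequence, reading up to endSeq.
        cons, err := js.OrderedConsumer(ctx, "EVENTS", jetstream.OrderedConsumerConfig{
            FilterSubjects: []string{"orders.123"},
            DeliverPolicy:  jetstream.DeliverByStartSequencePolicy,
            OptStartSeq:    1,
        })
        if err != nil {
            panic(err)
        }

        it, err := cons.Messages()
        if err != nil {
            panic(err)
        }
        defer it.Stop()

        for {
            msg, err := it.Next()
            if err != nil {
                panic(err)
            }
            meta, err := msg.Metadata()
            if err != nil {
                panic(err)
            }
            fmt.Printf("seq %d: %d bytes\n", meta.Sequence.Stream, len(msg.Data()))
            if meta.Sequence.Stream >= endSeq {
                break
            }
        }
    }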

Expected behavior

NATS should be able to deliver data very quickly to multiple consumers from a given start sequence.

Server and client version

Nats Server: 2.10.9

Host environment

For testing we're inside a k8s cluster. The memory for NATS is limited to 1GB.

Steps to reproduce

No response

simonhoss avatar Jan 25 '24 09:01 simonhoss

There are two possible reasons for this.

  1. Setting up the internal consumer, and its calculation of NumPending, takes a long time.
  2. You have many sources, or some of the sources you do have contain messages that are towards the beginning of the stream.

We have improved 1 and that will be part of the next release, 2.10.10, scheduled for next week. We have improved 2 with respect to lots of streams, but have not solved the issue of walking backwards through the majority of the stream when sourced messages exist early in the stream. We are hoping to have a generalized solution for that for 2.11.

For the fixes above you can test against our nightlies or the 2.10.10 branch to confirm if they help your situation.

derekcollison avatar Jan 25 '24 11:01 derekcollison

Thanks for the response. Number one sounds promising. Right now we have around 25 streams, but the streams can get very big. I will check out the 2.10.10 version.

simonhoss avatar Jan 25 '24 13:01 simonhoss

For the stream that sources from the other 25, what does nats str info report?

derekcollison avatar Jan 25 '24 14:01 derekcollison

We don't have a dependency between the streams. Or do you mean something different?

simonhoss avatar Jan 25 '24 17:01 simonhoss

Apologies, I thought you were doing stream sourcing.

I think top of main, and what will land in 2.10.10, will help in terms of computing NumPending on consumer creation much more efficiently.

derekcollison avatar Jan 25 '24 17:01 derekcollison

No apologies needed!

Thanks for your feedback. I tried 2.10.10-RC.1 and it looks very promising. I ran a long-running test and got no errors concerning slow consumers.

Thanks for your work!!!!

simonhoss avatar Jan 28 '24 19:01 simonhoss

Should be released this week.

derekcollison avatar Jan 28 '24 19:01 derekcollison

Hey

Unfortunately we encountered it again. We're on version 2.10.11.

[1] 2024/03/09 02:50:26.280253 [WRN] Internal subscription on "$JS.API.DIRECT.GET.KV_*****" took too long: 5.325651648s
[1] 2024/03/09 02:50:26.280254 [WRN] Internal subscription on "$JS.API.STREAM.INFO.KV_*****" took too long: 2.56795843s
[1] 2024/03/09 02:50:26.281840 [WRN] Internal subscription on "$JS.API.STREAM.INFO.KV_*****" took too long: 2.570922996s
[1] 2024/03/09 02:50:26.281856 [WRN] Internal subscription on "$JS.API.STREAM.INFO.KV_*****" took too long: 2.570995985s
[1] 2024/03/09 02:50:26.282218 [WRN] 10.1.122.96:51016 - cid:202 - Readloop processing time: 2.571013448s
[1] 2024/03/09 02:50:26.282227 [WRN] 10.1.84.65:40888 - cid:160 - Readloop processing time: 5.329409042s
[1] 2024/03/09 11:29:48.700421 [WRN] Internal subscription on "$JS.API.DIRECT.GET.KV_*****" took too long: 2.062718523s
[1] 2024/03/09 11:29:48.705153 [WRN] 10.1.84.65:51362 - cid:296 - Readloop processing time: 2.069251954s
[1] 2024/03/09 22:33:46.621361 [WRN] Internal subscription on "$JS.API.DIRECT.GET.KV_*****" took too long: 2.177865619s
[1] 2024/03/09 22:33:46.622328 [WRN] 10.1.84.65:51362 - cid:296 - Readloop processing time: 2.177977562s
[1] 2024/03/10 00:39:50.878689 [WRN] Internal subscription on "$JS.API.DIRECT.GET.KV_*****" took too long: 7.667904216s
[1] 2024/03/10 00:39:50.879220 [WRN] 10.1.84.65:51362 - cid:296 - Readloop processing time: 7.66800055s
[1] 2024/03/10 00:39:56.133087 [WRN] Internal subscription on "$JS.API.DIRECT.GET.KV_*****" took too long: 5.231629718s
[1] 2024/03/10 00:39:56.133123 [WRN] 10.1.86.28:48270 - cid:167 - Readloop processing time: 5.231695283s
[1] 2024/03/10 00:40:31.406931 [WRN] Internal subscription on "$JS.API.CONSUMER.CREATE.*****" took too long: 10.333078985s
[1] 2024/03/10 00:40:31.406963 [WRN] Internal subscription on "$JS.API.CONSUMER.CREATE.*****" took too long: 10.333631587s
[1] 2024/03/10 00:40:31.406969 [WRN] Internal subscription on "$JS.API.CONSUMER.CREATE.*****" took too long: 10.333713924s
[1] 2024/03/10 00:40:31.406979 [WRN] 10.1.84.65:46036 - cid:328 - Readloop processing time: 10.333740314s

Is there anything I can do to get deeper insight into why this can happen?

simonhoss avatar Mar 11 '24 08:03 simonhoss

Is the issue easily reproduced with the NATS cli and a direct get? Would you be willing to try top of main or the 2.10.12 release branch and let us know if the issue still presents? If you backup and restore the stream / KV bucket does the issue still present? And if yes to the above, would you be willing to share the backup snapshot with the Synadia team for triage on our side?
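
If it helps, here is a minimal Go sketch of what a direct-get repro could look like (bucket name, key, and threshold are placeholders): it times GetLastMsgForSubject calls against the stream backing a KV bucket, which go over the $JS.API.DIRECT.GET path the warnings mention.

    package main

    import (
        "context"
        "fmt"
        "time"

        "github.com/nats-io/nats.go"
        "github.com/nats-io/nats.go/jetstream"
    )

    func main() {
        nc, err := nats.Connect(nats.DefaultURL)
        if err != nil {
            panic(err)
        }
        defer nc.Close()

        js, err := jetstream.New(nc)
        if err != nil {
            panic(err)
        }

        ctx := context.Background()

        // KV buckets are backed by a stream named KV_<bucket> with AllowDirect set,
        // so GetLastMsgForSubject is served via $JS.API.DIRECT.GET.
        stream, err := js.Stream(ctx, "KV_mybucket") // placeholder bucket name
        if err != nil {
            panic(err)
        }

        for i := 0; i < 100; i++ {
            start := time.Now()
            _, err := stream.GetLastMsgForSubject(ctx, "$KV.mybucket.mykey") // placeholder key
            elapsed := time.Since(start)
            if err != nil {
                fmt.Println("direct get failed:", err)
                continue
            }
            if elapsed > time.Second {
                fmt.Println("slow direct get:", elapsed)
            }
        }
    }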

derekcollison avatar Mar 11 '24 10:03 derekcollison

I had this same problem when uploading multi-GB files to the object store. Upgrading didn't solve it.

paolobarbolini avatar Mar 19 '24 15:03 paolobarbolini

Do you have instructions to allow us to easily recreate on our side?

derekcollison avatar Mar 19 '24 16:03 derekcollison

I will try to put together a reproducible program soon. At the moment I have some other important tasks on my plate.

simonhoss avatar Mar 19 '24 17:03 simonhoss

I'm trying to put something up at https://github.com/paolobarbolini/nats-server-internal-subscription-repro. For now I just got the CLI to choke up with nats: error: nats: stalled with too many outstanding async published messages.
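
For what it's worth, that stall error is the client-side async publish buffer filling up. A rough Go sketch (subject, payload size, and pending limit are made up) of bounding outstanding publishes while loading a stream:

    package main

    import (
        "fmt"
        "time"

        "github.com/nats-io/nats.go"
        "github.com/nats-io/nats.go/jetstream"
    )

    func main() {
        nc, err := nats.Connect(nats.DefaultURL)
        if err != nil {
            panic(err)
        }
        defer nc.Close()

        // Cap the number of in-flight async publishes; PublishAsync stalls briefly
        // and then errors instead of growing the buffer without bound.
        js, err := jetstream.New(nc, jetstream.WithPublishAsyncMaxPending(256))
        if err != nil {
            panic(err)
        }

        payload := make([]byte, 128*1024) // placeholder chunk size
        for i := 0; i < 10_000; i++ {
            if _, err := js.PublishAsync("repro.data", payload); err != nil {
                panic(err)
            }
        }

        // Wait for all outstanding acks before exiting.
        select {
        case <-js.PublishAsyncComplete():
            fmt.Println("all publishes acknowledged")
        case <-time.After(30 * time.Second):
            fmt.Println("timed out waiting for acks")
        }
    }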

paolobarbolini avatar Mar 28 '24 09:03 paolobarbolini

A simple example to show the issue would be great.

If not, we would want to schedule a video call to inspect the system when it encounters these issues.

We would do CPU, memory and stack dumps/profiles.

derekcollison avatar Mar 28 '24 16:03 derekcollison

I've configured GOMEMLIMIT (seen from c8651a520f18cc04ba572d0182afe7e29fb45d41), tried playing around with deleting the stream and recreating it multiple times, and forced new leader elections (via nats stream cluster step-down), and I may have seen a pattern. I'll try including it in my repro.

paolobarbolini avatar Apr 04 '24 13:04 paolobarbolini

We experienced the same issue on 2.10.14 using the official Helm chart. We've set resource limits as follows:

                container:
                    env:
                        # different from k8s units, suffix must be B, KiB, MiB, GiB, or TiB
                        # should be ~90% of memory limit
                        GOMEMLIMIT: 172MiB
                    merge:
                        resources:
                            limits:
                                cpu: '150m'
                                ephemeral-storage: 1Gi
                                memory: 192Mi
                            requests:
                                cpu: '100m'
                                ephemeral-storage: 50Mi
                                memory: 128Mi

I can confirm we had roughly the same logs as in https://github.com/nats-io/nats-server/issues/5000#issuecomment-1987836630

It recovered in about 1 minute, but also triggered some of our alerts.

Should we increase our resource limits?

Edit: we doubled the limits nevertheless.

We've also identified a stream message whose handler is causing a segfault :bow:

ro0NL avatar May 01 '24 08:05 ro0NL

We recommend a minimum of 8GB for production JetStream-enabled servers.

derekcollison avatar May 02 '24 02:05 derekcollison

Hello, has anyone been able to find a solution to this? I have been running into this message when testing nats-server with the nats.net v2 client.

I am publishing 10,000 messages from each of 8 tasks (80,000 messages total) and attempting to subscribe and read after all messages are published.

The subscriber buffer size has been limited to 10 messages.

We're using nats-server v2.10.17; JetStream max memory is 191.76 GB.

Each message is ~0.5MB, so the stream size would be around 40GB. From my understanding, however, the stream size is only limited by server resources, so this should not cause issues.
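
The test itself uses the nats.net v2 client; below is a rough Go analogue (stream and consumer names are placeholders) of the consuming side with the client-side buffer capped at 10 messages, so the ~0.5MB payloads don't pile up in client memory:

    package main

    import (
        "context"
        "fmt"
        "time"

        "github.com/nats-io/nats.go"
        "github.com/nats-io/nats.go/jetstream"
    )

    func main() {
        nc, err := nats.Connect(nats.DefaultURL)
        if err != nil {
            panic(err)
        }
        defer nc.Close()

        js, err := jetstream.New(nc)
        if err != nil {
            panic(err)
        }

        ctx, cancel := context.WithTimeout(context.Background(), time.Minute)
        defer cancel()

        cons, err := js.CreateOrUpdateConsumer(ctx, "BIGDATA", jetstream.ConsumerConfig{
            Durable:   "reader",
            AckPolicy: jetstream.AckExplicitPolicy,
        })
        if err != nil {
            panic(err)
        }

        // Keep at most 10 messages buffered client-side at any time.
        cc, err := cons.Consume(func(msg jetstream.Msg) {
            fmt.Println("got", len(msg.Data()), "bytes")
            msg.Ack()
        }, jetstream.PullMaxMessages(10))
        if err != nil {
            panic(err)
        }
        defer cc.Stop()

        <-ctx.Done()
    }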

arvindsouza avatar Jul 01 '24 06:07 arvindsouza

@arvindsouza we would need more details, such as stream info for the parent stream, how you are consuming the messages, and which subject was reporting the slow internal subscription times.

derekcollison avatar Jul 01 '24 16:07 derekcollison

Have the same issue. I'm sourcing a stream from a handful of leaf nodes. Upon restarting the server, after about 10 minutes, I get all the "Internal subscription ... took too long" errors. However, when I remove the files from the obs dir in the file store, the server starts up without delay. So I think it has something to do with recovering consumers.
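
For reference, a hedged Go sketch (stream names and leaf domains are placeholders, not my actual config) of the kind of setup described, where one aggregate stream on the hub sources from streams living in leaf-node JetStream domains:

    package main

    import (
        "context"
        "time"

        "github.com/nats-io/nats.go"
        "github.com/nats-io/nats.go/jetstream"
    )

    func main() {
        nc, err := nats.Connect(nats.DefaultURL)
        if err != nil {
            panic(err)
        }
        defer nc.Close()

        js, err := jetstream.New(nc)
        if err != nil {
            panic(err)
        }

        ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
        defer cancel()

        // Aggregate stream on the hub that sources from per-leaf streams.
        _, err = js.CreateStream(ctx, jetstream.StreamConfig{
            Name: "AGGREGATE", // placeholder
            Sources: []*jetstream.StreamSource{
                {Name: "EVENTS", Domain: "leaf-1"}, // placeholder leaf domains
                {Name: "EVENTS", Domain: "leaf-2"},
            },
        })
        if err != nil {
            panic(err)
        }
    }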

typecampo avatar Jul 15 '24 23:07 typecampo

Hello, I was testing with the nats.net client and was able to address this by using GetConsumerAsync rather than CreateOrUpdateConsumerAsync.

So: retain the consumers once created, even when the server is shut down, and retrieve the existing consumer(s) when restarting.
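
GetConsumerAsync / CreateOrUpdateConsumerAsync are nats.net calls; the same idea in Go (stream and consumer names are placeholders) is to look the durable consumer up first and only issue a consumer create the first time:

    package main

    import (
        "context"
        "errors"
        "time"

        "github.com/nats-io/nats.go"
        "github.com/nats-io/nats.go/jetstream"
    )

    func getOrCreateConsumer(ctx context.Context, js jetstream.JetStream) (jetstream.Consumer, error) {
        // Fast path: reuse the durable consumer if it already exists.
        cons, err := js.Consumer(ctx, "EVENTS", "reader")
        if err == nil {
            return cons, nil
        }
        if !errors.Is(err, jetstream.ErrConsumerNotFound) {
            return nil, err
        }
        // Only create it the first time.
        return js.CreateOrUpdateConsumer(ctx, "EVENTS", jetstream.ConsumerConfig{
            Durable:   "reader",
            AckPolicy: jetstream.AckExplicitPolicy,
        })
    }

    func main() {
        nc, err := nats.Connect(nats.DefaultURL)
        if err != nil {
            panic(err)
        }
        defer nc.Close()

        js, err := jetstream.New(nc)
        if err != nil {
            panic(err)
        }

        ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
        defer cancel()

        if _, err := getOrCreateConsumer(ctx, js); err != nil {
            panic(err)
        }
    }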

arvindsouza avatar Jul 25 '24 17:07 arvindsouza