nats-server
Internal subscription took too long
Observed behavior
We're using NATS for event sourcing. One thing we discovered is that when reading data from subjects at high frequency we get an "Internal subscription took too long" message. High frequency here means we're reading data very often, at millisecond intervals.
We're also seeing cid:17 - Readloop processing time: 3.5326692s
For reading messages we're reading it in the same way as defined here: https://github.com/bruth/rita/blob/1c52b4b6c9cde205b62349217c70008fd7f5fcaf/eventstore.go#L260-L327
I know this is not the best description; is there anything I can monitor to get better insight? Is LastMsg + OrderedConsumer a good way to read the current messages in a stream? In the past we were using "FetchNoWait", which produced the same issue.
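For what it's worth, here is a minimal Go sketch of that read pattern, using the newer nats.go jetstream package (the stream name "EVENTS" and the start sequence are placeholders, not our real values):

```go
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"github.com/nats-io/nats.go"
	"github.com/nats-io/nats.go/jetstream"
)

func main() {
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Drain()

	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()

	js, err := jetstream.New(nc)
	if err != nil {
		log.Fatal(err)
	}

	stream, err := js.Stream(ctx, "EVENTS") // placeholder stream name
	if err != nil {
		log.Fatal(err)
	}

	// Snapshot the current end of the stream, similar to the
	// GetLastMsg step in the linked eventstore.go.
	info, err := stream.Info(ctx)
	if err != nil {
		log.Fatal(err)
	}
	lastSeq := info.State.LastSeq
	if lastSeq == 0 {
		return // stream is empty, nothing to read
	}

	// Ordered consumer starting from an arbitrary sequence.
	cons, err := stream.OrderedConsumer(ctx, jetstream.OrderedConsumerConfig{
		DeliverPolicy: jetstream.DeliverByStartSequencePolicy,
		OptStartSeq:   1, // placeholder start sequence
	})
	if err != nil {
		log.Fatal(err)
	}

	it, err := cons.Messages()
	if err != nil {
		log.Fatal(err)
	}
	defer it.Stop()

	for {
		msg, err := it.Next()
		if err != nil {
			log.Fatal(err)
		}
		meta, err := msg.Metadata()
		if err != nil {
			log.Fatal(err)
		}
		fmt.Printf("seq %d: %s\n", meta.Sequence.Stream, msg.Subject())
		if meta.Sequence.Stream >= lastSeq {
			break // caught up to the snapshot taken above
		}
	}
}
```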
Expected behavior
NATS should be able to deliver data quickly to multiple consumers from a given start sequence.
Server and client version
NATS Server: 2.10.9
Host environment
For testing we're inside a k8s cluster. The memory for NATS is limited to 1 GB.
Steps to reproduce
No response
There are two possible reasons for this.
- Setting up the internal consumer, and its calculation of NumPending, takes a long time.
- You have many sources, or some of the sources you do have contain messages towards the beginning of the stream.
We have improved 1 and that will be part of the next release, 2.10.10, scheduled for next week. We have improved 2 with respect to lots of streams, but have not solved the issue of walking backwards through the majority of the stream when sourced messages exist early in the stream. We are hoping to have a generalized solution for that in 2.11.
For the fixes above you can test against our nightlies or the 2.10.10 branch to confirm if they help your situation.
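If you want to see where the time goes, something like the sketch below times consumer creation, which is where NumPending gets computed. It assumes a jetstream.Stream handle from the nats.go client and a placeholder start sequence (imports: context, log, time, and nats.go/jetstream):

```go
// timedConsumerCreate measures how long the server takes to set up an
// ordered consumer for a given start sequence, and prints the NumPending
// it had to compute as part of that setup.
func timedConsumerCreate(ctx context.Context, stream jetstream.Stream, startSeq uint64) error {
	start := time.Now()
	cons, err := stream.OrderedConsumer(ctx, jetstream.OrderedConsumerConfig{
		DeliverPolicy: jetstream.DeliverByStartSequencePolicy,
		OptStartSeq:   startSeq,
	})
	if err != nil {
		return err
	}
	elapsed := time.Since(start)

	info, err := cons.Info(ctx)
	if err != nil {
		return err
	}
	log.Printf("consumer created in %s, NumPending=%d", elapsed, info.NumPending)
	return nil
}
```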
Thanks for the response. Number one sounds promising. We have around 25 streams right now, but the streams can get very big. I will check out the 2.10.10 version.
For the stream that sources from the other 25, what does nats str info report?
We don't have a dependency between the streams. Or do you mean something different?
Apologies I thought you were doing stream sourcing.
I think top of main, and what will land in 2.10.10, will help in terms of computing NumPending on consumer creation much more efficiently.
No apologies needed!
Thanks for your feedback. I tried 2.10.10-RC.1 and it looks very promising. I ran a long-running test and got no errors concerning slow consumers.
Thanks for your work!!!!
Should be released this week.
Hey
Unfortunately we encountered it again. We're on version 2.10.11.
[1] 2024/03/09 02:50:26.280253 [WRN] Internal subscription on "$JS.API.DIRECT.GET.KV_*****" took too long: 5.325651648s
[1] 2024/03/09 02:50:26.280254 [WRN] Internal subscription on "$JS.API.STREAM.INFO.KV_*****" took too long: 2.56795843s
[1] 2024/03/09 02:50:26.281840 [WRN] Internal subscription on "$JS.API.STREAM.INFO.KV_*****" took too long: 2.570922996s
[1] 2024/03/09 02:50:26.281856 [WRN] Internal subscription on "$JS.API.STREAM.INFO.KV_*****" took too long: 2.570995985s
[1] 2024/03/09 02:50:26.282218 [WRN] 10.1.122.96:51016 - cid:202 - Readloop processing time: 2.571013448s
[1] 2024/03/09 02:50:26.282227 [WRN] 10.1.84.65:40888 - cid:160 - Readloop processing time: 5.329409042s
[1] 2024/03/09 11:29:48.700421 [WRN] Internal subscription on "$JS.API.DIRECT.GET.KV_*****" took too long: 2.062718523s
[1] 2024/03/09 11:29:48.705153 [WRN] 10.1.84.65:51362 - cid:296 - Readloop processing time: 2.069251954s
[1] 2024/03/09 22:33:46.621361 [WRN] Internal subscription on "$JS.API.DIRECT.GET.KV_*****" took too long: 2.177865619s
[1] 2024/03/09 22:33:46.622328 [WRN] 10.1.84.65:51362 - cid:296 - Readloop processing time: 2.177977562s
[1] 2024/03/10 00:39:50.878689 [WRN] Internal subscription on "$JS.API.DIRECT.GET.KV_*****" took too long: 7.667904216s
[1] 2024/03/10 00:39:50.879220 [WRN] 10.1.84.65:51362 - cid:296 - Readloop processing time: 7.66800055s
[1] 2024/03/10 00:39:56.133087 [WRN] Internal subscription on "$JS.API.DIRECT.GET.KV_*****" took too long: 5.231629718s
[1] 2024/03/10 00:39:56.133123 [WRN] 10.1.86.28:48270 - cid:167 - Readloop processing time: 5.231695283s
[1] 2024/03/10 00:40:31.406931 [WRN] Internal subscription on "$JS.API.CONSUMER.CREATE.*****" took too long: 10.333078985s
[1] 2024/03/10 00:40:31.406963 [WRN] Internal subscription on "$JS.API.CONSUMER.CREATE.*****" took too long: 10.333631587s
[1] 2024/03/10 00:40:31.406969 [WRN] Internal subscription on "$JS.API.CONSUMER.CREATE.*****" took too long: 10.333713924s
[1] 2024/03/10 00:40:31.406979 [WRN] 10.1.84.65:46036 - cid:328 - Readloop processing time: 10.333740314s
Is there anything I can do to get deeper insight into why this can happen?
Is the issue easily reproduced with the NATS cli and a direct get? Would you be willing to try top of main or the 2.10.12 release branch and let us know if the issue still presents? If you backup and restore the stream / KV bucket does the issue still present? And if yes to the above, would you be willing to share the backup snapshot with the Synadia team for triage on our side?
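Something like this rough Go sketch could also serve as a repro harness for the direct get path (bucket and key names are placeholders): it hammers KV gets, which go through $JS.API.DIRECT.GET when direct get is enabled on the bucket, and logs any call that takes longer than a second.

```go
package main

import (
	"context"
	"log"
	"time"

	"github.com/nats-io/nats.go"
	"github.com/nats-io/nats.go/jetstream"
)

func main() {
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Drain()

	js, err := jetstream.New(nc)
	if err != nil {
		log.Fatal(err)
	}

	ctx := context.Background()
	kv, err := js.KeyValue(ctx, "my-bucket") // placeholder bucket name
	if err != nil {
		log.Fatal(err)
	}

	for i := 0; ; i++ {
		start := time.Now()
		if _, err := kv.Get(ctx, "my-key"); err != nil { // placeholder key
			log.Printf("get %d failed: %v", i, err)
			continue
		}
		if elapsed := time.Since(start); elapsed > time.Second {
			log.Printf("get %d took %s", i, elapsed)
		}
	}
}
```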
I had this same problem when uploading multi-GB files to the object store. Upgrading didn't solve it.
Do you have instructions to allow us to easily recreate on our side?
I will try to put together a reproducible program soon. At the moment I have some other important tasks on my plate.
I'm trying to put something up at https://github.com/paolobarbolini/nats-server-internal-subscription-repro. For now I've just gotten the CLI to choke up with nats: error: nats: stalled with too many outstanding async published messages.
A simple example showing the issue would be great.
If not, we would want to schedule a video call to inspect the system when it encounters these issues.
We would take CPU, memory, and stack dumps/profiles.
I've configured GOMEMLIMIT (seen from c8651a520f18cc04ba572d0182afe7e29fb45d41) and tried playing around with deleting the stream and recreating it multiple times and forcing new leader elections (via nats stream cluster step-down), and I may have seen a pattern. I'll try including it in my repro.
We experienced the same issue on 2.10.14 using the official Helm chart. We've set resource limits as follows:
container:
  env:
    # different from k8s units, suffix must be B, KiB, MiB, GiB, or TiB
    # should be ~90% of memory limit
    GOMEMLIMIT: 172MiB
  merge:
    resources:
      limits:
        cpu: '150m'
        ephemeral-storage: 1Gi
        memory: 192Mi
      requests:
        cpu: '100m'
        ephemeral-storage: 50Mi
        memory: 128Mi
I can confirm we had roughly the same logs as in https://github.com/nats-io/nats-server/issues/5000#issuecomment-1987836630.
It recovered in about 1 minute, but it also triggered some of our alerts.
Should we increase our resource limits?
Edit: doubled the limits nevertheless.
We've also identified a stream message whose handler is causing a segfault :bow:
We recommend a minimum of 8 GB for production JetStream-enabled servers.
Hello, has anyone been able to find a solution to this? I have been running into this message when testing the nats-server with nats.net v2.
I am publishing 10,000 messages from each of 8 tasks (80,000 messages total) and attempting to subscribe and read after all messages are published.
The subscriber buffer size has been limited to 10 messages.
Using NATS 2.10.17; JetStream max memory is 191.76 GB.
Each message is ~0.5 MB, so the stream size would be around 40 GB. From my understanding, however, the stream size is only limited by server resources, so this should not cause issues.
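In Go terms, the load is roughly this sketch (the real test uses nats.net v2; the subject name, stream setup, and in-flight limit here are made up for illustration):

```go
package main

import (
	"bytes"
	"log"
	"sync"

	"github.com/nats-io/nats.go"
	"github.com/nats-io/nats.go/jetstream"
)

func main() {
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Drain()

	// WithPublishAsyncMaxPending caps how many publishes can be in flight at once.
	js, err := jetstream.New(nc, jetstream.WithPublishAsyncMaxPending(256))
	if err != nil {
		log.Fatal(err)
	}

	payload := bytes.Repeat([]byte("x"), 512*1024) // ~0.5 MB per message

	var wg sync.WaitGroup
	for task := 0; task < 8; task++ {
		wg.Add(1)
		go func(task int) {
			defer wg.Done()
			for i := 0; i < 10000; i++ {
				// "test.data" is a placeholder subject bound to the stream under test.
				if _, err := js.PublishAsync("test.data", payload); err != nil {
					log.Printf("task %d publish %d: %v", task, i, err)
				}
			}
		}(task)
	}
	wg.Wait()
	<-js.PublishAsyncComplete() // wait for server acks before exiting
}
```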
@arvindsouza We would need more details, such as stream info for the parent stream, how you are consuming the messages, and which subject was reporting slow internal subscribe times.
Have the same issue. I'm sourcing a stream from a handful of leaf nodes. Upon restarting the server, after about 10 minutes, I get all the "Internal subscription ... took too long" errors. However, when I remove the files from the obs dir in the file store, the server starts up without delay. So I think it has something to do with recovering consumers.
Hello, I was testing with the nats.net client and was able to address this by using GetConsumerAsync rather than CreateOrUpdateConsumerAsync.
So retain the consumers once created, even when the server is shut down. When restarting, retrieve the existing consumer(s).
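A rough Go equivalent of that pattern, in case it helps others here (stream/consumer names are placeholders; imports: context and nats.go/jetstream):

```go
// attachConsumer creates the durable consumer once, then simply looks it up
// on subsequent runs instead of re-issuing $JS.API.CONSUMER.CREATE on every
// startup.
func attachConsumer(ctx context.Context, js jetstream.JetStream) (jetstream.Consumer, error) {
	// Fast path: the consumer already exists, just retrieve it.
	cons, err := js.Consumer(ctx, "EVENTS", "my-durable")
	if err == nil {
		return cons, nil
	}
	// First run (or consumer was deleted): create it.
	return js.CreateOrUpdateConsumer(ctx, "EVENTS", jetstream.ConsumerConfig{
		Durable:   "my-durable",
		AckPolicy: jetstream.AckExplicitPolicy,
	})
}
```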