nats-streaming-server

Cluster performance degradation over time

Open · qtheya opened this issue on Nov 21 '19 · 12 comments

Hello. We have:

Setup 1: a single instance of nats-streaming in k8s on c5.2xlarge, in-memory store

writeDeadline: "2s"
maxPayload: 268435456
maxConnections: 500
maxSubscriptions: 100000
maxPending: 268435456 
maxControlLine: 4096
+
    store_limits: {
      max_channels: 25000
      max_msgs: 0
      max_bytes: 0
      max_subs: 0
      max_age: "2m"
      max_inactivity: "1m"
      fds_limit: "1000000"
    }

(screenshots: Setup 1 metrics)

Setup 2: nats-operator, 3 nodes, and nats-streaming-operator, 3 nodes, in k8s on c5.2xlarge; filestore mounted on emptyDir.medium: Memory (tmpfs)

writeDeadline: "2s"
maxPayload: 268435456
maxConnections: 500
maxSubscriptions: 100000
maxPending: 268435456 
maxControlLine: 4096
+
streaming: {
    store_limits: {
        max_channels: 25000
        max_msgs: 0
        max_bytes: 0
        max_subs: 0
        max_age: "2m"
        max_inactivity: "1m"
        fds_limit: "1000000"
    }

    cluster: {
        sync: false
    }

    file_options: {
        sync_on_flush: false
        auto_sync: "0"
    }
}

(screenshots: Setup 2 metrics)

In top, CPU usage was 400-500%.

Our usual load looks like this: (screenshot attached)

k8s nodes have antiAffinity and podAntiAffinity too

Questions:

  0. As you can see, performance in cluster mode is fine for some period after the cluster starts, but over time it starts "eating" CPU and, as a result, ack time increases. Do you have any idea why this happens? Our single-node setup in the same environment has worked like a charm for 3 months.
  1. Can you please advise some configuration options to prevent the cluster performance degradation?
  2. Why is the difference between a single node and the 3x cluster so big? Is it because of Raft?

Thnx

qtheya · Nov 21 '19

Do me a favor and look in the log for notices similar to:

Deleting raft logs from

and

Deletion took 

If the deletion took more than 2 seconds, it would be reported as an error; below that, as a notice.

I am wondering if it is the RAFT log compaction that is starting to degrade the performance. Based on that, we will see what the next step is.

kozlovic · Nov 21 '19

Hello. Thank you for the quick reply.

2019-11-21 17:09:48	
[1] 2019/11/21 15:09:48.621200 [INF] STREAM: Deleting raft logs from 7293537 to 7458159 
2019-11-21 17:09:38	
[1] 2019/11/21 15:09:38.223431 [INF] STREAM: Deletion took 363.947021ms 
2019-11-21 17:09:37	
[1] 2019/11/21 15:09:37.859464 [INF] STREAM: Deleting raft logs from 7186303 to 7444561 
2019-11-21 17:07:33	
[1] 2019/11/21 15:07:33.936883 [INF] STREAM: Deletion took 169.862979ms 
2019-11-21 17:07:33	
[1] 2019/11/21 15:07:33.767010 [INF] STREAM: Deleting raft logs from 7146728 to 7297456 
2019-11-21 17:07:29	
[1] 2019/11/21 15:07:29.776540 [INF] STREAM: Deletion took 409.790597ms 
2019-11-21 17:07:29	
[1] 2019/11/21 15:07:29.366740 [INF] STREAM: Deleting raft logs from 7032943 to 7293536 
2019-11-21 17:06:03	
[1] 2019/11/21 15:06:03.671335 [INF] STREAM: Deletion took 270.357439ms 
2019-11-21 17:06:03	
[1] 2019/11/21 15:06:03.400967 [INF] STREAM: Deleting raft logs from 6977609 to 7186302 
2019-11-21 17:05:32	
[1] 2019/11/21 15:05:32.864749 [INF] STREAM: Deletion took 309.693593ms 
2019-11-21 17:05:32	
[1] 2019/11/21 15:05:32.555042 [INF] STREAM: Deleting raft logs from 6888863 to 7146727 
2019-11-21 17:03:48	
[1] 2019/11/21 15:03:48.992104 [INF] STREAM: Deletion took 286.972611ms 
2019-11-21 17:03:48	
[1] 2019/11/21 15:03:48.705118 [INF] STREAM: Deleting raft logs from 6817547 to 7032942 
2019-11-21 17:03:04	
[1] 2019/11/21 15:03:04.681439 [INF] STREAM: Deletion took 312.350781ms 
2019-11-21 17:03:04	
[1] 2019/11/21 15:03:04.369079 [INF] STREAM: Deleting raft logs from 6760781 to 6977608 
2019-11-21 17:02:01	
[1] 2019/11/21 15:02:01.175832 [INF] STREAM: Deletion took 181.40984ms 
2019-11-21 17:02:00	
[1] 2019/11/21 15:02:00.994410 [INF] STREAM: Deleting raft logs from 6731069 to 6888862 
2019-11-21 17:00:54	
[1] 2019/11/21 15:00:54.802829 [INF] STREAM: Deletion took 263.82555ms 
2019-11-21 17:00:54	
[1] 2019/11/21 15:00:54.538995 [INF] STREAM: Deleting raft logs from 6626658 to 6817546 
2019-11-21 17:00:10	
[1] 2019/11/21 15:00:10.653550 [INF] STREAM: Deletion took 156.644787ms 

qtheya · Nov 21 '19

So that's not it..

  • I see that you set your max payload to 256MB... are you really sending messages that big?
  • Have you just started using cluster mode? (You say that standalone has been working for 3 months without issues.)
  • What is your usage pattern? I see that you have a high channel limit with very low max_age and max_inactivity.

kozlovic · Nov 21 '19

Let me answer that (I'm working with @qtheya):

  • No, we do not need messages that big; our messages are mostly under 1 kB.
  • We had serious issues with cluster mode initially, and after multiple days of suffering we decided to try single-node mode, which has worked wonderfully for over 3 months. However, we eventually want HA, so we started experimenting again with newer versions of NATS and NATS Streaming, and we still have the same issue: the cluster becomes very slow under load, to the point of being unusable.
  • Here is our use case (for this specific instance):
    • we are streaming updates from multiple sources for users
    • updates are sent periodically, and each subsequent update of the same type overrides the previous one
    • those updates are only meaningful for a short period of time, and only if the user is actually checking them
    • by subscribing on user request with a negative time offset that is larger than the update interval, we can fetch enough updates to recreate the effective current state without having to store it anywhere explicitly, as well as get new updates as they come in (a rough sketch of this pattern is shown below)
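
For illustration, a rough sketch of that subscription pattern with the stan.go client; the cluster/client IDs, NATS URL, subject and the 30-second offset are placeholders, not our real values:

    package main

    import (
        "log"
        "time"

        "github.com/nats-io/stan.go"
    )

    func main() {
        sc, err := stan.Connect("test-cluster", "example-client",
            stan.NatsURL("nats://localhost:4222"))
        if err != nil {
            log.Fatal(err)
        }
        defer sc.Close()

        // Subscribe with a negative time offset larger than the update interval:
        // the server redelivers the recent messages, which is enough to rebuild
        // the effective current state, and new updates keep arriving on the same
        // subscription.
        sub, err := sc.Subscribe("updates.user.123", func(m *stan.Msg) {
            log.Printf("update: %s", m.Data)
        }, stan.StartAtTimeDelta(30*time.Second))
        if err != nil {
            log.Fatal(err)
        }
        defer sub.Unsubscribe()

        // Keep the subscription open only while the user is watching.
        time.Sleep(time.Minute)
    }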

nazar-pc · Nov 21 '19

For your use case, I wonder if a key-value store would not be a better fit.

Regardless, since you use a memory store (we don't support this at the configuration level, but you bypass that by using tmpfs for the filestore), I see that you are not defining the node IDs and the list of peers, so how do you initialize the cluster? Without naming the node and its peers, the only way it can work is to "bootstrap" it, but this is very dangerous if all the nodes pass the bootstrap flag and the nodes possibly restart without their state. Furthermore, there is no way in NATS Streaming to remove a node from the cluster, so the fact that you don't name the nodes (and list their peers) and that they have ephemeral storage is a very bad combo. If a leader has been running long enough and followers have come and gone, the leader will still try to replicate data to all the nodes it knew about...

kozlovic · Nov 21 '19

A key-value store doesn't work for this use case, while NATS Streaming fits really nicely with minimal effort.

I believe we are using https://github.com/nats-io/nats-operator and https://github.com/nats-io/nats-streaming-operator with the number of nodes in the cluster set to 3 for both.

But our primary issue is not with cluster survival (though we did have many spot instances leaving during the day, and the cluster seemed to survive them), but with the performance we get out of it.

We resorted to tmpfs primarily because the I/O load on the disk subsystem was very high and caused all kinds of issues for the nodes in our cluster when we tried it. tmpfs was the next logical option given our use case. We thought the disk may have been the bottleneck causing the performance degradation, but it didn't help (much?).

nazar-pc · Nov 21 '19

> But our primary issue is not with cluster survival (though we did have many spot instances leaving during the day, and the cluster seemed to survive them), but with the performance we get out of it.

But my point is that it may be that if you have several nodes being brought down and up again, but every time with different names, and the leader is still alive, it will try to replicate to the previously known nodes. I need to check with Wally how the operator works in terms of node ID/peers specification (from your config, it seems that you don't specify any).

kozlovic · Nov 21 '19

> > But our primary issue is not with cluster survival (though we did have many spot instances leaving during the day, and the cluster seemed to survive them), but with the performance we get out of it.
>
> But my point is that it may be that if you have several nodes being brought down and up again, but every time with different names, and the leader is still alive, it will try to replicate to the previously known nodes. I need to check with Wally how the operator works in terms of node ID/peers specification (from your config, it seems that you don't specify any).

nats-operator and nats-streaming-operator bring the nodes up as a StatefulSet, so the names are always the same, i.e.:

  • nats-1
  • nats-2
  • nats-3

and

  • nats-streaming-1
  • nats-streaming-2
  • nats-streaming-3

Both operators were designed before NATS was embedded into nats-streaming, therefore the k8s service on port 4222 that clients connect to points to nats, not to nats-streaming. The problem is not in the naming, because the performance problems appear before any restarts, rescheduling, etc. The problem looks like this: the cluster works fine for ~10m-1h, then ack time increases, then after some additional period it increases again, then again, and so on, without any changes at the cluster level.

qtheya · Nov 22 '19

Sorry for the delay. So if I wanted to reproduce:

  • how many channels are being created?
  • what is the rate of messages (how many messages/sec per channel)?
  • are channels being removed often due to inactivity?
  • .. anything you think may help reproduce.

Again, I want to insist on the cluster deployment point. Even if this is not the problem at hand: yes, the operator is deployed as a statefulset, but those pod names are not the streaming node IDs. That means a random node ID is generated when the cluster is initially formed, and if a node goes away (and since you don't use "persistent" storage), then even if the pod restarts as, say, "nats-streaming-2", its configuration does not have a cluster node ID specified and there is no store, so it will start as a fresh new node with a new unique node ID. Again, the leader will keep the old cluster node ID in its list of peers and will continuously try to replicate to it. Something to keep in mind...
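
For illustration, explicitly naming the nodes would look roughly like this in each node's streaming configuration, using the cluster block's node_id and peers options (the values here simply reuse the statefulset pod names; each node sets its own node_id and lists the other two as peers):

    streaming: {
        cluster: {
            node_id: "nats-streaming-1"
            peers: ["nats-streaming-2", "nats-streaming-3"]
        }
    }

With named nodes and a stable store, a restarted pod rejoins as the same peer instead of showing up as a brand new node that the leader keeps trying to replicate to alongside the old one.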

kozlovic · Nov 25 '19

Hello. Here is info about our load and channels for the last 2 days on the single-node nats-streaming (screenshots attached for channel creation, channel removals, msgs in/out, and active subs). If you need more information, just let me know. The biggest channel takes around 40% of all traffic, I think. There are ~75 connections to nats-streaming. We tried (because we want it) to use "persistent" storage, but with our load each node needs 1.5k-2k IOPS, which is a very expensive pleasure on AWS :(

Publisher connection options: stan.MaxPubAcksInflight(2048), stan.PubAckWait(60*time.Second), stan.Pings(30, 5)
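
For reference, a rough sketch of such a publisher connection in stan.go; the cluster/client IDs, NATS URL and subject are placeholders:

    package main

    import (
        "log"
        "time"

        "github.com/nats-io/stan.go"
    )

    func main() {
        // Connect with the publisher options listed above: up to 2048
        // unacknowledged publishes in flight, a 60s ack wait, and pings
        // every 30s with 5 allowed misses.
        sc, err := stan.Connect("test-cluster", "publisher-1",
            stan.NatsURL("nats://localhost:4222"),
            stan.MaxPubAcksInflight(2048),
            stan.PubAckWait(60*time.Second),
            stan.Pings(30, 5))
        if err != nil {
            log.Fatal(err)
        }
        defer sc.Close()

        // Async publish: the handler fires when the server acks (or fails
        // to ack) the message.
        if _, err := sc.PublishAsync("updates.user.123", []byte("payload"),
            func(guid string, err error) {
                if err != nil {
                    log.Printf("ack failed for %s: %v", guid, err)
                }
            }); err != nil {
            log.Fatal(err)
        }

        // Give the ack a moment to arrive in this toy example.
        time.Sleep(time.Second)
    }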

qtheya · Nov 27 '19

But that's single node, and I assume this is low-level NATS messages in/out (since we don't have this metric in Streaming monitoring). In the cluster, in/out will be way higher, since messages (and metadata: connection/subscription create/close, msg X sent to subscription Y, subscription Y acking msg X, etc.) are replicated to all nodes, and on top of that there are the RAFT protocol messages (quite a lot).

Should I assume that the load, etc., would be the same in cluster mode?

So if I wanted to simulate the load, what would I have to do? You gave me information about the number of connections and the options to use for the pubs, but what should I do?

  • Should I create 75 connections? Are they long-lived, or do they come and go, and if so at what frequency?
  • I see the number of subscriptions goes from a low of 5000 to a high of 8500... are they spread across those 75 connections? Do they start and then go away after a few seconds or minutes? Are they durable or not, queue or not?
  • Same for publishing: are messages published to new channels or existing ones? What is the rate, etc.?

There is a lot I am missing to even try to reproduce the behavior you are experiencing. If you were running NATS Streaming with the embedded NATS Server, there would be the option to enable profiling (-profile <port>), which allows using the Go tooling to capture a profile (CPU or memory) when a server is experiencing that high CPU usage. That may have given a clue about what it is doing at the time. What you can do, however, is define a monitoring port in the streaming config file (though I think the operator does this by default from command-line params, and the port is 8222). This will allow you to capture a stack trace of the process when you see a node experiencing the high CPU usage/delayed ack time. The endpoint is http://<streaming server host>:<streaming server monitoring port>/stacksz. Again, this has to be the streaming server's monitoring port (since you use a dedicated NATS Server cluster).

kozlovic · Nov 27 '19

Hi. Thank you for the explanation. Here are metrics from the 3x cluster at the same point in time as the screenshots in the first message of this issue (msgs in/out screenshot attached). There are 75 long-lived connections. Subscriptions are not durable. We have around 25 qgroups, and subscriptions are divided between the connections. The number of channels created + deleted per minute is ~250. Publishers send ~3500 msgs/sec. Most channels stay alive for hours; channels that close within seconds/minutes are the exception.
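
For reference, a rough sketch of one of those non-durable queue subscriptions in stan.go; the cluster/client IDs, subject, queue group name and offset are placeholders:

    package main

    import (
        "log"
        "time"

        "github.com/nats-io/stan.go"
    )

    func main() {
        sc, err := stan.Connect("test-cluster", "worker-1",
            stan.NatsURL("nats://localhost:4222"))
        if err != nil {
            log.Fatal(err)
        }
        defer sc.Close()

        // Non-durable queue subscription: messages on the channel are load
        // balanced across all members of the "workers" queue group.
        sub, err := sc.QueueSubscribe("updates.source-a", "workers",
            func(m *stan.Msg) {
                log.Printf("update: %s", m.Data)
            },
            stan.StartAtTimeDelta(30*time.Second))
        if err != nil {
            log.Fatal(err)
        }
        defer sub.Unsubscribe()

        time.Sleep(time.Minute)
    }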

qtheya · Nov 28 '19