
NATS Streaming startup takes time

Kazylla opened this issue 5 years ago • 9 comments

Hi

I am trying to configure a NATS Streaming cluster. After sending 400k msg/sec to 3 nodes for 30 minutes and then restarting NATS Streaming, it took about 30 minutes to start up. During startup, the data volume /data was being read at approximately 2.5k ops/sec. I traced the process and found that the following call in server/raft_log.go (decodeRaftLog) is executed repeatedly:

err := dec.Decode(log)
  • Is there a way to reduce the time taken for this process?
  • Is there an alternative?
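
For reference, the dec.Decode(log) call above sits in a decode helper that looks roughly like the sketch below (my simplification, assuming the hashicorp go-msgpack codec is what turns each stored entry back into a raft.Log; this is not the exact server code):

package main

import (
	"github.com/hashicorp/go-msgpack/codec"
	"github.com/hashicorp/raft"
)

// One msgpack decode runs for every entry still present in the RAFT log,
// which is why startup time grows with the size of the log on disk.
func decodeRaftLog(buf []byte, log *raft.Log) error {
	dec := codec.NewDecoderBytes(buf, &codec.MsgpackHandle{})
	return dec.Decode(log)
}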

my conf for one of the nodes is:

listen: 0.0.0.0:4222
http: 8222

# Authorization for client connections
authorization {
  user: "xxx"
  password: "xxx"
  timeout: 0
}

# Cluster definition
cluster {
  listen: 0.0.0.0:5222

  authorization {
    user: "xxx"
    password: "xxx"
    timeout: 0
  }

  routes = [
    "nats-route://xxx:xxx@queue01:5222"
    "nats-route://xxx:xxx@queue02:5222"
    "nats-route://xxx:xxx@queue03:5222"
  ]
}

# logging options
debug: false
trace: false
logtime: true
log_file: ""

# Some system overrides
max_connections: 0
max_control_line: 0
max_payload: 0

# Duration the server can block on a socket write to a client.  Exceeding the
# deadline will designate a client as a slow consumer.
write_deadline: "2s"

# NATS Streaming specific configuration
streaming {
  id: "xxx"
  store: "file"
  dir: "/data/stan/store"
  sd: false
  sv: false

  cluster {
    node_id: "queue01"

    peers: [
      "queue01"
      "queue02"
      "queue03"
    ]
  }

  file_options: {
    compact: false
  }
}

Sorry, I'm not good at English

Kazylla avatar Apr 11 '19 02:04 Kazylla

@Kazylla Unfortunately, RAFT replays the whole log, so that can't be avoided. We can look at whether it can be optimized, but I am not sure. Also note that if there had been snapshots, that should have reduced the size of the RAFT log, and therefore the number of calls to decode. You could try experimenting with trailing_logs and log_snapshots: the former is how many entries to keep in the RAFT log after a snapshot, the latter is the number of snapshots to possibly recover from on restart. You could try setting trailing_logs: 1000 and log_snapshots: 1 and see if it helps. Note that you should see tracing about "Deleting raft logs". If you don't, it means that the server did not have a chance to take a snapshot (RAFT does not snapshot too often; it is configured to do so no more than once every 2 minutes).
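
For example, something like this in the streaming cluster block of your config (using the values suggested above as a starting point; tune them for your workload):

streaming {
  cluster {
    node_id: "queue01"
    peers: ["queue01", "queue02", "queue03"]

    # Keep only 1000 entries in the RAFT log after a snapshot
    trailing_logs: 1000
    # Keep a single snapshot to possibly recover from on restart
    log_snapshots: 1
  }
}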

PS: Your English is excellent!

kozlovic avatar Apr 12 '19 01:04 kozlovic

I'm sorry for replying late.

I tried it, but it was not effective in my use case. Startup still takes anywhere from a few minutes to more than 10 minutes.

In addition, we decided that we could not use this approach due to the following problems:

  • With continued execution, memory usage keeps growing and eventually runs out
  • While the following log is being output, requests are left waiting for responses, and processing on the client side is delayed:
[7346] 2019/04/25 08:32:18.887420 [INF] STREAM: Deleting raft logs from 1 to 10540
[7346] 2019/04/25 08:32:28.399631 [ERR] STREAM: Deletion took 9.512200902s

Please let me know if there is a solution to these problems.

The message processing capacity I measured on EC2 instances is as follows (detailed conditions omitted):

  • standalone (c5.xlarge, EBS 1080 IOPS): 700 clients * 4000 msg/sec
  • FT (c5.xlarge, EFS General Purpose / Provisioned 1024 MiB/sec): 255 clients * 4000 msg/sec
  • clustering (3 nodes, c5.xlarge, EBS 1080 IOPS): 150 clients * 4000 msg/sec

Kazylla avatar Apr 26 '19 03:04 Kazylla

Unfortunately, we use Hashicorp RAFT and their go-msgpack codec. I need to issue a release today, so I will try to take another look at this after that, but I don't think we can improve much.

We are working on another project to "replace" NATS Streaming, called JetStream, which hopefully will not have these dependencies. However, there is no timeline for when it will be available.

kozlovic avatar Apr 26 '19 20:04 kozlovic

@Kazylla Actually, I wonder if I could replace the use of go-msgpack with protobuf. It seems that it would be much faster. That would not work with an existing raft store, but I could see adding a config option to select which encoding/decoding to use. I will need to experiment a bit more and see if this is a viable approach. Stay tuned.

kozlovic avatar Apr 27 '19 02:04 kozlovic

After further investigation, I am not sure this can be improved. What happens is that when the RAFT node is created, the RAFT library goes through ALL logs looking for entries of the LogConfiguration type. What makes it slow is that a raft.Log instance is created and decoded for every entry, even when it is not the type the library is looking for.

So I hacked that to use my own simple binary encoding and modified RAFT to let the store implementation know which types it is looking for, so during the NewRaft() phase I end up decoding only configuration logs. This obviously speeds up the startup of a node, but then, when a leader is elected, it needs to go through ALL logs again to replay them to the Finite State Machine (which in this case is the Streaming server code). This means that the long delay now shows up when a server is elected leader and performs its leader actions. And to conclude, what is slow is copying the bytes of the Data field of the raft.Log object (a raft.Log has Index uint64, Term uint64, a LogType byte, and Data, a byte array).
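
To give an idea of what such a "simple binary encoding" could look like, here is an illustrative sketch (a fixed-size header plus the raw data; this is an assumption about the shape of the hack, not the actual code):

package main

import (
	"encoding/binary"

	"github.com/hashicorp/raft"
)

// encodeLog writes a fixed 17-byte header (index, term, type) followed by
// the raw data, so the header can be read back without a full msgpack decode.
func encodeLog(log *raft.Log) []byte {
	buf := make([]byte, 17+len(log.Data))
	binary.BigEndian.PutUint64(buf[0:8], log.Index)
	binary.BigEndian.PutUint64(buf[8:16], log.Term)
	buf[16] = byte(log.Type)
	copy(buf[17:], log.Data)
	return buf
}

// decodeLog is the inverse; even with this cheap header decode, copying
// the Data bytes remains the dominant cost when the log holds many messages.
func decodeLog(buf []byte, log *raft.Log) {
	log.Index = binary.BigEndian.Uint64(buf[0:8])
	log.Term = binary.BigEndian.Uint64(buf[8:16])
	log.Type = raft.LogType(buf[16])
	log.Data = append([]byte(nil), buf[17:]...)
}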

The only hope would be a way to aggressively snapshot the RAFT log so that we remove the Streaming messages that are stored there during normal replication. The messages are also stored in Streaming's actual stores, and during a snapshot we simply record in the RAFT snapshot the first/last index of a given channel. This then allows RAFT to delete some entries from its log.
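
Conceptually, the per-channel part of such a snapshot only needs to carry those sequence boundaries, roughly like the sketch below (the names are illustrative, not the actual snapshot format):

package main

// channelSnapshot is what a RAFT snapshot conceptually records per channel:
// the message payloads themselves stay in the channel's own store, so the
// corresponding entries can be dropped from the RAFT log.
type channelSnapshot struct {
	Channel string // channel name
	First   uint64 // first message sequence still in the channel's store
	Last    uint64 // last message sequence in the channel's store
}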

I will continue poking around, but at this time, not sure there is an easy answer.

kozlovic avatar Apr 27 '19 18:04 kozlovic

I did another quick and dirty hack where I snapshot every second, and that helps a lot... again, I need to think more about this and try a few things.

kozlovic avatar Apr 27 '19 19:04 kozlovic

Hello, any news on this? We are seeing the same issue. Our NATS Streaming nodes are persisting millions of messages in Postgres databases, resulting in a size of 32 GB for each of the databases. Today we had to restart some nodes and the startup took more than 1 hour. The log only showed the entry "Recovering the state..." Is there a possibility to speed up the startup phase? Which store limits do you recommend?

wutkemtt avatar Nov 28 '19 06:11 wutkemtt

@wutkemtt Do you know whether the time was spent recovering the RAFT log, or whether it had to do with DB access? For the RAFT log, unfortunately, I am not sure there is much I can do here. We use Hashicorp RAFT and the etcd/bbolt DB implementation for the RAFT logs.

kozlovic avatar Dec 02 '19 16:12 kozlovic

The log entry just shows "Recovering state..." Database activity is normal within that timespan, so possibly the service is recovering the raft log. Last time, this timespan was different for all three nodes and lasted from 30 minutes up to 1.5 hours.

wutkemtt avatar Dec 02 '19 17:12 wutkemtt