nats-streaming-server
A single client causing a timeout on all publishers for different channels on the cluster.
Hello guys.
I have a scenario where all of my publishes are timing out under heavy cluster load. I would understand that if the cluster had some hardware limitation, but there is none and there is no shortage of resources (today we run 3 instances on AWS, each with 32 GB of memory, 8 cores and 1.5 TB of SSD disk).
Again, it is normal for the CLIENT generating the load to time out given the amount of work it is trying to do at once, but the cluster should not let those consequences spill over to all other publishers while it still has resources to spare.
My test scenario (which also happens in production) is basically this:
Cluster Configuration:
# Nats Server Local
streaming {
  cluster_id: local-nats
  store: file
  dir: /nats/data

  store_limits {
    max_channels: 0
    max_msgs: 0
    max_bytes: 0
    max_subs: 0
    max_inactivity: "240h"
    max_age: "120h"
  }

  file_options {
    file_descriptors_limit: 1000000
  }
}
The server runs in Docker, with this compose service:
nats-local:
  image: nats-streaming:0.16.2
  container_name: "nats-local"
  hostname: "nats"
  command: "-c /nats/nats-local.conf -m 8222"
  ports:
    - 4222:4222
    - 6222:6222
    - 8222:8222
  volumes:
    - /home/renan/projects/local/nats-local:/nats:rw
We use two applications for the test. The first one generates the load with this number of clients/channels:
package main

import (
	"fmt"
	"log"
	"sync"

	nats "github.com/nats-io/nats.go"
	stan "github.com/nats-io/stan.go"
)

const (
	tenant         = "goa-timeout-tester-1%d"
	serverURL      = "nats://ANOTHER_SERVER(NOT-LOCAL-HOST):4222"
	clusterID      = "local-nats"
	channel        = "goa.timeout.tester.publisher-1%d.%d"
	publisherCount = 35
	clientCount    = 100
	messageCount   = 1000
)

func main() {
	var wg sync.WaitGroup
	wg.Add(clientCount)

	for client := 0; client < clientCount; client++ {
		go func(client int) {
			defer wg.Done()

			// NATS connection
			natsConnection, err := nats.Connect(serverURL, nats.MaxReconnects(-1))
			if err != nil {
				panic(err)
			}

			// Streaming connection
			stanConnection, err := stan.Connect(clusterID, fmt.Sprintf(tenant, client), stan.NatsConn(natsConnection), stan.Pings(10, 100))
			if err != nil {
				panic(err)
			}

			// Fan out publisherCount concurrent publishers per client, each on its own channel.
			wg.Add(publisherCount)
			for p := 0; p < publisherCount; p++ {
				go publisher(client, p, stanConnection, &wg)
			}
		}(client)
	}

	wg.Wait()
}

func publisher(client, publisher int, stanConnection stan.Conn, wg *sync.WaitGroup) {
	defer wg.Done()
	log.Printf("publication started on channel: %s\n", fmt.Sprintf(channel, client, publisher))

	for messages := 0; messages < messageCount; messages++ {
		// Fire-and-forget async publishes; ack errors (including timeouts) surface in the callback.
		uid, err := stanConnection.PublishAsync(fmt.Sprintf(channel, client, publisher), []byte(`Message of publisher`), func(ackedNuid string, err error) {
			if err != nil {
				log.Printf("Warning: error publishing msg id %s: %v\n", ackedNuid, err.Error())
			}
		})
		if err != nil {
			log.Println(uid, err)
		}
	}

	log.Printf("publication finished on channel: %s\n", fmt.Sprintf(channel, client, publisher))
}
The other application is IDENTICAL, except that the number of messages sent is much larger (just to generate the exception):
const (
	tenant         = "goa-timeout-tester-1%d"
	serverURL      = "nats://ANOTHER_SERVER(NOT-LOCAL-HOST):4222"
	clusterID      = "local-nats"
	channel        = "goa.timeout.tester.publisher-1%d.%d"
	publisherCount = 35
	clientCount    = 1
	messageCount   = 100000
)
The error returned seems to be generic.
Versions:
nats-streaming-server: 0.16.2
nats.go: master
stan.go: master
I believe it is normal for the second application (the single heavy client) to get timeouts, given the volume it is publishing; however, the first application also gets timeouts even though the cluster is healthy, and that does not seem normal to me.
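For what it's worth, if I understand stan.go correctly, the publish ack timeout on the client side is governed by how long the client waits for each ack (PubAckWait) and by how many unacknowledged async publishes a connection may keep in flight (MaxPubAcksInflight). A minimal sketch of the heavy publisher with those knobs capped (the client ID, channel name and values below are only illustrative, not what we actually run):

package main

import (
	"log"
	"time"

	stan "github.com/nats-io/stan.go"
)

func main() {
	// Same cluster ID and server as above; the extra options only bound how
	// hard this single client can push (the numbers are arbitrary examples).
	sc, err := stan.Connect(
		"local-nats",
		"goa-timeout-tester-throttled", // illustrative client ID
		stan.NatsURL("nats://ANOTHER_SERVER(NOT-LOCAL-HOST):4222"),
		stan.MaxPubAcksInflight(1024),   // cap unacknowledged async publishes on this connection
		stan.PubAckWait(30*time.Second), // how long to wait for each ack before a timeout is reported
	)
	if err != nil {
		log.Fatal(err)
	}
	defer sc.Close()

	for i := 0; i < 100000; i++ {
		// PublishAsync blocks once MaxPubAcksInflight unacked messages are pending,
		// so the client cannot flood the server faster than acks come back.
		_, err := sc.PublishAsync("goa.timeout.tester.throttled.0", []byte("Message of publisher"),
			func(nuid string, err error) {
				if err != nil {
					log.Printf("Warning: error publishing msg id %s: %v\n", nuid, err)
				}
			})
		if err != nil {
			log.Println(err)
		}
	}
}

Even with the single client throttled like this, the question remains why its load makes publishers on completely different channels time out.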
I am available to clarify anything! Thanks in advance.