
A single client causes timeouts for all publishers on different channels in the cluster.

renanberto opened this issue on Nov 14, 2019 · 7 comments

Hello guys.

I have a scenario where all my publications time out under high load on the cluster. I would understand that if the cluster had some hardware limitation, but there is none and there is no shortage of resources (today we have 3 instances in AWS, each with 32GB of memory, 8 cores, and 1.5TB of SSD disk).

Again, it is normal for the CLIENT generating the load to time out given how much it is trying to do at once, but the cluster should not pass those consequences on to every other publisher while it still has resources to spare.
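For reference, the only client-side knobs I'm aware of for this are stan.go's MaxPubAcksInflight and PubAckWait; here is a minimal sketch of a connection that uses them (the URL, client ID, channel name, and numbers are placeholders, not our real setup):

package main

import (
	"log"
	"time"

	nats "github.com/nats-io/nats.go"
	stan "github.com/nats-io/stan.go"
)

func main() {
	// Placeholder URL and IDs; adjust to the real cluster.
	nc, err := nats.Connect("nats://localhost:4222", nats.MaxReconnects(-1))
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()

	// MaxPubAcksInflight caps how many async publishes may be awaiting an ack,
	// and PubAckWait controls how long the client waits before reporting a timeout.
	sc, err := stan.Connect("local-nats", "bounded-publisher",
		stan.NatsConn(nc),
		stan.MaxPubAcksInflight(512),
		stan.PubAckWait(10*time.Second))
	if err != nil {
		log.Fatal(err)
	}
	defer sc.Close()

	if _, err := sc.PublishAsync("some.channel", []byte("payload"), func(nuid string, err error) {
		if err != nil {
			log.Printf("ack error for %s: %v", nuid, err)
		}
	}); err != nil {
		log.Fatal(err)
	}
}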

My test scenario (which also happens in production) is basically this:

Cluster Configuration:

# Nats Server Local

streaming {
  cluster_id: local-nats
  store: file
  dir: /nats/data
  
  store_limits {
      max_channels: 0
      max_msgs: 0
      max_bytes: 0
      max_subs: 0
      max_inactivity: "240h"
      max_age: "120h"
  }

  file_options {
    file_descriptors_limit: 1000000
  }
}

The application runs in Docker, with this compose:

nats-local:
  image: nats-streaming:0.16.2  
  container_name: "nats-local"
  hostname: "nats"
  command: "-c /nats/nats-local.conf -m 8222"
  ports:
    - 4222:4222
    - 6222:6222
    - 8222:8222
  volumes:
    - /home/renan/projects/local/nats-local:/nats:rw

We use two applications for the test. The first generates the load with this number of clients/channels:

package main

import (
	"fmt"
	"log"
	"sync"

	nats "github.com/nats-io/nats.go"
	stan "github.com/nats-io/stan.go"
)

const (
	tenant         = "goa-timeout-tester-1%d"
	serverURL      = "nats://ANOTHER_SERVER(NOT-LOCAL-HOST):4222"
	clusterID      = "local-nats"
	channel        = "goa.timeout.tester.publisher-1%d.%d"
	publisherCount = 35
	clientCount    = 100
	messageCount   = 1000
)

func main() {
	var wg sync.WaitGroup
	wg.Add(clientCount)

	for client := 0; client < clientCount; client++ {
		go func(client int) {
			defer wg.Done()

			// Nats connection
			natsConnection, err := nats.Connect(serverURL, nats.MaxReconnects(-1))
			if err != nil {
				panic(err)
			}

			// Streaming connection
			stanConnection, err := stan.Connect(clusterID, fmt.Sprintf(tenant, client), stan.NatsConn(natsConnection), stan.Pings(10, 100))
			if err != nil {
				panic(err)
			}

			wg.Add(publisherCount)
			for p := 0; p < publisherCount; p++ {
				go publisher(client, p, stanConnection, &wg)
			}

		}(client)
	}

	wg.Wait()
}

func publisher(client, publisher int, stanConnection stan.Conn, wg *sync.WaitGroup) {
	defer wg.Done()
	log.Printf("publication started on channel: %s\n", fmt.Sprintf(channel, client, publisher))

	for messages := 0; messages < messageCount; messages++ {
		uid, err := stanConnection.PublishAsync(fmt.Sprintf(channel, client, publisher), []byte(`Message of publisher`), func(ackedNuid string, err error) {
			if err != nil {
				log.Printf("Warning: error publishing msg id %s: %v\n", ackedNuid, err.Error())
			}
		})

		if err != nil {
			log.Println(uid, err)
		}
	}

	log.Printf("publication finished on channel: %s\n", fmt.Sprintf(channel, client, publisher))

}
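For completeness, note that PublishAsync only queues the message, so the publisher above finishes as soon as everything is enqueued, not when the server has acked it. A hypothetical variant that waits for all outstanding acks (reusing the constants from the program above) could look like this:

// publisherWaitingForAcks is a hypothetical variant of publisher that blocks
// until every async publish has been acked (or errored) before returning.
func publisherWaitingForAcks(client, publisher int, stanConnection stan.Conn, wg *sync.WaitGroup) {
	defer wg.Done()

	subject := fmt.Sprintf(channel, client, publisher)
	var acks sync.WaitGroup

	for messages := 0; messages < messageCount; messages++ {
		acks.Add(1)
		_, err := stanConnection.PublishAsync(subject, []byte(`Message of publisher`), func(ackedNuid string, err error) {
			defer acks.Done()
			if err != nil {
				log.Printf("Warning: error publishing msg id %s: %v\n", ackedNuid, err)
			}
		})
		if err != nil {
			// PublishAsync itself failed, so the ack handler will never run
			// for this message; release its slot here instead.
			acks.Done()
			log.Println(err)
		}
	}

	// Only report the channel as finished once the server has acked everything.
	acks.Wait()
	log.Printf("all publishes acked on channel: %s\n", subject)
}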

The other application is IDENTICAL, except that the number of messages sent is much larger (just to trigger the error):

const (
	tenant         = "goa-timeout-tester-1%d"
	serverURL      = "nats://ANOTHER_SERVER(NOT-LOCAL-HOST):4222"
	clusterID      = "local-nats"
	channel        = "goa.timeout.tester.publisher-1%d.%d"
	publisherCount = 35
	clientCount    = 1
	messageCount   = 100000
)
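For scale: this single client pushes 35 × 100,000 = 3.5 million async publishes over 35 channels, which is the same total volume as the first application (100 × 35 × 1,000 = 3.5 million) but concentrated in one connection.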

The error seems to be generic (screenshot attached: Screenshot_20191114_122537).

Versions:

nats-streaming-server: 0.16.2
nats.go: master
stan.go: master

I believe it is normal for the first application to hit timeouts because of the volume it is publishing; however, the second application also gets publish timeouts even though the cluster is healthy, and that does not seem normal to me.
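Since the compose file starts the server with -m 8222, I can also poll the monitoring endpoint while reproducing to confirm the server looks healthy; a rough sketch (host and port assumed from the setup above):

package main

import (
	"fmt"
	"io"
	"log"
	"net/http"
)

func main() {
	// /streaming/serverz is the NATS Streaming monitoring endpoint exposed on
	// the -m port; host and port are assumed from the docker-compose file above.
	resp, err := http.Get("http://localhost:8222/streaming/serverz")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(string(body))
}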

I am available to clarify anything! Thanks in advance.
