nats-server icon indicating copy to clipboard operation
nats-server copied to clipboard

Readloop processing time when using authorization [v2.9.16]

Open orchestd opened this issue 2 years ago • 9 comments

Observed behavior

A server running only few hundreds of connections (centos 9, 2 CPUs, 4GB RAM running on GCP), is working fine with this config:

listen: 0.0.0.0:4222
websocket {
  listen: 0.0.0.0:9222
  compression: false
  no_tls: true
}

When we add an auth section with 9 users, like so:

authorization {
  timeout: 1

  LIMITED_USER = {
    subscribe = ["somechan.>", "otherchan.>"]
    publish = {
        deny: ">"
    }
  }

  users [
    { user: user1, password: "xxx" }    
    { user: user2, password: "yyy" }
    { user: user3, password: "zzzz" }
....
    { user: user8, password: "....", permissions: $LIMITED_USER }
    { user: user9, password: "...." }
  ]
}

2 sec after starting the service we are getting

2024/03/28 11:25:51.523011 [WRN] x.y.z.a:10998 - cid:76885 - Readloop processing time: 4.322702242s

after a minute or so, it goes up

2024/03/28 11:25:51.523011 [WRN] x.y.z.a:10998 - cid:76885 - Readloop processing time: 17.422702242s

also, nats is taking almost 100% CPU, clients are disconnecting

We have upgraded the server (4 CPUs 16G RAM) and it did not solve the issue (although looks like Readloop processing time was less than 10 sec)

Removed the auth section - and all is back to normal.

Expected behavior

The server should run with no error, clients should connect easily

Server and client version

We tried both version (which worked before for months): 2.9.16 and 2.9.21

Host environment

Originally we used (for months) GCP VM, running Centos 7.0 2 CPUs, 4GB RAM

When it failed, we moved to GCP VM, running Centos 9.0 2 CPUs, 4GB RAM

When it failed, we upgraded to GCP VM, running Centos 9.0 4 CPUs, 16GB RAM

None worked...

Steps to reproduce

No response

orchestd avatar Mar 28 '24 11:03 orchestd

Hi @orchestd will check, but just to confirm that you only saw this in v2.9.X series?

wallyqs avatar Mar 28 '24 13:03 wallyqs

Hi.

2.9.16 was the version we're using, changing the version to 2.9.21 was just "in case" the specific version we chose was buggy, we can try a diff one .

BTW, we have reconstructed the issue in a more vanilla manner

  1. wrote a small program that connect nats
  2. ran the nats server with auth - that has 1 user (alice:foo)
  3. ran the below code X 1000 times on parallel
  4. got the error

then we ran the same without auth - all was ok

code:

package main

import (
	"fmt"
	"github.com/nats-io/nats.go"
	"time"
)


func main() {
	testConnection("defaultBackend", "nats://some.server:4222", "alice", "foo")

}

func testConnection(entity string, url string, username string, password string) {

	nc, err := nats.Connect(url, nats.UserInfo(username, password), nats.Name(entity)
	if err != nil {
		fmt.Println("ERROR: " + err.Error())
	}

	defer nc.Close()

}

orchestd avatar Mar 28 '24 14:03 orchestd

Update: changing

authorization {
  timeout: 1

to

authorization {
  timeout: 8

"solved" the issue, but now its a risk of false authorization window

orchestd avatar Apr 01 '24 08:04 orchestd

Any ideas on how can I avoid the Readloop processing time issue without setting authorization timeout to a large number which is a security risk ?

orchestd avatar Apr 21 '24 08:04 orchestd

Can you share the complete server config file that accompanies the test that shows the issue?

derekcollison avatar Apr 22 '24 10:04 derekcollison

The COMPLETE configuration is on the first post (nothing fancy needed on config - its the bare minimum)

reposting here:

listen: 0.0.0.0:4222

websocket {
  listen: 0.0.0.0:9222
  compression: false
  no_tls: true
}

authorization {
    timeout: 4
    
    SOME_PERM = {
    subscribe = ["xxx.>", "yyy.>"]
    publish = {
    deny: ">"
    }
    }
    
    users [
    { user: super, password: "....." }
    { user: regular, password: "....", permissions: $SOME_PERM }
    ]
    }

orchestd avatar Apr 24 '24 08:04 orchestd

So I used your config file and your program and ran it in parallel 100 times with no issues.

So we must be doing something different. So I am running a v2.10.15 server with the config above and the program (changed the user/pass). So could you show exactly what command you used to run the program and the exact program code used?

derekcollison avatar Apr 25 '24 04:04 derekcollison

We run nats-server -c server.conf no extra params

As for the code, we have various services, including WS frontend When we look at the error 2024/03/28 11:25:51.523011 [WRN] x.y.z.a:10998 - cid:76885 - Readloop processing time: 17.422702242s

most of the x.y.z.a IP addresses are go backends using this lib to connect to nats https://github.com/orchestd/nats.io using NewNatsServiceWithBasicAuth and then connect:

func (n *defaultNatsService) Connect(natsUrl, serviceName string, authOpt nats.Option,
	connectionAttempts, reconnectionAttempts, reconnectWaitSec, maxPingsOutstanding, pingIntervalSec int) error {
	opts := []nats.Option{nats.Name(serviceName)}
	opts = append(opts, authOpt)

	opts = append(opts, nats.ReconnectWait(time.Duration(reconnectWaitSec)*time.Second))

	opts = append(opts, nats.MaxReconnects(reconnectionAttempts))

	opts = append(opts, nats.MaxPingsOutstanding(maxPingsOutstanding))

	opts = append(opts, nats.PingInterval(time.Duration(pingIntervalSec)*time.Second))

	opts = append(opts, nats.ErrorHandler(func(_ *nats.Conn, s *nats.Subscription, err error) {

		n.logger.Error(context.Background(), n.formatErrorMsg("nats.ErrorHandler fired", err))
	}))

	opts = append(opts, nats.DisconnectErrHandler(func(nc *nats.Conn, err error) {
		n.logger.Error(context.Background(), n.formatErrorMsg("nats.DisconnectErrHandler fired", err))
	}))

	opts = append(opts, nats.ReconnectHandler(func(nc *nats.Conn) {
		n.logger.Warn(context.Background(), n.formatErrorMsg("nats.ReconnectHandler fired", nil))
	}))

	err := n.connect(natsUrl, opts)
	if err != nil {
		n.logger.Error(context.Background(), n.formatErrorMsg("can't connect to nats server", err))
		if connectionAttempts < 2 {
			return err
		}
		go func() {
			var connectionError error
			for i := 1; i <= connectionAttempts; i++ {
				time.Sleep(time.Duration(reconnectWaitSec) * time.Second)
				connectionError = n.connect(natsUrl, opts)
				if connectionError == nil {
					break
				}
				n.logger.Error(context.Background(), n.formatErrorMsg("can't connect to nats server attempt - "+fmt.Sprint(i), connectionError))
			}

			connectionError = fmt.Errorf(n.formatErrorMsg(fmt.Sprintf("can't connect to nats server after %v attempts. Stop trying.", connectionAttempts), connectionError))

			if n.connectionFailedHandler != nil {
				n.connectionFailedHandler(connectionError)
			}
			n.logger.Error(context.Background(), connectionError.Error())
		}()
		return nil
	}
	return nil
}

Hope this sheds more light

orchestd avatar May 08 '24 09:05 orchestd

Just a hint that would simplify the code a lot: You can use RetryOnFailedConnect option, that would allow to kick in the built-in nats clients reconnecting option even if initial connection failed.

Otherwise I feel it's easy to spawn a lot of new connections.

Jarema avatar May 15 '24 13:05 Jarema