Readloop processing time when using authorization [v2.9.16]
Observed behavior
A server handling only a few hundred connections (CentOS 9, 2 CPUs, 4 GB RAM, running on GCP) works fine with this config:
listen: 0.0.0.0:4222
websocket {
  listen: 0.0.0.0:9222
  compression: false
  no_tls: true
}
When we add an auth section with 9 users, like so:
authorization {
  timeout: 1
  LIMITED_USER = {
    subscribe = ["somechan.>", "otherchan.>"]
    publish = {
      deny: ">"
    }
  }
  users [
    { user: user1, password: "xxx" }
    { user: user2, password: "yyy" }
    { user: user3, password: "zzzz" }
    ....
    { user: user8, password: "....", permissions: $LIMITED_USER }
    { user: user9, password: "...." }
  ]
}
Two seconds after starting the service we get:
2024/03/28 11:25:51.523011 [WRN] x.y.z.a:10998 - cid:76885 - Readloop processing time: 4.322702242s
After a minute or so it goes up:
2024/03/28 11:25:51.523011 [WRN] x.y.z.a:10998 - cid:76885 - Readloop processing time: 17.422702242s
NATS is also consuming almost 100% CPU, and clients are disconnecting.
We upgraded the server (4 CPUs, 16 GB RAM) and it did not solve the issue (although the Readloop processing time seemed to drop below 10 seconds).
Removing the auth section brought everything back to normal.
Expected behavior
The server should run without errors and clients should connect easily.
Server and client version
We tried both versions (which had worked for months before): 2.9.16 and 2.9.21.
Host environment
Originally we used (for months) a GCP VM running CentOS 7.0 with 2 CPUs and 4 GB RAM.
When it failed, we moved to a GCP VM running CentOS 9.0 with 2 CPUs and 4 GB RAM.
When that failed too, we upgraded to a GCP VM running CentOS 9.0 with 4 CPUs and 16 GB RAM.
None of them worked.
Steps to reproduce
No response
Hi @orchestd, will check, but just to confirm: you only saw this in the v2.9.x series?
Hi.
2.9.16 is the version we're using; switching to 2.9.21 was just in case the specific version we chose was buggy. We can try a different one.
BTW, we have reproduced the issue in a more vanilla manner:
- wrote a small program that connects to nats
- ran the nats server with auth that has 1 user (alice:foo)
- ran the code below 1000 times in parallel
- got the error
Then we ran the same test without auth, and all was OK.
code:
package main

import (
	"fmt"

	"github.com/nats-io/nats.go"
)

func main() {
	testConnection("defaultBackend", "nats://some.server:4222", "alice", "foo")
}

func testConnection(entity string, url string, username string, password string) {
	nc, err := nats.Connect(url, nats.UserInfo(username, password), nats.Name(entity))
	if err != nil {
		fmt.Println("ERROR: " + err.Error())
		return
	}
	defer nc.Close()
}
Update: changing
authorization {
  timeout: 1
to
authorization {
  timeout: 8
"solved" the issue, but it leaves a wider authorization window, which is a security risk.
Any ideas on how I can avoid the Readloop processing time issue without setting the authorization timeout to a large number, which is a security risk?
Can you share the complete server config file that accompanies the test that shows the issue?
The COMPLETE configuration is in the first post (nothing fancy is needed in the config; it's the bare minimum).
Reposting here:
listen: 0.0.0.0:4222
websocket {
  listen: 0.0.0.0:9222
  compression: false
  no_tls: true
}
authorization {
  timeout: 4
  SOME_PERM = {
    subscribe = ["xxx.>", "yyy.>"]
    publish = {
      deny: ">"
    }
  }
  users [
    { user: super, password: "....." }
    { user: regular, password: "....", permissions: $SOME_PERM }
  ]
}
So I used your config file and your program and ran it 100 times in parallel with no issues.
We must be doing something different, then. I am running a v2.10.15 server with the config above and the program (with the user/pass changed). Could you show exactly what command you used to run the program, and the exact program code used?
We run nats-server -c server.conf, no extra params.
As for the code: we have various services, including a WS frontend. When we look at the error
2024/03/28 11:25:51.523011 [WRN] x.y.z.a:10998 - cid:76885 - Readloop processing time: 17.422702242s
most of the x.y.z.a IP addresses are Go backends using this lib to connect to nats: https://github.com/orchestd/nats.io, calling NewNatsServiceWithBasicAuth and then Connect:
func (n *defaultNatsService) Connect(natsUrl, serviceName string, authOpt nats.Option,
	connectionAttempts, reconnectionAttempts, reconnectWaitSec, maxPingsOutstanding, pingIntervalSec int) error {
	opts := []nats.Option{nats.Name(serviceName)}
	opts = append(opts, authOpt)
	opts = append(opts, nats.ReconnectWait(time.Duration(reconnectWaitSec)*time.Second))
	opts = append(opts, nats.MaxReconnects(reconnectionAttempts))
	opts = append(opts, nats.MaxPingsOutstanding(maxPingsOutstanding))
	opts = append(opts, nats.PingInterval(time.Duration(pingIntervalSec)*time.Second))
	opts = append(opts, nats.ErrorHandler(func(_ *nats.Conn, s *nats.Subscription, err error) {
		n.logger.Error(context.Background(), n.formatErrorMsg("nats.ErrorHandler fired", err))
	}))
	opts = append(opts, nats.DisconnectErrHandler(func(nc *nats.Conn, err error) {
		n.logger.Error(context.Background(), n.formatErrorMsg("nats.DisconnectErrHandler fired", err))
	}))
	opts = append(opts, nats.ReconnectHandler(func(nc *nats.Conn) {
		n.logger.Warn(context.Background(), n.formatErrorMsg("nats.ReconnectHandler fired", nil))
	}))
	err := n.connect(natsUrl, opts)
	if err != nil {
		n.logger.Error(context.Background(), n.formatErrorMsg("can't connect to nats server", err))
		if connectionAttempts < 2 {
			return err
		}
		go func() {
			var connectionError error
			for i := 1; i <= connectionAttempts; i++ {
				time.Sleep(time.Duration(reconnectWaitSec) * time.Second)
				connectionError = n.connect(natsUrl, opts)
				if connectionError == nil {
					break
				}
				n.logger.Error(context.Background(), n.formatErrorMsg("can't connect to nats server attempt - "+fmt.Sprint(i), connectionError))
			}
			connectionError = fmt.Errorf(n.formatErrorMsg(fmt.Sprintf("can't connect to nats server after %v attempts. Stop trying.", connectionAttempts), connectionError))
			if n.connectionFailedHandler != nil {
				n.connectionFailedHandler(connectionError)
			}
			n.logger.Error(context.Background(), connectionError.Error())
		}()
		return nil
	}
	return nil
}
Hope this sheds more light.
Just a hint that would simplify the code a lot:
You can use the RetryOnFailedConnect option, which allows the built-in nats client reconnect logic to kick in even if the initial connection failed.
Otherwise I feel it's easy to spawn a lot of new connections.
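A minimal sketch of that option (assuming nats.go v1.x; URL and credentials are the placeholders from earlier in the thread). With RetryOnFailedConnect, Connect returns immediately and the client keeps retrying in the background, replacing the hand-rolled retry goroutine in the Connect method above:

```go
package main

import (
	"log"
	"time"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect("nats://some.server:4222",
		nats.UserInfo("alice", "foo"),
		nats.RetryOnFailedConnect(true),   // keep retrying even if the first attempt fails
		nats.MaxReconnects(-1),            // never give up
		nats.ReconnectWait(2*time.Second), // pause between attempts
	)
	if err != nil {
		// Only returned for non-retryable problems, e.g. bad options.
		log.Fatal(err)
	}
	defer nc.Close()
}
```

With this option set, a failed initial dial leaves the connection in a reconnecting state instead of returning an error, so no extra connection-spawning loop is needed.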