nats-server unhealthy after upgrade to 2.8.2
We decided to upgrade our nats-server from 2.8.1 to 2.8.2. This is a 3-node NATS cluster running on k8s, and we are using JetStream. After running the helm upgrade command, pods 1 and 2 became healthy, but pod-0 remains unhealthy. I see the following errors in the logs:
[141] 2022/05/05 17:10:23.842745 [WRN] Healthcheck failed: "JetStream stream 'ABVUPGICKJDHVA72VTGROSUWR3WESAAPULWZKGA46VZ25CLS6QNSJNOV > events-s125' is not current"
[141] 2022/05/05 17:10:33.842910 [WRN] Healthcheck failed: "JetStream stream 'ABVUPGICKJDHVA72VTGROSUWR3WESAAPULWZKGA46VZ25CLS6QNSJNOV > events-s107' is not current"
[141] 2022/05/05 17:10:43.843105 [WRN] Healthcheck failed: "JetStream stream 'ABVUPGICKJDHVA72VTGROSUWR3WESAAPULWZKGA46VZ25CLS6QNSJNOV > events-s107' is not current"
[141] 2022/05/05 17:10:53.842605 [WRN] Healthcheck failed: "JetStream stream 'ABVUPGICKJDHVA72VTGROSUWR3WESAAPULWZKGA46VZ25CLS6QNSJNOV > events-s125' is not current"
[141] 2022/05/05 17:11:03.842650 [WRN] Healthcheck failed: "JetStream stream 'ABVUPGICKJDHVA72VTGROSUWR3WESAAPULWZKGA46VZ25CLS6QNSJNOV > events-s46' is not current"
[141] 2022/05/05 17:11:13.843200 [WRN] Healthcheck failed: "JetStream stream 'ABVUPGICKJDHVA72VTGROSUWR3WESAAPULWZKGA46VZ25CLS6QNSJNOV > events-s125' is not current"
The server logs are attached here: nats-0.log, nats-1.log, nats-2.log.
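For reference, the failing health check is the server's /healthz endpoint on the monitoring port (8222 by default in the helm chart). A minimal way to query it directly from the unhealthy pod, assuming the pod and namespace names used in this deployment (nats-server-0 in namespace bb, container nats) and the busybox wget available in the alpine image:

kubectl exec -n bb nats-server-0 -c nats -- wget -qO- http://localhost:8222/healthz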
I would also like to add that we have found JetStream Raft to be generally shaky when it comes to upgrades. Three weeks ago we upgraded from 2.6.x to 2.8.1. At that time we also had Raft / JetStream related issues and had to delete the PVCs to bring everything back to a healthy state. I would appreciate any help on how to solve this, and any recommendations on upgrading nats-server in general.
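Before resorting to deleting PVCs, it can help to confirm which replicas are actually lagging and whether the meta cluster has a leader. A rough sketch using the nats CLI, with hypothetical credential files (sys.creds for the system account, app.creds for the account that owns the streams), the client port port-forwarded to localhost, and TLS options omitted:

# Cluster-wide JetStream / meta-cluster overview (requires the system account)
nats server report jetstream -s nats://localhost:4222 --creds sys.creds

# Replica state for one of the streams from the logs, including which peer is outdated
nats stream info events-s125 -s nats://localhost:4222 --creds app.creds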
We have been upgrading JetStream for stability and additional feature completeness at a fairly rapid rate. Our hope is that it is quickly stabilizing and that things will keep getting better as we continue to improve. Thanks for your patience.
I noticed some log items showing that you are failing to fetch accounts from the account resolver. It might be good for you to give us an overview of your setup.
Thanks, @derekcollison! We are using the NATS built-in resolver. This is what our NATS helm chart values file looks like. I have changed the domain and curtailed the nkeys/JWTs before sharing.
#
# helm upgrade --install -n bb nats-server nats/nats -f 2021/nats_cluster_values_qa.yaml
#
# nameOverride: nats-server
useFQDN: false
nodeSelector:
  lke.linode.com/pool-id: "xyz"
statefulSetAnnotations:
  secret.reloader.stakater.com/reload: "tls-cert"
nats:
  client:
    port: 4222
  # healthcheck:
  #   startup:
  #     enabled: false
  #   liveness:
  #     enabled: false
  advertise: false
  externalAccess: true
  image: nats:2.8.2-alpine
  limits:
    maxPayload: 32Mb
  logging:
    debug: false
    trace: false
  jetstream:
    enabled: true
    fileStorage:
      enabled: true
      size: 50Gi
      storageDirectory: /nats/jetstream
  resources:
    requests:
      memory: 2Gi
    limits:
      memory: 2Gi
  tls:
    secret:
      name: tls-cert
    cert: tls.crt
    key: tls.key
    timeout: 60s
    allowNonTLS: false
    insecure: false
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          namespaces:
            - bb
          labelSelector:
            matchLabels:
              app.kubernetes.io/instance: nats-server
              app.kubernetes.io/name: nats
          topologyKey: kubernetes.io/hostname
cluster:
  enabled: true
  replicas: 3
natsbox:
  enabled: false
reloader:
  enabled: true
  image: natsio/nats-server-config-reloader:0.6.3
  pullPolicy: IfNotPresent
exporter:
  enabled: true
  image: natsio/prometheus-nats-exporter:0.9.1
  pullPolicy: IfNotPresent
  # Prometheus operator ServiceMonitor support. Exporter has to be enabled
  serviceMonitor:
    enabled: true
    namespace: monitoring
    labels:
      release: prometheus
    path: /metrics
auth:
  enabled: true
  timeout: 10s
  operatorjwt:
    configMap:
      name: nats-credentials
      key: Operator.jwt
  systemAccount: ACWXVSUJK5S7PXX5
  resolver:
    type: full
    operator: eyJ0eXAwjAg
    systemAccount: ACWXXX5
    store:
      dir: /etc/nats-config/accounts/jwt
      size: 10Gi
    resolverPreload:
      ACWXX5: eyJF0Cg
websocket:
  enabled: true
  port: 443
  allowedOrigins:
    - https://example.com
    - https://console.example.com
    - https://kubedb.example.com
    - https://grafana.example.com
  tls:
    secret:
      name: tls-cert
    cert: tls.crt
    key: tls.key
    timeout: 60s
I did another round of restarts to turn on debug and trace level logging. This time all the pods became healthy.
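For completeness, that change was just the logging block in the same values file:

nats:
  logging:
    debug: true
    trace: true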
Will loop in @wallyqs and @caleb to take a look at the helm chart.
This is the final generated config used by nats-server.
/etc # cat /etc/nats-config/nats.conf
# NATS Clients Port
port: 4222
# PID file shared with configuration reloader.
pid_file: "/var/run/nats/nats.pid"
###############
#             #
#  Monitoring #
#             #
###############
http: 8222
server_name:$POD_NAME
#####################
#                   #
# TLS Configuration #
#                   #
#####################
tls {
  cert_file: /etc/nats-certs/clients/tls-cert/tls.crt
  key_file: /etc/nats-certs/clients/tls-cert/tls.key
  timeout: 60s
}
###################################
#                                 #
#          NATS JetStream         #
#                                 #
###################################
jetstream {
  max_mem: 1Gi
  store_dir: /nats/jetstream
  max_file:50Gi
}
###################################
#                                 #
# NATS Full Mesh Clustering Setup #
#                                 #
###################################
cluster {
  port: 6222
  name: nats
  routes = [
    nats://nats-server-0.nats-server.bb:6222,nats://nats-server-1.nats-server.bb:6222,nats://nats-server-2.nats-server.bb:6222,
  ]
  cluster_advertise: $CLUSTER_ADVERTISE
  connect_retries: 120
}
max_payload: 32Mb
lame_duck_grace_period: 10s
lame_duck_duration: 30s
##################
#                #
#    Websocket   #
#                #
##################
websocket {
  port: 443
  tls {
    cert_file: /etc/nats-certs/ws/tls-cert/tls.crt
    key_file: /etc/nats-certs/ws/tls-cert/tls.key
  }
  same_origin: false
  allowed_origins: ["https://example.com","https://console.example.com","https://kubedb.example.com","https://grafana.example.com"]
}
##################
#                #
#  Authorization #
#                #
##################
authorization {
  timeout: 10s
}
operator: eyJ0ewjAg
system_account: ACWXVPXX5
resolver: {
  type: full
  dir: "/etc/nats-config/accounts/jwt"
  allow_delete: false
  interval: "2m"
}
resolver_preload: {"ACWF0Cg"}
system_account: ACWXJPXX5
Looped in a few folks to peek at all the configs.
Requests/limits look a little low. I'd recommend a request of around 2 CPU and 8Gi RAM for a production setup.
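In the values file shared above, that recommendation would look roughly like the following (a sketch only, mirroring the structure of the existing resources block):

nats:
  resources:
    requests:
      cpu: "2"
      memory: 8Gi
    limits:
      memory: 8Gi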
What is the storage system backing this? Is it fast block storage? We don't recommend shared filesystems like NFS.
This is our QA instance, so it is underpowered / under-provisioned. The production instance has 6 CPU and 14 GB RAM.
The storage backend is Linode's cloud block storage (NVMe).
@tamalsaha Closing since this is quite old, including the server version. Feel free to reopen/open a new issue if you have any more questions.