
nats-server unhealthy after upgrade to 2.8.2

Open tamalsaha opened this issue 3 years ago • 9 comments

We decided to upgrade our nats-server from 2.8.1 to 2.8.2. This is a 3-node NATS cluster running on Kubernetes, and we are using JetStream. After running the helm upgrade command, pods 1 and 2 became healthy, but pod-0 remains unhealthy. I see the following error in the logs:

[141] 2022/05/05 17:10:23.842745 [WRN] Healthcheck failed: "JetStream stream 'ABVUPGICKJDHVA72VTGROSUWR3WESAAPULWZKGA46VZ25CLS6QNSJNOV > events-s125' is not current"
[141] 2022/05/05 17:10:33.842910 [WRN] Healthcheck failed: "JetStream stream 'ABVUPGICKJDHVA72VTGROSUWR3WESAAPULWZKGA46VZ25CLS6QNSJNOV > events-s107' is not current"
[141] 2022/05/05 17:10:43.843105 [WRN] Healthcheck failed: "JetStream stream 'ABVUPGICKJDHVA72VTGROSUWR3WESAAPULWZKGA46VZ25CLS6QNSJNOV > events-s107' is not current"
[141] 2022/05/05 17:10:53.842605 [WRN] Healthcheck failed: "JetStream stream 'ABVUPGICKJDHVA72VTGROSUWR3WESAAPULWZKGA46VZ25CLS6QNSJNOV > events-s125' is not current"
[141] 2022/05/05 17:11:03.842650 [WRN] Healthcheck failed: "JetStream stream 'ABVUPGICKJDHVA72VTGROSUWR3WESAAPULWZKGA46VZ25CLS6QNSJNOV > events-s46' is not current"
[141] 2022/05/05 17:11:13.843200 [WRN] Healthcheck failed: "JetStream stream 'ABVUPGICKJDHVA72VTGROSUWR3WESAAPULWZKGA46VZ25CLS6QNSJNOV > events-s125' is not current"

The server logs are attached here. nats-0.log nats-1.log nats-2.log
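
For anyone who wants to reproduce the check, this is roughly how I am looking at per-pod health and the streams named in the warnings (a rough sketch; pod, namespace, and container names follow this deployment, and the nats CLI part needs client credentials for the affected account since natsbox is disabled here):

# Health endpoint on the monitoring port (8222) of the unhealthy pod
kubectl exec -n bb nats-server-0 -c nats -- wget -qO- http://localhost:8222/healthz

# Replica state of one of the streams from the warnings
nats stream info events-s125 --json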

I would also like to add that we have found JetStream Raft to be generally shaky when it comes to upgrades. Three weeks ago we upgraded from 2.6.x to 2.8.1; at that time we also had Raft / JetStream related issues and had to delete the PVCs to bring everything back to a healthy state. I would appreciate any help on how to solve this, and any recommendations on upgrading nats-server in general.
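
For completeness, the upgrade itself was just the helm command noted at the top of the values file shared below, plus watching the StatefulSet roll, roughly:

helm upgrade --install -n bb nats-server nats/nats -f 2021/nats_cluster_values_qa.yaml
kubectl -n bb rollout status statefulset/nats-server --timeout=10m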

tamalsaha avatar May 05 '22 17:05 tamalsaha

We have been upgrading JetStream for stability and additional feature completeness at a fairly rapid rate. Our hope is that this quickly stabilizes and that things keep getting better as we continue to improve. Thanks for your patience.

derekcollison avatar May 05 '22 19:05 derekcollison

I noticed some log entries showing that the server is failing to fetch accounts from the account resolver. It might be good for you to give us an overview of your setup.
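
In the meantime, something along these lines (adjust the namespace, pod name, and resolver store directory to your deployment; the path here is only a guess at a typical layout) would tell us whether the account JWTs are actually on disk in the resolver store of the pod that keeps failing:

kubectl exec -n <namespace> nats-server-0 -c nats -- ls -l /etc/nats-config/accounts/jwt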

derekcollison avatar May 05 '22 19:05 derekcollison

Thanks, @derekcollison! We are using the NATS built-in resolver. This is what our NATS Helm chart values file looks like. I have changed the domain and truncated the nkeys/JWTs before sharing.

#
# helm upgrade --install -n bb nats-server nats/nats -f 2021/nats_cluster_values_qa.yaml
#

# nameOverride: nats-server

useFQDN: false
nodeSelector:
  lke.linode.com/pool-id: "xyz"
statefulSetAnnotations:
  secret.reloader.stakater.com/reload: "tls-cert"

nats:
  client:
    port: 4222

  # healthcheck:
  #   startup:
  #     enabled: false
  #   liveness:
  #     enabled: false

  advertise: false
  externalAccess: true

  image: nats:2.8.2-alpine
  limits:
    maxPayload: 32Mb

  logging:
    debug: false
    trace: false

  jetstream:
    enabled: true

    fileStorage:
      enabled: true
      size: 50Gi
      storageDirectory: /nats/jetstream


  resources:
    requests:
      memory: 2Gi
    limits:
      memory: 2Gi
  tls:
    secret:
      name: tls-cert
    cert: tls.crt
    key: tls.key
    timeout: 60s
    allowNonTLS: false
    insecure: false

affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      podAffinityTerm:
        namespaces:
        - bb
        labelSelector:
          matchLabels:
            app.kubernetes.io/instance: nats-server
            app.kubernetes.io/name: nats
        topologyKey: kubernetes.io/hostname

cluster:
  enabled: true
  replicas: 3

natsbox:
  enabled: false

reloader:
  enabled: true
  image: natsio/nats-server-config-reloader:0.6.3
  pullPolicy: IfNotPresent

exporter:
  enabled: true
  image: natsio/prometheus-nats-exporter:0.9.1
  pullPolicy: IfNotPresent
  # Prometheus operator ServiceMonitor support. Exporter has to be enabled
  serviceMonitor:
    enabled: true
    namespace: monitoring
    labels:
      release: prometheus
    path: /metrics

auth:
  enabled: true
  timeout: 10s
  operatorjwt:
    configMap:
      name: nats-credentials
      key: Operator.jwt
  systemAccount: ACWXVSUJK5S7PXX5
  resolver:
    type: full
    operator: eyJ0eXAwjAg
    systemAccount: ACWXXX5
    store:
      dir: /etc/nats-config/accounts/jwt
      size: 10Gi

    resolverPreload:
      ACWXX5: eyJF0Cg

websocket:
  enabled: true
  port: 443
  allowedOrigins:
    - https://example.com
    - https://console.example.com
    - https://kubedb.example.com
    - https://grafana.example.com
  tls:
    secret:
      name: tls-cert
    cert: tls.crt
    key: tls.key
    timeout: 60s
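
To double-check what these values render into before applying them, the chart can be templated locally with the same release name and values file; the ConfigMap in the output holds the nats.conf that the servers will load:

helm template -n bb nats-server nats/nats -f 2021/nats_cluster_values_qa.yaml > rendered.yaml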

tamalsaha avatar May 05 '22 20:05 tamalsaha

I did another round of restarts to turn on debug and trace level logging. This time all the pods became healthy.

nats-0.log nats-1.log nats-2.log
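
For reference, with this chart that is just a matter of flipping the existing logging block in the values above:

nats:
  logging:
    debug: true
    trace: true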

tamalsaha avatar May 05 '22 22:05 tamalsaha

Will loop in @wallyqs and @caleb to take a look at the helm chart.

derekcollison avatar May 06 '22 05:05 derekcollison

This is the final generated config used by nats-server.

/etc # cat /etc/nats-config/nats.conf
# NATS Clients Port
port: 4222

# PID file shared with configuration reloader.
pid_file: "/var/run/nats/nats.pid"

###############
#             #
# Monitoring  #
#             #
###############
http: 8222
server_name:$POD_NAME
#####################
#                   #
# TLS Configuration #
#                   #
#####################
tls {
    cert_file: /etc/nats-certs/clients/tls-cert/tls.crt
    key_file:  /etc/nats-certs/clients/tls-cert/tls.key
    timeout: 60s
}
###################################
#                                 #
# NATS JetStream                  #
#                                 #
###################################
jetstream {
  max_mem: 1Gi
  store_dir: /nats/jetstream

  max_file:50Gi
}
###################################
#                                 #
# NATS Full Mesh Clustering Setup #
#                                 #
###################################
cluster {
  port: 6222
  name: nats

  routes = [
    nats://nats-server-0.nats-server.bb:6222,nats://nats-server-1.nats-server.bb:6222,nats://nats-server-2.nats-server.bb:6222,

  ]
  cluster_advertise: $CLUSTER_ADVERTISE

  connect_retries: 120
}
max_payload: 32Mb
lame_duck_grace_period: 10s
lame_duck_duration: 30s
##################
#                #
# Websocket      #
#                #
##################
websocket {
  port: 443

    tls {
    cert_file: /etc/nats-certs/ws/tls-cert/tls.crt
    key_file: /etc/nats-certs/ws/tls-cert/tls.key
    }
  same_origin: false
  allowed_origins: ["https://example.com","https://console.example.com","https://kubedb.example.com","https://grafana.example.com"]
}
##################
#                #
# Authorization  #
#                #
##################
        authorization {
          timeout: 10s
         }
        operator: eyJ0ewjAg
        system_account: ACWXVPXX5

      resolver: {
        type: full
        dir: "/etc/nats-config/accounts/jwt"

        allow_delete: false

        interval: "2m"
      }
  resolver_preload: {"ACWF0Cg"}
system_account: ACWXJPXX5
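
As a side note, a quick way to rule out plain syntax problems in this rendered file is the server's config test mode, run inside one of the pods where the file and the referenced cert paths exist:

kubectl exec -n bb nats-server-0 -c nats -- nats-server -c /etc/nats-config/nats.conf -t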

tamalsaha avatar May 06 '22 07:05 tamalsaha

Looped in a few folks to peek at all the configs.

derekcollison avatar May 06 '22 13:05 derekcollison

Requests/limits look a little low. I'd recommend a request of around 2 CPUs and 8Gi of RAM for a production setup.
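
In chart values that would look roughly like this (sizes straight from the recommendation above; tune to your actual load):

nats:
  resources:
    requests:
      cpu: "2"
      memory: 8Gi
    limits:
      memory: 8Gi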

What is the storage system backing this? Is it fast block storage? We don't recommend shared filesystems like NFS.
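
If the PVCs are on networked storage, it is also worth pinning the JetStream volume to a fast block-storage class explicitly; with this chart that should just be a storageClassName under the existing fileStorage block (the class name here is a placeholder):

nats:
  jetstream:
    fileStorage:
      enabled: true
      size: 50Gi
      storageClassName: fast-block-storage  # placeholder; use your provider's block-storage class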

caleblloyd avatar May 06 '22 15:05 caleblloyd

This is our QA instance, so it is underpowered / under-provisioned. The production instance has 6 CPUs and 14 GB of RAM.

The storage backend is Linode's NVMe-backed cloud storage.

tamalsaha avatar May 06 '22 16:05 tamalsaha

@tamalsaha Closing since this is quite old, including the server version. Feel free to reopen/open a new issue if you have any more questions.

bruth avatar May 04 '23 20:05 bruth