
Worker pods restarting due to failing liveness probes

Kircheneer opened this issue 3 years ago

Issue

I am trying to run this chart with my macOS/Docker/k3s setup and I'm facing a problem where all the worker pods (both celery and nautobot) constantly restart due to failed liveness/readiness probes. I have tried both v1.2.0 and v1.1.15 (the following outputs are from v1.2.0). I have read #55 and #57, but I don't think they apply: I'm running a very default configuration and have also set nautobot.redis.ssl to false. I also don't think this is an OOM scenario, as in the many times I ran kubectl describe pod I never saw the OOM-killed messages mentioned in #55, and there's still 2 GB of RAM headroom on the cluster.

Configuration

postgresql:
  postgresqlPassword: "password"
redis:
  auth:
    password: "password"
nautobot:
  redis:
    ssl: false

Debug output

Celery worker output:

k logs nautobot-celery-worker-d65589fbc-swk7t -f
 
 -------------- celery@nautobot-celery-worker-d65589fbc-swk7t v5.2.3 (dawn-chorus)
--- ***** ----- 
-- ******* ---- Linux-5.10.76-linuxkit-x86_64-with-glibc2.31 2022-02-01 12:05:33
- *** --- * --- 
- ** ---------- [config]
- ** ---------- .> app:         nautobot:0x4002dd5670
- ** ---------- .> transport:   redis://:**@nautobot-redis-headless:6379/0
- ** ---------- .> results:     redis://:**@nautobot-redis-headless:6379/0
- *** --- * --- .> concurrency: 5 (prefork)
-- ******* ---- .> task events: OFF (enable -E to monitor tasks in this worker)
--- ***** ----- 
 -------------- [queues]
                .> celery           exchange=celery(direct) key=celery
                

[tasks]
  . nautobot.extras.datasources.git.pull_git_repository_and_refresh_data
  . nautobot.extras.jobs.run_job
  . nautobot.extras.jobs.scheduled_job_handler
  . nautobot.extras.tasks.delete_custom_field_data
  . nautobot.extras.tasks.process_webhook
  . nautobot.extras.tasks.provision_field
  . nautobot.extras.tasks.update_custom_field_choice_data
  . nautobot.utilities.tasks.get_releases

[2022-02-01 12:05:35,156: INFO/MainProcess] Connected to redis://:**@nautobot-redis-headless:6379/0
[2022-02-01 12:05:35,176: INFO/MainProcess] mingle: searching for neighbors
[2022-02-01 12:05:36,216: INFO/MainProcess] mingle: all alone
[2022-02-01 12:05:36,270: INFO/MainProcess] celery@nautobot-celery-worker-d65589fbc-swk7t ready.
[2022-02-01 12:05:39,746: INFO/MainProcess] sync with celery@nautobot-celery-worker-d65589fbc-dpjk2

worker: Warm shutdown (MainProcess)

Nautobot worker output:

k logs nautobot-68bcb95f6b-5tdtw -f
Performing database migrations...
Operations to perform:
  Apply all migrations: admin, auth, circuits, contenttypes, database, dcim, django_celery_beat, extras, ipam, sessions, social_django, taggit, tenancy, users, virtualization
Running migrations:
  No migrations to apply.

Generating cable paths...
Found no missing circuit termination paths; skipping
Found no missing console port paths; skipping
Found no missing console server port paths; skipping
Found no missing interface paths; skipping
Found no missing power feed paths; skipping
Found no missing power outlet paths; skipping
Found no missing power port paths; skipping
Finished.

Collecting static files...

974 static files copied to '/opt/nautobot/static'.

Removing stale content types...

Removing expired sessions...

⏳ Running initial systems check...
System check identified some issues:

WARNINGS:
?: (security.W004) You have not set a value for the SECURE_HSTS_SECONDS setting. If your entire site is served only over SSL, you may want to consider setting a value and enabling HTTP Strict Transport Security. Be sure to read the documentation first; enabling HSTS carelessly can cause serious, irreversible problems.
?: (security.W008) Your SECURE_SSL_REDIRECT setting is not set to True. Unless your site should be available over both SSL and non-SSL connections, you may want to either set this setting True or configure a load balancer or reverse-proxy server to redirect all connections to HTTPS.
?: (security.W012) SESSION_COOKIE_SECURE is not set to True. Using a secure-only session cookie makes it more difficult for network traffic sniffers to hijack user sessions.
?: (security.W016) You have 'django.middleware.csrf.CsrfViewMiddleware' in your MIDDLEWARE, but you have not set CSRF_COOKIE_SECURE to True. Using a secure-only CSRF cookie makes it more difficult for network traffic sniffers to steal the CSRF token.

System check identified 4 issues (0 silenced).

kubectl describe pod output for a nautobot worker pod:

k describe pod nautobot-68bcb95f6b-5tdtw 
Name:         nautobot-68bcb95f6b-5tdtw
Namespace:    default
Priority:     0
Node:         k3d-k3d-rancher-server-0/172.20.0.2
Start Time:   Tue, 01 Feb 2022 12:58:56 +0100
Labels:       app.kubernetes.io/component=nautobot
              app.kubernetes.io/instance=nautobot
              app.kubernetes.io/managed-by=Helm
              app.kubernetes.io/name=nautobot
              helm.sh/chart=nautobot-1.2.0
              pod-template-hash=68bcb95f6b
Annotations:  seccomp.security.alpha.kubernetes.io/pod: runtime/default
Status:       Running
IP:           10.42.0.49
IPs:
  IP:           10.42.0.49
Controlled By:  ReplicaSet/nautobot-68bcb95f6b
Containers:
  nautobot:
    Container ID:   containerd://e7d1478996a8d51fa23572314ff4d6b7b350f8c7a590cc5e48d4cdc9f62a2b20
    Image:          ghcr.io/nautobot/nautobot:1.2.4-py3.9
    Image ID:       ghcr.io/nautobot/nautobot@sha256:372e6e638c62ca388cfce42e96b8d9450c00210562c1ce2de04af49d4ed83536
    Ports:          8443/TCP, 8080/TCP
    Host Ports:     0/TCP, 0/TCP
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    137
      Started:      Tue, 01 Feb 2022 13:36:17 +0100
      Finished:     Tue, 01 Feb 2022 13:37:46 +0100
    Ready:          False
    Restart Count:  13
    Limits:
      cpu:     600m
      memory:  8704M
    Requests:
      cpu:      100m
      memory:   1280M
    Liveness:   http-get http://:http/health/ delay=30s timeout=5s period=10s #success=1 #failure=3
    Readiness:  http-get http://:http/health/ delay=30s timeout=5s period=10s #success=1 #failure=3
    Environment Variables from:
      nautobot-env  ConfigMap  Optional: false
      nautobot-env  Secret     Optional: false
    Environment:    <none>
    Mounts:
      /opt/nautobot/git from git-repos (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-295rr (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  git-repos:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     
    SizeLimit:  <unset>
  kube-api-access-295rr:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason     Age                   From               Message
  ----     ------     ----                  ----               -------
  Normal   Scheduled  39m                   default-scheduler  Successfully assigned default/nautobot-68bcb95f6b-5tdtw to k3d-k3d-rancher-server-0
  Normal   Pulled     39m                   kubelet            Successfully pulled image "ghcr.io/nautobot/nautobot:1.2.4-py3.9" in 518.130084ms
  Normal   Pulling    38m (x2 over 39m)     kubelet            Pulling image "ghcr.io/nautobot/nautobot:1.2.4-py3.9"
  Normal   Pulled     38m                   kubelet            Successfully pulled image "ghcr.io/nautobot/nautobot:1.2.4-py3.9" in 534.001792ms
  Normal   Created    38m (x2 over 39m)     kubelet            Created container nautobot
  Normal   Started    38m (x2 over 39m)     kubelet            Started container nautobot
  Normal   Killing    37m (x2 over 38m)     kubelet            Container nautobot failed liveness probe, will be restarted
  Warning  Unhealthy  19m (x50 over 39m)    kubelet            Readiness probe failed: Get "http://10.42.0.49:8080/health/": dial tcp 10.42.0.49:8080: connect: connection refused
  Warning  Unhealthy  9m34s (x36 over 39m)  kubelet            Liveness probe failed: Get "http://10.42.0.49:8080/health/": dial tcp 10.42.0.49:8080: connect: connection refused
  Warning  BackOff    4m29s (x90 over 29m)  kubelet            Back-off restarting failed container

kubectl describe pod output for a celery worker pod:

k describe pod nautobot-celery-worker-d65589fbc-dpjk2 
Name:         nautobot-celery-worker-d65589fbc-dpjk2
Namespace:    default
Priority:     0
Node:         k3d-k3d-rancher-server-0/172.20.0.2
Start Time:   Tue, 01 Feb 2022 12:59:08 +0100
Labels:       app.kubernetes.io/component=nautobot-celery-worker
              app.kubernetes.io/instance=nautobot
              app.kubernetes.io/managed-by=Helm
              app.kubernetes.io/name=nautobot
              helm.sh/chart=nautobot-1.2.0
              pod-template-hash=d65589fbc
Annotations:  seccomp.security.alpha.kubernetes.io/pod: runtime/default
Status:       Running
IP:           10.42.0.55
IPs:
  IP:           10.42.0.55
Controlled By:  ReplicaSet/nautobot-celery-worker-d65589fbc
Containers:
  nautobot-celery:
    Container ID:  containerd://1fc62cd4e4ca2daaed517bd19543983508acbd1631fc0c3e33bdda4d69d4e318
    Image:         ghcr.io/nautobot/nautobot:1.2.4-py3.9
    Image ID:      ghcr.io/nautobot/nautobot@sha256:372e6e638c62ca388cfce42e96b8d9450c00210562c1ce2de04af49d4ed83536
    Port:          <none>
    Host Port:     <none>
    Command:
      nautobot-server
      celery
      worker
      --loglevel
      $(NAUTOBOT_LOG_LEVEL)
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Tue, 01 Feb 2022 13:31:21 +0100
      Finished:     Tue, 01 Feb 2022 13:34:20 +0100
    Ready:          False
    Restart Count:  9
    Limits:
      cpu:     3328m
      memory:  6656M
    Requests:
      cpu:      400m
      memory:   1G
    Liveness:   exec [bash -c nautobot-server celery inspect ping --destination celery@$HOSTNAME] delay=30s timeout=10s period=60s #success=1 #failure=3
    Readiness:  exec [bash -c nautobot-server celery inspect ping --destination celery@$HOSTNAME] delay=30s timeout=10s period=60s #success=1 #failure=3
    Environment Variables from:
      nautobot-env  ConfigMap  Optional: false
      nautobot-env  Secret     Optional: false
    Environment:    <none>
    Mounts:
      /opt/nautobot/git from git-repos (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-cjq2w (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  git-repos:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     
    SizeLimit:  <unset>
  kube-api-access-cjq2w:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason            Age                  From               Message
  ----     ------            ----                 ----               -------
  Warning  FailedScheduling  40m                  default-scheduler  0/1 nodes are available: 1 Insufficient memory.
  Warning  FailedScheduling  40m (x1 over 40m)    default-scheduler  0/1 nodes are available: 1 Insufficient memory.
  Normal   Scheduled         40m                  default-scheduler  Successfully assigned default/nautobot-celery-worker-d65589fbc-dpjk2 to k3d-k3d-rancher-server-0
  Normal   Pulled            40m                  kubelet            Successfully pulled image "ghcr.io/nautobot/nautobot:1.2.4-py3.9" in 514.728042ms
  Normal   Pulled            37m                  kubelet            Successfully pulled image "ghcr.io/nautobot/nautobot:1.2.4-py3.9" in 513.247459ms
  Normal   Killing           34m (x2 over 37m)    kubelet            Container nautobot-celery failed liveness probe, will be restarted
  Normal   Pulling           34m (x3 over 40m)    kubelet            Pulling image "ghcr.io/nautobot/nautobot:1.2.4-py3.9"
  Normal   Pulled            34m                  kubelet            Successfully pulled image "ghcr.io/nautobot/nautobot:1.2.4-py3.9" in 514.411292ms
  Normal   Created           34m (x3 over 40m)    kubelet            Created container nautobot-celery
  Normal   Started           34m (x3 over 40m)    kubelet            Started container nautobot-celery
  Warning  Unhealthy         10m (x25 over 39m)   kubelet            Liveness probe failed:
  Warning  Unhealthy         5m7s (x30 over 39m)  kubelet            Readiness probe failed:
  Warning  BackOff           12s (x49 over 16m)   kubelet            Back-off restarting failed container

Kircheneer avatar Feb 01 '22 12:02 Kircheneer

Can you post the logs from the crashed containers using kubectl logs --previous? From the describe output it looks like the containers crashed/errored out, they didn't OOM. Are you using Docker Desktop? I'm not a k3s user, so sorry, I can't speak knowledgeably about that; however, on macOS you are likely running a VM to run the containers (whether you know it or not). In Docker Desktop the specs for that VM are exposed in the UI under Preferences/Resources/Advanced. It's possible these VM specs need to be increased. If you're not using Docker Desktop you'll have to figure out how to bump those up.
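
For example, with one of the pod names from the outputs above (using the same k alias for kubectl):

k logs --previous nautobot-68bcb95f6b-5tdtw

The --previous flag returns the logs of the last terminated instance of the container rather than the currently running one, which is what you want after a CrashLoopBackOff.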

nniehoff avatar Feb 01 '22 14:02 nniehoff

The logs posted were collected with -f; the end of each snippet is the point where that command terminated, and I assume the containers did so as well. Running with --previous yields the same results as those posted above. I am using Docker Desktop (currently 4.3.2). I had bumped the RAM for the Docker Desktop VM to 8 GB prior to collecting the above logs, and the resource usage metrics in Rancher suggest this is plenty for both RAM and CPU cores.

Kircheneer avatar Feb 01 '22 14:02 Kircheneer

I ran into the same issue. During my debugging I noticed it takes a while for uWSGI to actually serve HTTP, almost as long as initialDelaySeconds + periodSeconds * failureThreshold (with the probe settings shown above: 30s + 10s * 3 = 60s). Tweaking the livenessProbe settings made it work.
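
As a rough sketch of that kind of tweak, assuming the chart exposes the usual Bitnami-style probe settings under nautobot.livenessProbe/readinessProbe in values.yaml (check your chart version for the exact keys), something like this widens the window before the kubelet starts restarting the container:

nautobot:
  livenessProbe:
    initialDelaySeconds: 90  # wait longer before the first probe
    periodSeconds: 10
    timeoutSeconds: 5
    failureThreshold: 6      # tolerate 6 failures (60s) before a restart
  readinessProbe:
    initialDelaySeconds: 90
    periodSeconds: 10
    timeoutSeconds: 5
    failureThreshold: 6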

By the way: I'd suggest using a startupProbe instead of initialDelaySeconds; that's the recommended approach these days.
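
A minimal sketch of that pattern in raw pod-spec terms (the corresponding values-file keys, if any, depend on the chart version): let a startupProbe absorb the slow uWSGI startup, and keep liveness tight once the app is up, since liveness and readiness probes are disabled until the startupProbe succeeds:

startupProbe:
  httpGet:
    path: /health/
    port: http
  periodSeconds: 10
  failureThreshold: 30  # up to 10s * 30 = 300s for uWSGI to come up
livenessProbe:
  httpGet:
    path: /health/
    port: http
  periodSeconds: 10
  failureThreshold: 3   # no initialDelaySeconds needed once startupProbe gates it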

jonkerj avatar Feb 16 '23 19:02 jonkerj

Is this still relevant? Or should I proceed with closing it?

gertzakis avatar Aug 22 '24 11:08 gertzakis

Not relevant for me at this time, so I guess the issue can be closed. Does it make sense to take into account @jonkerj's suggestion?

Kircheneer avatar Aug 22 '24 11:08 Kircheneer

Sure, I will create another issue for that and close this one for now. Thanks a lot!

gertzakis avatar Aug 22 '24 11:08 gertzakis