helm-charts
Worker pods restarting due to failing liveness probes
Issue
I am trying to run this chart with my macOS/Docker/K3s setup and I'm facing a problem where all the worker pods (both celery and nautobot) constantly restart due to failed liveness/readiness probes. I have tried both v1.2.0 and v1.1.15 (the following outputs are from v1.2.0). I have read #55 and #57 but I don't think they apply, as I'm running an essentially default configuration and have also set nautobot.redis.ssl to false. I also don't think this is an OOM scenario: in the many times I ran kubectl describe pod I never saw the OOM-killed messages mentioned in #55, and there's still 2GB of RAM headroom on the cluster.
Configuration
postgresql:
  postgresqlPassword: "password"
redis:
  auth:
    password: "password"
nautobot:
  redis:
    ssl: false
Debug output
Celery worker output:
k logs nautobot-celery-worker-d65589fbc-swk7t -f
-------------- celery@nautobot-celery-worker-d65589fbc-swk7t v5.2.3 (dawn-chorus)
--- ***** -----
-- ******* ---- Linux-5.10.76-linuxkit-x86_64-with-glibc2.31 2022-02-01 12:05:33
- *** --- * ---
- ** ---------- [config]
- ** ---------- .> app: nautobot:0x4002dd5670
- ** ---------- .> transport: redis://:**@nautobot-redis-headless:6379/0
- ** ---------- .> results: redis://:**@nautobot-redis-headless:6379/0
- *** --- * --- .> concurrency: 5 (prefork)
-- ******* ---- .> task events: OFF (enable -E to monitor tasks in this worker)
--- ***** -----
-------------- [queues]
.> celery exchange=celery(direct) key=celery
[tasks]
. nautobot.extras.datasources.git.pull_git_repository_and_refresh_data
. nautobot.extras.jobs.run_job
. nautobot.extras.jobs.scheduled_job_handler
. nautobot.extras.tasks.delete_custom_field_data
. nautobot.extras.tasks.process_webhook
. nautobot.extras.tasks.provision_field
. nautobot.extras.tasks.update_custom_field_choice_data
. nautobot.utilities.tasks.get_releases
[2022-02-01 12:05:35,156: INFO/MainProcess] Connected to redis://:**@nautobot-redis-headless:6379/0
[2022-02-01 12:05:35,176: INFO/MainProcess] mingle: searching for neighbors
[2022-02-01 12:05:36,216: INFO/MainProcess] mingle: all alone
[2022-02-01 12:05:36,270: INFO/MainProcess] celery@nautobot-celery-worker-d65589fbc-swk7t ready.
[2022-02-01 12:05:39,746: INFO/MainProcess] sync with celery@nautobot-celery-worker-d65589fbc-dpjk2
worker: Warm shutdown (MainProcess)
Nautobot worker output:
k logs nautobot-68bcb95f6b-5tdtw -f
Performing database migrations...
Operations to perform:
Apply all migrations: admin, auth, circuits, contenttypes, database, dcim, django_celery_beat, extras, ipam, sessions, social_django, taggit, tenancy, users, virtualization
Running migrations:
No migrations to apply.
Generating cable paths...
Found no missing circuit termination paths; skipping
Found no missing console port paths; skipping
Found no missing console server port paths; skipping
Found no missing interface paths; skipping
Found no missing power feed paths; skipping
Found no missing power outlet paths; skipping
Found no missing power port paths; skipping
Finished.
Collecting static files...
974 static files copied to '/opt/nautobot/static'.
Removing stale content types...
Removing expired sessions...
⏳ Running initial systems check...
System check identified some issues:
WARNINGS:
?: (security.W004) You have not set a value for the SECURE_HSTS_SECONDS setting. If your entire site is served only over SSL, you may want to consider setting a value and enabling HTTP Strict Transport Security. Be sure to read the documentation first; enabling HSTS carelessly can cause serious, irreversible problems.
?: (security.W008) Your SECURE_SSL_REDIRECT setting is not set to True. Unless your site should be available over both SSL and non-SSL connections, you may want to either set this setting True or configure a load balancer or reverse-proxy server to redirect all connections to HTTPS.
?: (security.W012) SESSION_COOKIE_SECURE is not set to True. Using a secure-only session cookie makes it more difficult for network traffic sniffers to hijack user sessions.
?: (security.W016) You have 'django.middleware.csrf.CsrfViewMiddleware' in your MIDDLEWARE, but you have not set CSRF_COOKIE_SECURE to True. Using a secure-only CSRF cookie makes it more difficult for network traffic sniffers to steal the CSRF token.
System check identified 4 issues (0 silenced).
kubectl describe pod output for a Nautobot pod:
k describe pod nautobot-68bcb95f6b-5tdtw
Name: nautobot-68bcb95f6b-5tdtw
Namespace: default
Priority: 0
Node: k3d-k3d-rancher-server-0/172.20.0.2
Start Time: Tue, 01 Feb 2022 12:58:56 +0100
Labels: app.kubernetes.io/component=nautobot
app.kubernetes.io/instance=nautobot
app.kubernetes.io/managed-by=Helm
app.kubernetes.io/name=nautobot
helm.sh/chart=nautobot-1.2.0
pod-template-hash=68bcb95f6b
Annotations: seccomp.security.alpha.kubernetes.io/pod: runtime/default
Status: Running
IP: 10.42.0.49
IPs:
IP: 10.42.0.49
Controlled By: ReplicaSet/nautobot-68bcb95f6b
Containers:
nautobot:
Container ID: containerd://e7d1478996a8d51fa23572314ff4d6b7b350f8c7a590cc5e48d4cdc9f62a2b20
Image: ghcr.io/nautobot/nautobot:1.2.4-py3.9
Image ID: ghcr.io/nautobot/nautobot@sha256:372e6e638c62ca388cfce42e96b8d9450c00210562c1ce2de04af49d4ed83536
Ports: 8443/TCP, 8080/TCP
Host Ports: 0/TCP, 0/TCP
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: Error
Exit Code: 137
Started: Tue, 01 Feb 2022 13:36:17 +0100
Finished: Tue, 01 Feb 2022 13:37:46 +0100
Ready: False
Restart Count: 13
Limits:
cpu: 600m
memory: 8704M
Requests:
cpu: 100m
memory: 1280M
Liveness: http-get http://:http/health/ delay=30s timeout=5s period=10s #success=1 #failure=3
Readiness: http-get http://:http/health/ delay=30s timeout=5s period=10s #success=1 #failure=3
Environment Variables from:
nautobot-env ConfigMap Optional: false
nautobot-env Secret Optional: false
Environment: <none>
Mounts:
/opt/nautobot/git from git-repos (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-295rr (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
git-repos:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
SizeLimit: <unset>
kube-api-access-295rr:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 39m default-scheduler Successfully assigned default/nautobot-68bcb95f6b-5tdtw to k3d-k3d-rancher-server-0
Normal Pulled 39m kubelet Successfully pulled image "ghcr.io/nautobot/nautobot:1.2.4-py3.9" in 518.130084ms
Normal Pulling 38m (x2 over 39m) kubelet Pulling image "ghcr.io/nautobot/nautobot:1.2.4-py3.9"
Normal Pulled 38m kubelet Successfully pulled image "ghcr.io/nautobot/nautobot:1.2.4-py3.9" in 534.001792ms
Normal Created 38m (x2 over 39m) kubelet Created container nautobot
Normal Started 38m (x2 over 39m) kubelet Started container nautobot
Normal Killing 37m (x2 over 38m) kubelet Container nautobot failed liveness probe, will be restarted
Warning Unhealthy 19m (x50 over 39m) kubelet Readiness probe failed: Get "http://10.42.0.49:8080/health/": dial tcp 10.42.0.49:8080: connect: connection refused
Warning Unhealthy 9m34s (x36 over 39m) kubelet Liveness probe failed: Get "http://10.42.0.49:8080/health/": dial tcp 10.42.0.49:8080: connect: connection refused
Warning BackOff 4m29s (x90 over 29m) kubelet Back-off restarting failed container
kubectl describe pod output for a celery worker pod:
k describe pod nautobot-celery-worker-d65589fbc-dpjk2
Name: nautobot-celery-worker-d65589fbc-dpjk2
Namespace: default
Priority: 0
Node: k3d-k3d-rancher-server-0/172.20.0.2
Start Time: Tue, 01 Feb 2022 12:59:08 +0100
Labels: app.kubernetes.io/component=nautobot-celery-worker
app.kubernetes.io/instance=nautobot
app.kubernetes.io/managed-by=Helm
app.kubernetes.io/name=nautobot
helm.sh/chart=nautobot-1.2.0
pod-template-hash=d65589fbc
Annotations: seccomp.security.alpha.kubernetes.io/pod: runtime/default
Status: Running
IP: 10.42.0.55
IPs:
IP: 10.42.0.55
Controlled By: ReplicaSet/nautobot-celery-worker-d65589fbc
Containers:
nautobot-celery:
Container ID: containerd://1fc62cd4e4ca2daaed517bd19543983508acbd1631fc0c3e33bdda4d69d4e318
Image: ghcr.io/nautobot/nautobot:1.2.4-py3.9
Image ID: ghcr.io/nautobot/nautobot@sha256:372e6e638c62ca388cfce42e96b8d9450c00210562c1ce2de04af49d4ed83536
Port: <none>
Host Port: <none>
Command:
nautobot-server
celery
worker
--loglevel
$(NAUTOBOT_LOG_LEVEL)
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: Completed
Exit Code: 0
Started: Tue, 01 Feb 2022 13:31:21 +0100
Finished: Tue, 01 Feb 2022 13:34:20 +0100
Ready: False
Restart Count: 9
Limits:
cpu: 3328m
memory: 6656M
Requests:
cpu: 400m
memory: 1G
Liveness: exec [bash -c nautobot-server celery inspect ping --destination celery@$HOSTNAME] delay=30s timeout=10s period=60s #success=1 #failure=3
Readiness: exec [bash -c nautobot-server celery inspect ping --destination celery@$HOSTNAME] delay=30s timeout=10s period=60s #success=1 #failure=3
Environment Variables from:
nautobot-env ConfigMap Optional: false
nautobot-env Secret Optional: false
Environment: <none>
Mounts:
/opt/nautobot/git from git-repos (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-cjq2w (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
git-repos:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
SizeLimit: <unset>
kube-api-access-cjq2w:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 40m default-scheduler 0/1 nodes are available: 1 Insufficient memory.
Warning FailedScheduling 40m (x1 over 40m) default-scheduler 0/1 nodes are available: 1 Insufficient memory.
Normal Scheduled 40m default-scheduler Successfully assigned default/nautobot-celery-worker-d65589fbc-dpjk2 to k3d-k3d-rancher-server-0
Normal Pulled 40m kubelet Successfully pulled image "ghcr.io/nautobot/nautobot:1.2.4-py3.9" in 514.728042ms
Normal Pulled 37m kubelet Successfully pulled image "ghcr.io/nautobot/nautobot:1.2.4-py3.9" in 513.247459ms
Normal Killing 34m (x2 over 37m) kubelet Container nautobot-celery failed liveness probe, will be restarted
Normal Pulling 34m (x3 over 40m) kubelet Pulling image "ghcr.io/nautobot/nautobot:1.2.4-py3.9"
Normal Pulled 34m kubelet Successfully pulled image "ghcr.io/nautobot/nautobot:1.2.4-py3.9" in 514.411292ms
Normal Created 34m (x3 over 40m) kubelet Created container nautobot-celery
Normal Started 34m (x3 over 40m) kubelet Started container nautobot-celery
Warning Unhealthy 10m (x25 over 39m) kubelet Liveness probe failed:
Warning Unhealthy 5m7s (x30 over 39m) kubelet Readiness probe failed:
Warning BackOff 12s (x49 over 16m) kubelet Back-off restarting failed container
Can you post the logs from the crashed containers using kubectl logs --previous? From the describe output it looks like the containers crashed/errored out rather than OOMing. Are you using Docker Desktop? I'm not a k3s user so I'm sorry I can't speak knowledgeably about that; however, on macOS you are likely running a VM to run the containers (whether you know it or not). In Docker Desktop the specs for the VM are exposed in the UI under Preferences/Resources/Advanced. It's possible these VM specs need to be increased. If you're not using Docker Desktop you'll have to figure out how to bump those up.
The logs posted were collected with -f; the end of each snippet is the point where that command terminated, and I assume the containers did so as well. Running with --previous yields the same results as those posted above. I am using Docker Desktop (currently 4.3.2). I had bumped the RAM for the Docker Desktop VM to 8GB before collecting the above logs. The resource usage metrics in Rancher suggest this is plenty for both RAM and CPU cores.
I ran into the same issue. While debugging I noticed it takes a while for uWSGI to actually start serving HTTP, almost as long as initialDelaySeconds + periodSeconds * failureThreshold. Tweaking the livenessProbe settings made it work.
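For reference, something along these lines is roughly the kind of values override I mean; the exact key names are an assumption on my part, so check them against the chart's values.yaml:

# Hypothetical values.yaml override -- the livenessProbe/readinessProbe keys
# under nautobot are assumed, verify against the chart's values.yaml.
nautobot:
  livenessProbe:
    initialDelaySeconds: 60   # give uWSGI more time before the first check
    periodSeconds: 10
    failureThreshold: 6       # tolerate a slower startup before restarting
  readinessProbe:
    initialDelaySeconds: 60
    periodSeconds: 10
    failureThreshold: 6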
By the way: I'd suggest using a startupProbe instead of initialDelaySeconds; that's the recommended approach these days.
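As a rough sketch of what that could look like (plain container-spec YAML, not chart values; the /health/ path and http port mirror the probes in the describe output above, and the numbers are illustrative, not tested values):

# Sketch: a startupProbe gates the liveness probe, so no initialDelaySeconds
# is needed once the application has come up.
startupProbe:
  httpGet:
    path: /health/
    port: http
  periodSeconds: 10
  failureThreshold: 30    # allow up to ~5 minutes for uWSGI to start serving
livenessProbe:
  httpGet:
    path: /health/
    port: http
  periodSeconds: 10
  failureThreshold: 3     # only checked after the startupProbe has succeeded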
Is this still relevant, or should I proceed with closing it?
Not relevant for me at this time, so I guess the issue can be closed. Does it make sense to take into account @jonkerj's suggestion?
Sure, I will create another one for that. I will close this one for now. Thanks a lot!