
Back-off restarting failed container drbd-shutdown-guard (2.0.1)

Open vasyakrg opened this issue 2 years ago • 16 comments

After upgrading to version 2.0.1, the drbd-shutdown-guard container is stuck in back-off restart with the following log:

2023/03/13 10:20:03 failed: failed to reload systemd
2023/03/13 10:20:52 Running drbd-shutdown-guard version v1.0.0
2023/03/13 10:20:52 Creating service directory '/run/drbd-shutdown-guard'
2023/03/13 10:20:52 Copying drbdsetup to service directory
2023/03/13 10:20:52 Copying drbd-shutdown-guard to service directory
2023/03/13 10:20:52 Optionally: relabel service directory for SELinux
2023/03/13 10:20:52 ignoring error when setting selinux label: exit status 127
2023/03/13 10:20:52 Creating systemd unit drbd-shutdown-guard.service in /run/systemd/system
2023/03/13 10:20:52 Reloading systemd
Error: failed to reload systemd
Usage:
  drbd-shutdown-guard install [flags]

Flags:
  -h, --help   help for install

2023/03/13 10:20:52 failed: failed to reload systemd

This happens in the LinstorSatellite pod.

OS on host:

  • Release : Ubuntu 22.04.2 LTS
  • Kernel : Linux 5.15.0-67-generic x86_64

vasyakrg avatar Mar 13 '23 10:03 vasyakrg

I have this problem on Debian 11

andlf avatar Mar 13 '23 10:03 andlf

I can't reproduce this on a simple Ubuntu 22.04 cluster; I used containerd and kubeadm to create it. Is there anything in any of the system logs? Perhaps AppArmor is interfering?

At the step where it fails, the init container executes systemctl daemon-reload. Perhaps there is a permission error when running that from inside the container.
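
For reference, a rough sketch of what that init container needs from the host for systemctl daemon-reload to succeed. This is not the operator's actual manifest; the entrypoint path, volume names and mounts are assumptions for illustration:

# Sketch only: approximate requirements of the init container, not the generated pod spec.
initContainers:
- name: drbd-shutdown-guard
  command: ["/drbd-shutdown-guard", "install"]   # assumed entrypoint path; the install subcommand matches the usage output above
  securityContext:
    privileged: true                             # needs to write the unit file into /run/systemd/system on the host
  volumeMounts:
  - name: run-systemd                            # assumed volume name: host /run/systemd, where the unit is created
    mountPath: /run/systemd
  - name: run-dbus                               # assumed volume name: host /run/dbus, if the reload goes through the systemd D-Bus API
    mountPath: /run/dbus

If AppArmor or the container runtime blocks access to the systemd D-Bus socket or to /run/systemd/system, this reload step is exactly where it would fail.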

In any case, a workaround for now:

apiVersion: piraeus.io/v1
kind: LinstorSatelliteConfiguration
metadata:
  name: no-drbd-shutdown-guard
spec:
  patches:
  - target:
      kind: Pod
      name: satellite
    patch: |
      apiVersion: v1
      kind: Pod
      metadata:
        name: satellite
      spec:
        initContainers:
        - name: drbd-shutdown-guard
          $patch: delete
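
For context, this patch relies on Kubernetes strategic-merge-patch semantics: the $patch: delete directive removes the drbd-shutdown-guard init container from the generated satellite pod while leaving the rest of the pod spec untouched. Note that with the init container gone, the shutdown guard is simply not installed on that node.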

WanzenBug avatar Mar 13 '23 11:03 WanzenBug

Yes, it works: the cluster comes up and is ready. I used rke2 to create the k8s cluster.

vasyakrg avatar Mar 13 '23 12:03 vasyakrg

The logs are clean; they only show the container starting and stopping:

Mar 13 12:56:57 rke2-node1 systemd[1]: cri-containerd-81ef8139d96334565e7ad0c6a7255f767b6f76438dcb4a42c966d66cb1e886e7.scope: Deactivated successfully.
Mar 13 12:56:57 rke2-node1 systemd[1]: run-k3s-containerd-io.containerd.runtime.v2.task-k8s.io-81ef8139d96334565e7ad0c6a7255f767b6f76438dcb4a42c966d66cb1e886e7-rootfs.mount: Deactivated successfully.
Mar 13 12:57:39 rke2-node1 systemd[1]: Started libcontainer container c09dc34921d87ce927da8fd87d55f424ef778425f6008f3da9a133040b2f20a4.
Mar 13 12:57:39 rke2-node1 systemd[1]: cri-containerd-c09dc34921d87ce927da8fd87d55f424ef778425f6008f3da9a133040b2f20a4.scope: Deactivated successfully.
Mar 13 12:57:40 rke2-node1 systemd[1]: run-k3s-containerd-io.containerd.runtime.v2.task-k8s.io-c09dc34921d87ce927da8fd87d55f424ef778425f6008f3da9a133040b2f20a4-rootfs.mount: Deactivated successfully.

vasyakrg avatar Mar 13 '23 12:03 vasyakrg

Any special security configuration? I just ran the rke2 setup with default settings and it seemed to start fine :/

WanzenBug avatar Mar 13 '23 13:03 WanzenBug

Nope, a default install done by hand.

vasyakrg avatar Mar 13 '23 13:03 vasyakrg

Hi all,

how can v2.0.1 be used on systems without systemd, like Talos? Isn't it a bad idea to add an external OS dependency at all?

Nosmoht avatar Mar 13 '23 13:03 Nosmoht

See the config "patch" above. You may also need to remove the host mounts (see the sketch at the end of this comment).

Isn't it a bad idea to add an external OS dependency at all?

The dependency was deemed worth it: most users will have systemd installed, and the shutdown guard solves an issue many users run into when they shut down a node without first evicting all pods.
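
For systemd-less hosts such as Talos, a variant of the workaround above can also drop the host mounts that back the init container. A sketch only; the volume name run-systemd is an assumption, so check the generated satellite pod for the actual volume names before applying:

apiVersion: piraeus.io/v1
kind: LinstorSatelliteConfiguration
metadata:
  name: no-drbd-shutdown-guard
spec:
  patches:
  - target:
      kind: Pod
      name: satellite
    patch: |
      apiVersion: v1
      kind: Pod
      metadata:
        name: satellite
      spec:
        initContainers:
        - name: drbd-shutdown-guard
          $patch: delete
        volumes:
        # assumed volume name; verify against the generated pod before applying
        - name: run-systemd
          $patch: delete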

WanzenBug avatar Mar 13 '23 13:03 WanzenBug

And how can I run the operator and create a cluster on a system with docker.io?

The k8s cluster is brought up via RKE (the first CLI version from Rancher), where all cluster components also run in containers.

I added the mounts to the k8s configuration; the LINSTOR cluster comes up and even creates disks, but they are read-only:

kubelet:
  extra_binds:
    - "/usr/lib/modules:/usr/lib/modules"
    - "/var/lib/piraeus-datastore:/var/lib/piraeus-datastore"
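
For reference, in an RKE1 cluster.yml those binds sit under services; a sketch, reusing the same paths as above:

# cluster.yml (RKE1): sketch of where extra_binds belongs
services:
  kubelet:
    extra_binds:
      - "/usr/lib/modules:/usr/lib/modules"
      - "/var/lib/piraeus-datastore:/var/lib/piraeus-datastore"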

vasyakrg avatar Mar 13 '23 13:03 vasyakrg

What exactly is read-only? The volumes created using a Piraeus storage class? That seems to be a separate issue; please create a new issue for it.

WanzenBug avatar Mar 13 '23 15:03 WanzenBug

I got the same problem: exactly the same error in the drbd-shutdown-guard log. K8s version 1.28.5, Cilium CNI. I tried with both cri-o and containerd.

Log from the operator pod:

2024-02-16T09:54:27Z ERROR Reconciler error {"controller": "linstorcluster", "controllerGroup": "piraeus.io", "controllerKind": "LinstorCluster", "LinstorCluster": {"name":"linstorcluster"}, "namespace": "", "name": "linstorcluster", "reconcileID": "dfe748f6-b19d-4e25-945f-a69660e3753f", "error": "context deadline exceeded"}

VadimkP avatar Feb 17 '24 17:02 VadimkP

What host OS are you using?

WanzenBug avatar Feb 19 '24 06:02 WanzenBug

What host OS are you using?

Ubuntu 20.04

VadimkP avatar Feb 19 '24 06:02 VadimkP

:thinking: Also using RKE?

We can probably make shutdown-guard ignore these kinds of errors, but I want to make sure it is active in as many cases as possible, as it is a very useful feature...

WanzenBug avatar Feb 19 '24 06:02 WanzenBug

No RKE. Just a simple k8s cluster of three nodes for internal tests, with control-plane and worker roles combined.

VadimkP avatar Feb 19 '24 07:02 VadimkP

Hello, exactly the same situation on Debian 12 + k8s + cri-o. Client Version: v1.29.2, Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3, Server Version: v1.29.2. Fresh install on a clean machine.

The workaround from comment https://github.com/piraeusdatastore/piraeus-operator/issues/426#issuecomment-1465934727 fixed it.

danoh avatar Mar 06 '24 09:03 danoh