drbd-reactor crashing

Open nashant opened this issue 2 years ago • 11 comments

One of my satellite pods is crashlooping. The cause is the drbd-reactor container, which produces only the following logs:

$ k logs -n piraeus-datastore server -c drbd-reactor
Error: main: core did not exit successfully

Caused by:
    sending on a disconnected channel

Any idea?

nashant avatar Feb 26 '23 08:02 nashant

Which version of the image is running? If you are not already using the v1.0.0 image, please try upgrading to it, as it generally has better error reporting.

WanzenBug avatar Feb 27 '23 08:02 WanzenBug

Yup, already using v1.0.0

nashant avatar Feb 28 '23 08:02 nashant

Any thoughts? Can I increase logging somehow?

nashant avatar Mar 02 '23 12:03 nashant

You might be able to add a second entry to the piraeus-op-node-monitoring configmap:

data:
  log.toml: |
    [[log]]
    level = "debug"

WanzenBug avatar Mar 02 '23 13:03 WanzenBug

Experiencing the same, and even with the trace level there is no more info: [image]

RichardSufliarsky avatar Apr 17 '23 09:04 RichardSufliarsky

We noticed that sometimes reactor still discards some log messages, especially the log message when creating the Prometheus socket. I assume in both cases this is related to reactor for some reason not being able to bind to [::]:9942. As for why, I cannot tell. Perhaps some strange network configuration with disabling IPv6 on the kernel level?

WanzenBug avatar Apr 17 '23 09:04 WanzenBug

Correct, we have IPv6 disabled in the kernel:

GRUB_CMDLINE_LINUX="rd.lvm.lv=rhel/root rhgb quiet ipv6.disable=1"

I am also using ipFamilyPolicy: SingleStack when creating LinstorCluster:

    - target:
        kind: Service
        name: linstor-controller
      patch: |-
        apiVersion: v1
        kind: Service
        metadata:
          name: linstor-controller
        spec:
          ipFamilyPolicy: SingleStack

RichardSufliarsky avatar Apr 17 '23 09:04 RichardSufliarsky

Then you probably need to patch the reactor config to use 0.0.0.0:9942 instead of the anylocal [::] address. Binding to [::] normally works fine even on IPv4-only systems, but disabling the IPv6 subsystem outright tends to break it.
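
To check whether a node is affected, a minimal sketch like the following can be run on the host. It is not part of drbd-reactor or the operator; the helper name can_bind is hypothetical. It only tests whether a TCP socket can be bound to the IPv4 and IPv6 wildcard addresses, and it uses an ephemeral port (0) rather than 9942 so it does not collide with a running reactor.

```python
import socket


def can_bind(family, host, port=0):
    """Return True if a TCP socket of the given family can bind to host:port.

    Hypothetical helper for illustration. Port 0 asks the kernel for any
    free port, so the check does not conflict with a service on 9942.
    """
    try:
        with socket.socket(family, socket.SOCK_STREAM) as sock:
            sock.bind((host, port))
        return True
    except OSError:
        # Covers an unsupported address family (for example when the kernel
        # was booted with ipv6.disable=1) as well as ordinary bind errors.
        return False


if __name__ == "__main__":
    print("0.0.0.0 bindable:", can_bind(socket.AF_INET, "0.0.0.0"))
    print("[::] bindable:   ", can_bind(socket.AF_INET6, "::"))
```

If the second line reports False while the first reports True, the node cannot bind to [::], which would be consistent with the explanation above.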

WanzenBug avatar Apr 17 '23 10:04 WanzenBug

Could you please point me to where I can change this globally via the CRD? When I edit the nodename-reactor-config config map directly and delete the pod for that nodename to restart drbd-reactor, the address in the config map gets reset to [::], although the log.toml part with trace (which I also added manually) stays untouched.

RichardSufliarsky avatar Apr 17 '23 10:04 RichardSufliarsky

Sorry, found it: https://github.com/piraeusdatastore/piraeus-operator/issues/441#issuecomment-1484970615

RichardSufliarsky avatar Apr 17 '23 10:04 RichardSufliarsky

No drbd-reactor container crashes since I set the Prometheus address to 0.0.0.0:9942:

apiVersion: piraeus.io/v1
kind: LinstorSatelliteConfiguration
metadata:
  name: drbd-reactor-trace
spec:
  patches:
    - target:
        kind: ConfigMap
        name: reactor-config
      patch: |
        apiVersion: v1
        kind: ConfigMap
        metadata:
          name: reactor-config
          labels:
            app.kubernetes.io/component: linstor-satellite
        data:
          prometheus.toml: |
            [[prometheus]]
            enums = true
            address = "0.0.0.0:9942"
            
            [[log]]
            level = "trace"
          log.toml: |
            [[log]]
            level = "trace"

RichardSufliarsky avatar Apr 18 '23 20:04 RichardSufliarsky