[stable/node-problem-detector] "permission denied" when using script in custom_monitor_definitions

Open yogeek opened this issue 2 years ago • 1 comments

With this values.yaml, which adds a custom plugin:

values.yaml
settings:
  prometheus_address: 0.0.0.0
  prometheus_port: 20257

  custom_monitor_definitions:
    drainme.sh: |
      #!/bin/bash
      set -euo pipefail

      echo "Checking commands..."
      for cmd in curl jq
      do
        if ! command -v $cmd &> /dev/null
        then
          echo "installing $cmd..."
          apt update -qq >/dev/null 2>&1 && apt install -y $cmd -qq >/dev/null 2>&1
        fi
      done

      # Point to the internal API server hostname
      APISERVER=https://kubernetes.default.svc

      # Path to ServiceAccount token
      SERVICEACCOUNT=/var/run/secrets/kubernetes.io/serviceaccount

      # Read this Pod's namespace
      NAMESPACE=$(cat ${SERVICEACCOUNT}/namespace)

      # Read the ServiceAccount bearer token
      TOKEN=$(cat ${SERVICEACCOUNT}/token)

      # Reference the internal certificate authority (CA)
      CACERT=${SERVICEACCOUNT}/ca.crt

      # Call node API with NODE_NAME
      echo "Checking current node = $NODE_NAME..."
      drainme=$(curl -s --cacert ${CACERT} --header "Authorization: Bearer ${TOKEN}" -X GET ${APISERVER}/api/v1/nodes/${NODE_NAME} | jq -r '.metadata.labels.drainme')

      if [[ "$drainme" == "true" ]]
      then
        echo "Drain requested"
        exit 1
      fi
      echo "No drain needed"
      exit 0
    drainme.json: |
      {
        "plugin": "custom",
        "pluginConfig": {
          "invoke_interval": "10s",
          "timeout": "3m",
          "max_output_length": 80,
          "concurrency": 1
        },
        "source": "drainme-custom-plugin-monitor",
        "conditions": [
          {
            "type": "DrainRequest",
            "reason": "NoDrain",
            "message": "No drain"
          }
        ],
        "rules": [
          {
            "type": "permanent",
            "condition": "DrainRequest",
            "reason": "DrainMe",
            "path": "/custom-config/drainme.sh"
          }
        ]
      }
  custom_plugin_monitors:
  - /custom-config/drainme.json

and this installation process:

helm template npd deliveryhero/node-problem-detector \
      --version  \
      --namespace "2.2.1" \
      --set image.tag="v0.8.10" \
      --values values.yaml \
| kubectl apply -f -

the resulting NPD pod logs contain an error because the custom plugin script has no execute permission: Error in starting plugin "/custom-config/drainme.sh": error - fork/exec /custom-config/drainme.sh: permission denied

Full pod log
I0505 18:39:35.316142       1 log_monitor.go:79] Finish parsing log monitor config file /config/kernel-monitor.json: {WatcherConfig:{Plugin:kmsg PluginConfig:map[] LogPath:/dev/kmsg Lookback:5m Delay:} BufferSize:10 Source:kernel-monitor DefaultConditions:[{Type:KernelDeadlock Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:KernelHasNoDeadlock Message:kernel has no deadlock} {Type:ReadonlyFilesystem Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:FilesystemIsNotReadOnly Message:Filesystem is not read-only}] Rules:[{Type:temporary Condition: Reason:OOMKilling Pattern:Killed process \d+ (.+) total-vm:\d+kB, anon-rss:\d+kB, file-rss:\d+kB.*} {Type:temporary Condition: Reason:TaskHung Pattern:task [\S ]+:\w+ blocked for more than \w+ seconds\.} {Type:temporary Condition: Reason:UnregisterNetDevice Pattern:unregister_netdevice: waiting for \w+ to become free. Usage count = \d+} {Type:temporary Condition: Reason:KernelOops Pattern:BUG: unable to handle kernel NULL pointer dereference at .*} {Type:temporary Condition: Reason:KernelOops Pattern:divide error: 0000 \[#\d+\] SMP} {Type:temporary Condition: Reason:Ext4Error Pattern:EXT4-fs error .*} {Type:temporary Condition: Reason:Ext4Warning Pattern:EXT4-fs warning .*} {Type:temporary Condition: Reason:IOError Pattern:Buffer I/O error .*} {Type:temporary Condition: Reason:MemoryReadError Pattern:CE memory read error .*} {Type:permanent Condition:KernelDeadlock Reason:AUFSUmountHung Pattern:task umount\.aufs:\w+ blocked for more than \w+ seconds\.} {Type:permanent Condition:KernelDeadlock Reason:DockerHung Pattern:task docker:\w+ blocked for more than \w+ seconds\.} {Type:permanent Condition:ReadonlyFilesystem Reason:FilesystemIsReadOnly Pattern:Remounting filesystem read-only}] EnableMetricsReporting:0xc00046071e}
I0505 18:39:35.316349       1 log_watchers.go:40] Use log watcher of plugin "kmsg"
I0505 18:39:35.316630       1 log_monitor.go:79] Finish parsing log monitor config file /config/docker-monitor.json: {WatcherConfig:{Plugin:journald PluginConfig:map[source:dockerd] LogPath:/var/log/journal Lookback:5m Delay:} BufferSize:10 Source:docker-monitor DefaultConditions:[{Type:CorruptDockerOverlay2 Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:NoCorruptDockerOverlay2 Message:docker overlay2 is functioning properly}] Rules:[{Type:temporary Condition: Reason:CorruptDockerImage Pattern:Error trying v2 registry: failed to register layer: rename /var/lib/docker/image/(.+) /var/lib/docker/image/(.+): directory not empty.*} {Type:permanent Condition:CorruptDockerOverlay2 Reason:CorruptDockerOverlay2 Pattern:returned error: readlink /var/lib/docker/overlay2.*: invalid argument.*} {Type:temporary Condition: Reason:DockerContainerStartupFailure Pattern:OCI runtime start failed: container process is already dead: unknown}] EnableMetricsReporting:0xc00046109a}
I0505 18:39:35.316664       1 log_watchers.go:40] Use log watcher of plugin "journald"
I0505 18:39:35.316943       1 custom_plugin_monitor.go:81] Finish parsing custom plugin monitor config file /custom-config/drainme.json: {Plugin:custom PluginGlobalConfig:{InvokeIntervalString:0xc0003ecda0 TimeoutString:0xc0003ecdb0 InvokeInterval:10s Timeout:3m0s MaxOutputLength:0xc000461740 Concurrency:0xc000461750 EnableMessageChangeBasedConditionUpdate:0x223ebfd} Source:drainme-custom-plugin-monitor DefaultConditions:[{Type:DrainRequest Status: Transition:0001-01-01 00:00:00 +0000 UTC Reason:NoDrain Message:No drain}] Rules:[0xc000673490] EnableMetricsReporting:0x219a95b}
I0505 18:39:35.318331       1 k8s_exporter.go:54] Waiting for kube-apiserver to be ready (timeout 5m0s)...
I0505 18:39:35.411141       1 node_problem_detector.go:63] K8s exporter started.
I0505 18:39:35.411244       1 node_problem_detector.go:67] Prometheus exporter started.
I0505 18:39:35.411257       1 log_monitor.go:111] Start log monitor /config/kernel-monitor.json
I0505 18:39:35.411306       1 log_monitor.go:111] Start log monitor /config/docker-monitor.json
I0505 18:39:35.414268       1 log_watcher.go:80] Start watching journald
I0505 18:39:35.414296       1 custom_plugin_monitor.go:112] Start custom plugin monitor /custom-config/drainme.json
I0505 18:39:35.414311       1 problem_detector.go:76] Problem detector started
I0505 18:39:35.411712       1 log_monitor.go:235] Initialize condition generated: [{Type:KernelDeadlock Status:False Transition:2022-05-05 18:39:35.411686514 +0000 UTC m=+0.171292006 Reason:KernelHasNoDeadlock Message:kernel has no deadlock} {Type:ReadonlyFilesystem Status:False Transition:2022-05-05 18:39:35.411686668 +0000 UTC m=+0.171292135 Reason:FilesystemIsNotReadOnly Message:Filesystem is not read-only}]
I0505 18:39:35.414754       1 custom_plugin_monitor.go:296] Initialize condition generated: [{Type:DrainRequest Status:False Transition:2022-05-05 18:39:35.414740692 +0000 UTC m=+0.174346186 Reason:NoDrain Message:No drain}]
I0505 18:39:35.414817       1 log_monitor.go:235] Initialize condition generated: [{Type:CorruptDockerOverlay2 Status:False Transition:2022-05-05 18:39:35.414806985 +0000 UTC m=+0.174412467 Reason:NoCorruptDockerOverlay2 Message:docker overlay2 is functioning properly}]
E0505 18:39:35.501582       1 plugin.go:164] Error in starting plugin "/custom-config/drainme.sh": error - fork/exec /custom-config/drainme.sh: permission denied
I0505 18:39:35.501697       1 custom_plugin_monitor.go:276] New status generated: &{Source:drainme-custom-plugin-monitor Events:[{Severity:info Timestamp:2022-05-05 18:39:35.501637077 +0000 UTC m=+0.261242614 Reason:NoDrain Message:Node condition DrainRequest is now: Unknown, reason: NoDrain}] Conditions:[{Type:DrainRequest Status:Unknown Transition:2022-05-05 18:39:35.501637077 +0000 UTC m=+0.261242614 Reason:NoDrain Message:Error in starting plugin. Please check the error log}]}
E0505 18:39:45.418820       1 plugin.go:164] Error in starting plugin "/custom-config/drainme.sh": error - fork/exec /custom-config/drainme.sh: permission denied
E0505 18:39:55.418852       1 plugin.go:164] Error in starting plugin "/custom-config/drainme.sh": error - fork/exec /custom-config/drainme.sh: permission denied
[...]
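
The "fork/exec ... permission denied" lines in the log are what the kernel returns when a file is executed without any execute bit set. ConfigMap volumes are mounted with mode 0644 unless defaultMode says otherwise, so the script is readable but not runnable. A minimal local sketch of that behaviour, with no cluster involved; the temporary directory is only a hypothetical stand-in for the /custom-config mount and the script body is a placeholder:

```shell
#!/bin/sh
# Local reproduction of the failure mode; no cluster needed.
# The temp directory is a hypothetical stand-in for the /custom-config mount.
workdir=$(mktemp -d)
script="$workdir/drainme.sh"

# Placeholder plugin script (not the real drainme.sh from the issue).
printf '#!/bin/sh\necho "No drain needed"\n' > "$script"

# ConfigMap volumes default to mode 0644: readable, but not executable.
chmod 0644 "$script"
if "$script" 2>/dev/null; then
  before="ran"
else
  before="denied"
fi

# Same file with the mode that the issue sets through defaultMode.
chmod 0755 "$script"
after=$("$script")

echo "mode 0644: $before"
echo "mode 0755: $after"
rm -rf "$workdir"
```

On a real pod, listing the mounted file with `ls -lL /custom-config/drainme.sh` should show whether the execute bit is present, assuming the image ships `ls`.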

If I edit the DaemonSet manifest to add defaultMode: 0755, it works.

- name: custom-config
  configMap:
    name: npd-node-problem-detector-custom-config
    defaultMode: 0755
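
One detail worth knowing when applying this workaround outside a YAML manifest: JSON has no octal literals, so the same mode must be written as the decimal number 493. The sketch below converts the value and builds a strategic merge patch body for the volume. The DaemonSet name `npd-node-problem-detector` is an assumption inferred from the ConfigMap name above, `<namespace>` is a placeholder, and the patch would be overwritten by the next `helm template | kubectl apply` run, so treat it as a stopgap rather than a fix in the chart.

```shell
#!/bin/sh
# 0755 is written in octal in the YAML manifest; JSON has no octal form,
# so the same permission bits are expressed as a decimal number there.
mode_octal=0755
mode_decimal=$(printf '%d' "$mode_octal")   # printf reads a leading 0 as octal

echo "defaultMode (YAML, octal):   $mode_octal"
echo "defaultMode (JSON, decimal): $mode_decimal"

# Strategic merge patch body; volumes are matched by their "name" key.
patch=$(printf '{"spec":{"template":{"spec":{"volumes":[{"name":"custom-config","configMap":{"defaultMode":%d}}]}}}}' "$mode_octal")
echo "$patch"

# To apply it (DaemonSet name and namespace are assumptions, not from the chart docs):
# kubectl -n <namespace> patch daemonset npd-node-problem-detector --patch "$patch"
```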

How is the chart supposed to handle this permission without defaultMode? Is it possible to add execute permission to the plugin script through the chart's custom plugin configuration?

yogeek, May 05 '22 18:05

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot], Aug 13 '22 03:08

This issue is stale because it has been open 14 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions[bot], Aug 28 '22 02:08

/reopen please

yogeek, Sep 02 '22 11:09

@yogeek I think you can just click the Reopen button 🙂

max-rocket-internet, Sep 02 '22 12:09

@max-rocket-internet I do not see such a button on my side, sorry.

yogeek, Sep 02 '22 19:09

Oh I see, sorry!

max-rocket-internet, Sep 05 '22 07:09

This issue is stale because it has been open 14 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions[bot], Sep 21 '22 02:09