error creating Flexvolume plugin from directory nodeagent~uds, skipping. Error: unexpected end of JSON input

Open diranged opened this issue 3 years ago • 2 comments

While debugging some other issues, I found that our Bottlerocket nodes are spamming their journald logs like this:

Sep 12 16:46:44 ip-100-64-189-233.us-west-2.compute.internal kubelet[1883]: E0912 16:46:44.845625    1883 plugins.go:752] "Error dynamically probing plugins" err="error creating Flexvolume plugin from directory nodeagent~uds, skipping. Error: unexpected end of JSON input"
Sep 12 16:46:44 ip-100-64-189-233.us-west-2.compute.internal audit[7291]: AVC avc:  denied  { execute } for  pid=7291 comm="kubelet" name="uds" dev="nvme1n1p1" ino=2934 scontext=system_u:system_r:system_t:s0 tcontext=system_u:object_r:local_t:s0 tclass=file permissive=0
Sep 12 16:46:44 ip-100-64-189-233.us-west-2.compute.internal kubelet[1883]: E0912 16:46:44.845738    1883 driver-call.go:262] Failed to unmarshal output for command: init, output: "", error: unexpected end of JSON input
Sep 12 16:46:44 ip-100-64-189-233.us-west-2.compute.internal kubelet[1883]: W0912 16:46:44.845746    1883 driver-call.go:149] FlexVolume: driver call failed: executable: /var/lib/kubelet/plugins/volume/exec/nodeagent~uds/uds, args: [init], error: fork/exec /var/lib/kubelet/plugins/volume/exec/nodeagent~uds/uds: permission denied, output: ""

We are using the Tigera Operator to install Calico on our nodes, and most of the features seem to work just fine. I don't know much about the UDS system, but I did find that some work was previously done (https://github.com/bottlerocket-os/bottlerocket/pull/1417) to help support this.

I have jumped into the host itself and found that the uds binary is indeed installed into that location, and it technically works:


/.bottlerocket/rootfs
[root@admin]# cd var/lib/kubelet/plugins/   
ebs.csi.aws.com/ efs.csi.aws.com/ volume/          
[root@admin]# cd var/lib/kubelet/plugins/volume/exec/nodeagent~uds/
[root@admin]# ./uds 
Usage:
  flexvoldrv [command]

Available Commands:
  completion  Generate the autocompletion script for the specified shell
  help        Help about any command
  init        Flex volume init command.
  mount       Flex volume mount command.
  unmount     Flex volume unmount command.
  version     Print version

Flags:
  -h, --help   help for flexvoldrv

Use "flexvoldrv [command] --help" for more information about a command.
[root@admin]# 

The thing that seems suspicious to me is the AVC avc error, but I am a little out of my depth on that one. Could there be some security setting on the Bottlerocket AMI preventing this process from being started?

Image I'm using:

Bottlerocket 1.9.1 for EKS 1.23
Calico 3.23.3
Tigera Operator 1.27.12

What I expected to happen:

No errors? :)

What actually happened:

Errors. :)

diranged avatar Sep 12 '22 17:09 diranged

(In fairness, FlexVolume is deprecated ... and I can turn it off... just pointing this out though).

diranged avatar Sep 12 '22 17:09 diranged

Thanks for bringing this up!

> The thing that seems suspicious to me is the AVC avc error, but I am a little out of my depth on that one. Could there be some security setting on the Bottlerocket AMI preventing this process from being started?

Bottlerocket runs SELinux in enforcing mode. The AVC message indicates that SELinux has denied an action. In this case, it looks like SELinux denied kubelet permission to execute the uds binary. We'll take a closer look at this!

etungsten avatar Sep 13 '22 16:09 etungsten
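
For readers who want to pick the AVC denial apart, here is a minimal Python sketch that splits the audit line from the original report into its fields. The `parse_avc` helper is hypothetical (it is not part of any SELinux tooling), and the regular expressions only assume the `key=value` layout visible in the log line quoted in this issue.

```python
import re

# Denial copied from the journal output quoted earlier in this issue.
AVC_LINE = (
    'AVC avc:  denied  { execute } for  pid=7291 comm="kubelet" name="uds" '
    'dev="nvme1n1p1" ino=2934 scontext=system_u:system_r:system_t:s0 '
    'tcontext=system_u:object_r:local_t:s0 tclass=file permissive=0'
)


def parse_avc(line):
    """Hypothetical helper: split an AVC message into permissions and fields."""
    match = re.search(r"avc:\s+(\w+)\s+\{([^}]*)\}\s+for\s+(.*)", line)
    if match is None:
        raise ValueError("not an AVC message")
    result, perms, rest = match.groups()
    # Values are either quoted strings or runs of non-whitespace characters.
    pairs = re.findall(r'(\w+)=("[^"]*"|\S+)', rest)
    fields = {key: value.strip('"') for key, value in pairs}
    return {"result": result, "permissions": perms.split(), **fields}


info = parse_avc(AVC_LINE)
source_type = info["scontext"].split(":")[2]  # SELinux type of the calling process
target_type = info["tcontext"].split(":")[2]  # SELinux type of the target file
print(
    f'{info["comm"]} ({source_type}) was {info["result"]} '
    f'{info["permissions"]} on {info["tclass"]} "{info["name"]}" ({target_type})'
)
```

Read this way, the line says that a process labelled `system_t` (kubelet) was refused `execute` on a file labelled `local_t` (the `uds` binary), and `permissive=0` means the denial was enforced rather than only logged.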

Hi, I have the same issue on my EKS cluster with Bottlerocket. It also causes a lot of log ingestion in CloudWatch because of the kubelet logs.

EKS: 1.23.13
Bottlerocket: 1.11.0
containerd://1.6.8+bottlerocket

Thank you!

guillermobandres avatar Jan 18 '23 14:01 guillermobandres

Hi, Bottlerocket introduced a change back in June 2022 that disallowed container runtime processes from executing host binaries. This was done to improve our security posture after some learnings from log4j. Flexvolume plugins are an unfortunate casualty of the change.

As mentioned in the issue, Flexvolume is deprecated and can be turned off. Can you try switching it off to see if that helps clear up the logs? I'm going to go ahead and close this issue since we don't plan on reverting the SELinux changes to enable this. Please create a new issue if you need a workaround because you actually need to use Flexvolume plugins.

etungsten avatar Jan 18 '23 18:01 etungsten

Hi, how can I turn it off?

guillermobandres avatar Jan 19 '23 07:01 guillermobandres

@guillermobandres you can turn it off by setting the flexVolumePath parameter to None in the installations.operator.tigera.io CRD. I would also suggest doing the same for the kubeletVolumePluginPath parameter.

stevehipwell avatar Jan 19 '23 11:01 stevehipwell
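
As a rough illustration of the suggestion above, the following Python sketch builds a JSON merge patch that sets both fields to `None` and prints a `kubectl patch` command for it. It assumes the fields sit under `spec` of the Installation resource and that the resource is named `default`; both are assumptions to check against your cluster and operator version. The helper names `build_installation_patch` and `kubectl_command` are made up for this example.

```python
import json
import shlex


def build_installation_patch(disable_flexvolume=True, disable_kubelet_plugin_path=False):
    """Build a merge patch for the Installation resource (field placement assumed)."""
    spec = {}
    if disable_flexvolume:
        spec["flexVolumePath"] = "None"
    if disable_kubelet_plugin_path:
        spec["kubeletVolumePluginPath"] = "None"
    return {"spec": spec}


def kubectl_command(patch, name="default"):
    """Render the kubectl invocation; "default" is an assumed resource name."""
    args = [
        "kubectl", "patch", "installations.operator.tigera.io", name,
        "--type", "merge", "-p", json.dumps(patch),
    ]
    return shlex.join(args)


# Print the command instead of running it, so it can be reviewed first.
print(kubectl_command(build_installation_patch()))
```

Passing `disable_kubelet_plugin_path=True` adds the second field to the patch. Whether an older operator version accepts that field, and whether nodes need to be replaced before the change takes effect, is something to verify in your own environment.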

Thank you @stevehipwell, that is what I did: I set flexVolumePath to None in the Calico operator config. I then re-enabled fluent-bit to send the kubelet logs to CloudWatch, but I am still getting a lot of messages with the same error. I also tried to set kubeletVolumePluginPath, but the Calico operator didn't change anything and the pods weren't restarted.

Thank you!

guillermobandres avatar Jan 19 '23 11:01 guillermobandres

@guillermobandres which Tigera Operator version are you on? From memory, when I did this I think I had to replace the nodes.

stevehipwell avatar Jan 19 '23 11:01 stevehipwell

Hi @stevehipwell, I'm using v1.20.1. I verified that it is the same version indicated in the AWS docs.

Thank you

guillermobandres avatar Jan 19 '23 11:01 guillermobandres

@guillermobandres do you mean v3.20.1, which would be the Calico version? I'm not sure how up to date the AWS docs for Calico are, or even if they're maintained, but you'd be strongly advised to at least take the latest patch version.

stevehipwell avatar Jan 19 '23 12:01 stevehipwell

@stevehipwell The tigera-operator is running version 1.20.1, but it deploys Calico version 3.20.0.

Thank you

guillermobandres avatar Jan 19 '23 12:01 guillermobandres

@guillermobandres that is quite an old version and I'm not sure if the CRD fields above are supported.

stevehipwell avatar Jan 19 '23 13:01 stevehipwell

@stevehipwell the first parameter, flexVolumePath, is supported, and setting it redeployed the Calico pods without the init container that configures something related to flex-volume.

The second, kubeletVolumePluginPath, seems not to be supported.

This is the official AWS template for the Calico operator installation:

https://raw.githubusercontent.com/aws/amazon-vpc-cni-k8s/master/config/master/calico-operator.yaml

Thank you

guillermobandres avatar Jan 19 '23 13:01 guillermobandres