error creating Flexvolume plugin from directory nodeagent~uds, skipping. Error: unexpected end of JSON input
While debugging some other issues, I found that our Bottlerocket nodes are spamming their journald logs like this:
Sep 12 16:46:44 ip-100-64-189-233.us-west-2.compute.internal kubelet[1883]: E0912 16:46:44.845625 1883 plugins.go:752] "Error dynamically probing plugins" err="error creating Flexvolume plugin from directory nodeagent~uds, skipping. Error: unexpected end of JSON input"
Sep 12 16:46:44 ip-100-64-189-233.us-west-2.compute.internal audit[7291]: AVC avc: denied { execute } for pid=7291 comm="kubelet" name="uds" dev="nvme1n1p1" ino=2934 scontext=system_u:system_r:system_t:s0 tcontext=system_u:object_r:local_t:s0 tclass=file permissive=0
Sep 12 16:46:44 ip-100-64-189-233.us-west-2.compute.internal kubelet[1883]: E0912 16:46:44.845738 1883 driver-call.go:262] Failed to unmarshal output for command: init, output: "", error: unexpected end of JSON input
Sep 12 16:46:44 ip-100-64-189-233.us-west-2.compute.internal kubelet[1883]: W0912 16:46:44.845746 1883 driver-call.go:149] FlexVolume: driver call failed: executable: /var/lib/kubelet/plugins/volume/exec/nodeagent~uds/uds, args: [init], error: fork/exec /var/lib/kubelet/plugins/volume/exec/nodeagent~uds/uds: permission denied, output: ""
We are using the Tigera Operator to install Calico on our nodes, and most of the features seem to work just fine. I don't know much about the UDS system, but I did find that some work was previously done (https://github.com/bottlerocket-os/bottlerocket/pull/1417) to help support this.
I have jumped into the host itself and found that the uds binary is indeed installed into that location, and it technically works:
/.bottlerocket/rootfs
[root@admin]# cd var/lib/kubelet/plugins/
ebs.csi.aws.com/ efs.csi.aws.com/ volume/
[root@admin]# cd var/lib/kubelet/plugins/volume/exec/nodeagent~uds/
[root@admin]# ./uds
Usage:
flexvoldrv [command]
Available Commands:
completion Generate the autocompletion script for the specified shell
help Help about any command
init Flex volume init command.
mount Flex volume mount command.
unmount Flex volume unmount command.
version Print version
Flags:
-h, --help help for flexvoldrv
Use "flexvoldrv [command] --help" for more information about a command.
[root@admin]#
The thing that seems suspicious to me is the AVC avc error, but I am a little out of my depth on that one. Could there be some security setting on the Bottlerocket AMI preventing this process from being started?
Image I'm using:
Bottlerocket 1.9.1 for EKS 1.23, Calico 3.23.3, Tigera Operator 1.27.12
What I expected to happen:
No errors? :)
What actually happened:
Errors. :)
(In fairness, FlexVolume is deprecated ... and I can turn it off... just pointing this out though).
Thanks for bringing this up!
The thing that seems suspicious to me is the AVC avc error, but I am a little out of my depth on that one. Could there be some security setting on the Bottlerocket AMI preventing this process from being started?
Bottlerocket has SELinux set to enforcing mode. The AVC message indicates that SELinux has denied an action. In this case, it looks like SELinux prevented kubelet from executing the uds binary. We'll take a closer look at this!
Hi, I have the same issue on my EKS cluster with Bottlerocket. It also caused a lot of log ingestion in CloudWatch because of the kubelet logs.
EKS: 1.23.13 Bottlerocket: 1.11.0 containerd://1.6.8+bottlerocket
Thank you!
Hi, Bottlerocket introduced this change back in June 2022, which prevents container runtime processes from executing host binaries. This was done to improve our security posture after some learnings from log4j. Flexvolume plugins are an unfortunate casualty of the change.
As mentioned in the issue, Flexvolume is deprecated and can be turned off. Can you try switching it off to see if that helps clear up the logs? I'm going to go ahead and close this issue since we don't plan on reverting the SELinux changes to enable this. Please create a new issue if you need a workaround because you actually need to use Flexvolume plugins.
Hi, how can I turn it off?
@guillermobandres you can turn it off by setting the flexVolumePath parameter to None in the installations.operator.tigera.io CRD. I would also suggest doing the same for the kubeletVolumePluginPath parameter.
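For anyone else landing here, the suggestion above could look roughly like the following fragment of the Installation resource. This is only a sketch: it assumes the resource is named default (the usual name for an operator-managed install) and that your Tigera Operator version supports both fields. Everything else in your existing spec should be left as it is.

```yaml
# Sketch only: merge these two fields into your existing Installation spec.
# Assumption: the Installation resource is named "default".
apiVersion: operator.tigera.io/v1
kind: Installation
metadata:
  name: default
spec:
  # "None" disables the FlexVolume driver, so the uds binary is no longer
  # installed under the kubelet plugin directory.
  flexVolumePath: None
  # "None" disables the kubelet volume plugin path as well; older operator
  # versions may not support this field.
  kubeletVolumePluginPath: None
```

Nodes that already have the uds binary on disk may keep logging the error until they are replaced, since kubelet still probes the existing plugin directory.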
Thank you @stevehipwell, that is what I did: I set flexVolumePath to None in the Calico operator config. I then re-enabled fluent-bit to send kubelet logs to CloudWatch, but I am still getting a lot of messages with the same error. I also tried to set kubeletVolumePluginPath, but the Calico operator didn't change anything and the pods weren't restarted.
Thank you!
@guillermobandres which Tigera Operator version are you on? From memory when I did this I think I had to replace the nodes.
Hi @stevehipwell, I'm using v1.20.1. I verified that it is the same version indicated in the AWS docs.
Thank you
@guillermobandres do you mean v3.20.1, which would be the Calico version? I'm not sure how up to date the AWS docs for Calico are, or even if they're maintained, but you'd be strongly advised to at least take the latest patch version.
@stevehipwell The tigera-operator is running version 1.20.1, but it deploys Calico version 3.20.0.
Thank you
@guillermobandres that is quite an old version, and I'm not sure if the CRD fields above are supported in it.
@stevehipwell the first parameter, flexVolumePath, is supported; setting it redeployed the Calico pods without the init container that configures something for flex-volume.
The second, kubeletVolumePluginPath, does not seem to be supported.
This is the official AWS template for the Calico operator installation:
https://raw.githubusercontent.com/aws/amazon-vpc-cni-k8s/master/config/master/calico-operator.yaml
Thank you