
Add initContainers to the helm chart for the node DaemonSet

Open • jon-rei opened this issue 6 months ago • 1 comment

Is your feature request related to a problem? Please describe.

We sometimes get "no space left on device" errors when running high-throughput jobs on our FSx for Lustre filesystem. This happens most often when the filesystem is also low on space. AWS has a documentation page about this error, which suggests fixing it by running the following on the host: `sudo lctl set_param osc.*.max_dirty_mb=64`.

Describe the solution you'd like in detail

Our idea is to fix this by running an init container, similar to this example, on the node DaemonSet. That way the setting is guaranteed to be applied on every node before our actual workload starts.
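To make the idea concrete, here is a rough sketch of what such an init container could look like in the node DaemonSet's pod spec. It is only an illustration: the container name and image are hypothetical, the image is assumed to ship the `lctl` binary, and the container is assumed to need privileged mode so it can change Lustre client parameters on the host. How the chart would expose this (for example through a values key) is not defined yet.

```yaml
# Sketch only: fragment of the node DaemonSet pod spec (spec.template.spec).
initContainers:
  - name: set-lustre-max-dirty-mb                     # hypothetical name
    image: example.com/lustre-client-tools:latest     # hypothetical image, assumed to contain lctl
    securityContext:
      privileged: true                                # assumption: required to write host Lustre parameters
    # Same command as in the AWS documentation; the pattern is quoted so the
    # shell passes it to lctl unexpanded.
    command: ["/bin/sh", "-c", "lctl set_param 'osc.*.max_dirty_mb=64'"]
```

Since the init container has to exit successfully before the driver's node container starts, the setting would be applied once per node rather than once per workload pod.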

Describe alternatives you've considered

Run an initContainer on our workload pods. However, we run hundreds of pods, many of them on the same node, so this is not practical.

Would this be a reasonable approach to fix this problem? If so, I would open a PR that makes it possible to add an initContainer to the node DaemonSet.

jon-rei, Aug 23 '24 06:08