gitops-engine 'Apply/ReplaceResource' in resource_ops.go may leak files to '/dev/shm' since the kubectl 'apply/replace' commands never time out


Open jgwest opened this issue 9 months ago • 0 comments

gitops-engine directly calls kubectl command code to create/apply/replace/delete K8s resources on the cluster. This ensures that the logic used by gitops-engine consumers (such as Argo CD) interacts with those K8s resources in a way that is compatible with kubectl.

However, at present, gitops-engine does not specify a timeout value for 'kubectl create/apply/replace' commands.

This means that in rare cases (such as cluster/network issues), the kubectl operation will remain running forever, waiting for an I/O operation that may never complete.

Normally this would just be a small memory leak (i.e. not necessarily the end of the world). However, in order to call the kubectl command code, gitops-engine writes manifest files to '/dev/shm', which are then passed via the '-f' file option to kubectl.

This means that those long-running I/O operations are also leaking K8s manifest files to /dev/shm: the K8s manifest files must remain in '/dev/shm' while the I/O operation is in progress. '/dev/shm' appears limited to 64MB, which can fill quickly.

When examining the contents of /dev/shm from users that have reported this issue, we see a large number of miscellaneous manifests that are hours or days old (dating back to the last Pod restart).

The proposed solution (PR attached) is to add a long default timeout to calls to kubectl's apply command.

Related: https://github.com/argoproj/gitops-engine/issues/568

May 04 '24 10:05 jgwest
