k8s-device-plugin

[gfd] Add option to disable automatic cleanup of the features file on gpu-feature-discovery exit

Open belo4ya opened this issue 1 year ago • 2 comments

Issue description

We use node-feature-discovery and gpu-feature-discovery to monitor GPU issues, including cases where the number of available GPUs on a node unexpectedly decreases. The check we rely on is: Target Number == nvidia.com/gpu.count == Node Allocatable.
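As a rough illustration of that check, here is a minimal Go sketch. The function name `gpuCountConsistent`, its signature, and the choice to report a missing label as an error are assumptions made for this example; they are not taken from our actual monitoring setup or from any NVIDIA component.

```go
package main

import (
	"fmt"
	"strconv"
)

// gpuCountConsistent is a hypothetical helper. It reports whether the expected
// GPU count, the value of the nvidia.com/gpu.count label, and the node's
// allocatable GPU count all agree. A missing label is returned as an error so
// a caller can tell "label absent" apart from "count dropped".
func gpuCountConsistent(target int, labels map[string]string, allocatable int) (bool, error) {
	raw, ok := labels["nvidia.com/gpu.count"]
	if !ok {
		return false, fmt.Errorf("label nvidia.com/gpu.count not present")
	}
	count, err := strconv.Atoi(raw)
	if err != nil {
		return false, fmt.Errorf("parsing nvidia.com/gpu.count=%q: %w", raw, err)
	}
	return target == count && count == allocatable, nil
}

func main() {
	// Healthy node: all three values agree.
	labels := map[string]string{"nvidia.com/gpu.count": "8"}
	ok, err := gpuCountConsistent(8, labels, 8)
	fmt.Println(ok, err)

	// Labels temporarily missing: the check cannot succeed.
	ok, err = gpuCountConsistent(8, map[string]string{}, 8)
	fmt.Println(ok, err)
}
```

A check written this way fails whenever the label is absent, which is why a short gap in the nvidia.com/* labels shows up as an alert.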

We have noticed that sometimes after restarting gpu-feature-discovery, all the features (labels nvidia.com/*) exported by gpu-feature-discovery disappear from the node for a period roughly equal to the nfd-worker sleepInterval (in our case, 1 minute). This causes false positives in our monitoring system.

We found that this occurs because gpu-feature-discovery deletes the features.d/gfd file before terminating if it is not running in one-shot mode (done using the removeOutputFile function).

This behavior is very inconvenient (and undesirable) for us, especially when updating the gpu-feature-discovery version in the cluster.

Feature request

I found that this behavior was added with this commit - https://github.com/NVIDIA/gpu-feature-discovery/commit/bc91c4aec84c2bc3e6da47789d6d0a0326330455. However, I did not find an associated Issue justifying the need for this behavior.

Could you please consider:

  1. Adding an option to disable the automatic cleanup before gpu-feature-discovery terminates, via a flag and/or environment variable (e.g., --no-cleanup-on-exit).
  2. Or removing the automatic cleanup on gpu-feature-discovery shutdown altogether.

An argument for 2. could be that node-feature-discovery does not do this. Instead, it uses a prune-job.
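For option 1, a minimal sketch of how such a switch could be resolved, using only the Go standard library. The flag name comes from the example above; the environment variable `GFD_NO_CLEANUP_ON_EXIT`, the helper `parseNoCleanup`, and the precedence rule (flag overrides environment) are assumptions made for illustration and do not describe how gpu-feature-discovery actually parses its options.

```go
package main

import (
	"flag"
	"fmt"
	"os"
	"strconv"
)

// parseNoCleanup resolves the hypothetical --no-cleanup-on-exit option.
// The (assumed) environment variable supplies the default value, and an
// explicit command-line flag overrides it.
func parseNoCleanup(args []string, getenv func(string) string) (bool, error) {
	def := false
	if v := getenv("GFD_NO_CLEANUP_ON_EXIT"); v != "" {
		parsed, err := strconv.ParseBool(v)
		if err != nil {
			return false, fmt.Errorf("invalid GFD_NO_CLEANUP_ON_EXIT=%q: %w", v, err)
		}
		def = parsed
	}

	fs := flag.NewFlagSet("gpu-feature-discovery", flag.ContinueOnError)
	noCleanup := fs.Bool("no-cleanup-on-exit", def, "keep the features file when the process exits")
	if err := fs.Parse(args); err != nil {
		return false, err
	}
	return *noCleanup, nil
}

func main() {
	noCleanup, err := parseNoCleanup(os.Args[1:], os.Getenv)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(2)
	}
	// The exit path would skip removing the features file when noCleanup is true.
	fmt.Println("cleanup on exit:", !noCleanup)
}
```

Defaulting to false keeps the current behavior for existing deployments, so only users who opt in would see the file preserved across restarts.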

belo4ya avatar Jul 01 '24 00:07 belo4ya

@elezar, @klueska, @ArangoGutierrez, please take a look at this

belo4ya avatar Jul 10 '24 06:07 belo4ya

Thanks @belo4ya, I have created #899 to add this option and we can continue this discussion there.

@ArangoGutierrez one thing that I noted is that we don't do any cleanup when the NodeFeatureAPI is used. How are labels removed in this case?

elezar avatar Aug 12 '24 12:08 elezar

This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed.

github-actions[bot] avatar Nov 11 '24 04:11 github-actions[bot]

This issue was automatically closed due to inactivity.

github-actions[bot] avatar Dec 11 '24 04:12 github-actions[bot]