Allow Telegraf to start even if nvidia-smi or rocm-smi inputs are not available on a cluster
Use Case
We have a compute cluster. Some nodes have GPUs, some don't. Some are NVIDIA, one is AMD. Essentially similar to #3723 and related issues, but for GPUs instead of Kafka.
We share telegraf.conf via NFS across all the nodes. We had to make 2 copies of it (and there will now be a 3rd); the only difference between them is that [[inputs.nvidia_smi]] is commented out on CPU-only nodes. Our service start script has to determine whether it is running on a GPU node and start the Telegraf container with or without the NVIDIA GPU config accordingly. The rest of the config is identical, such as CPU metrics, memory, etc.
We would like it so that only one conf file is maintained, and if nvidia-smi or rocm-smi don't exist, ignore them and don't fail the telegraf service.
Expected behavior
Telegraf service continues to run for other metrics as per normal but just logs that nvidia-smi or rocm-smi was not found.
Actual behavior
Telegraf service fails to start.
Additional info
No response
Similar to other start up on error issues, we would be happy to see a PR that follows the current pattern of other plugins.
@powersj -- do you have pointers on where to look to get started on something like that? Not a Go expert, but maybe we can concoct something... Thank you.
@smokhov you could also create a superset config and use --input-filter, or simply strip the custom parts out into their own config and pass that in with --config on the nodes that do have the devices...
Some prior art:
- https://github.com/influxdata/telegraf/pull/12828
- https://github.com/influxdata/telegraf/pull/14534
Essentially adding a config option that is similar to:
## Behavior when we fail to connect to the endpoint on initialization. Valid options are:
## "error": throw an error and exits Telegraf
## "ignore": ignore this plugin if errors are encountered
# connect_fail_behavior = "error"
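As a starting point for the GPU plugins, here is a minimal sketch of how such an option could gate a missing binary at initialization. It is not Telegraf's actual plugin code: the `GPUInput` type, the `MissingBinaryBehavior` field, and the `lookPath` hook are hypothetical names chosen for illustration, and a real PR should follow the plugin structure used in the PRs linked above. The only real API relied on is `exec.LookPath` from the Go standard library.

```go
package main

import (
	"fmt"
	"log"
	"os/exec"
)

// GPUInput is a hypothetical stand-in for a plugin such as inputs.nvidia_smi.
// It is NOT the real Telegraf plugin type.
type GPUInput struct {
	BinPath               string // binary to look up, e.g. "nvidia-smi"
	MissingBinaryBehavior string // hypothetical option: "error" (default) or "ignore"

	ignored bool // set when the binary is missing and the behavior is "ignore"
}

// lookPath is a package-level variable so it can be swapped out in tests.
var lookPath = exec.LookPath

// Init validates the option and checks that the binary can be found.
func (g *GPUInput) Init() error {
	switch g.MissingBinaryBehavior {
	case "":
		g.MissingBinaryBehavior = "error"
	case "error", "ignore":
		// valid values, nothing to do
	default:
		return fmt.Errorf("invalid behavior %q", g.MissingBinaryBehavior)
	}

	if _, err := lookPath(g.BinPath); err != nil {
		if g.MissingBinaryBehavior == "ignore" {
			// Log and carry on so the other inputs keep running.
			log.Printf("W! %q not found, ignoring plugin: %v", g.BinPath, err)
			g.ignored = true
			return nil
		}
		return fmt.Errorf("binary %q not found: %w", g.BinPath, err)
	}
	return nil
}

// Gather does nothing when the plugin was ignored during Init.
// A real plugin would run the binary and parse its output here.
func (g *GPUInput) Gather() error {
	if g.ignored {
		return nil
	}
	return nil
}

func main() {
	// "no-such-gpu-tool" is a made-up name that should not exist on PATH.
	strict := &GPUInput{BinPath: "no-such-gpu-tool"}
	fmt.Println("default behavior, Init error:", strict.Init())

	lenient := &GPUInput{BinPath: "no-such-gpu-tool", MissingBinaryBehavior: "ignore"}
	fmt.Println("ignore behavior, Init error:", lenient.Init())
}
```

With the default, a missing binary still fails start-up as it does today; with "ignore", the plugin logs a warning and skips collection, which would let one shared config serve both GPU and CPU-only nodes.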