kepler chore: gpu cleanup

[x] remove init() functions in gpu devices - change to dynamic device registration.
[x] remove custom gpu tags from the Makefile so that all device implementations are built in. habana is the exception, because the libhlml.go cgo implementation in the vendor directory is problematic.
[x] add missing copyright tags
[x] cleanup dcgm, nvml, and habana globals
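
A rough sketch of what registration without init() can look like is shown here for illustration. It is not the code in this PR: `Registry`, `Register`, `Startup`, `Device` and `fakeDevice` are hypothetical names, and the rule that the first backend registered for a type wins is an assumption based on the "DCGM already registered. Skipping NVML" behaviour.

```go
package main

import (
	"errors"
	"fmt"
)

// Device is a hypothetical, trimmed-down stand-in for a GPU backend.
type Device interface {
	Name() string
}

// fakeDevice is a placeholder backend used only for this sketch.
type fakeDevice struct{ name string }

func (d fakeDevice) Name() string { return d.name }

// startupFunc brings a backend up and returns it if it is usable.
type startupFunc func() (Device, error)

type backend struct {
	name  string
	start startupFunc
}

// Registry is keyed by a plain string accelerator type (for example "gpu").
type Registry struct {
	backends map[string]backend
}

func NewRegistry() *Registry {
	return &Registry{backends: map[string]backend{}}
}

// Register records a backend for an accelerator type. Assumption: the first
// backend registered for a type wins and later ones are rejected.
func (r *Registry) Register(accType, name string, start startupFunc) error {
	if existing, ok := r.backends[accType]; ok {
		return fmt.Errorf("%s already registered. Skipping %s", existing.name, name)
	}
	r.backends[accType] = backend{name: name, start: start}
	return nil
}

// Startup runs the startup function of the backend registered for accType.
func (r *Registry) Startup(accType string) (Device, error) {
	b, ok := r.backends[accType]
	if !ok {
		return nil, errors.New("no device registered for type " + accType)
	}
	return b.start()
}

func main() {
	r := NewRegistry()

	// Explicit calls from main() replace package-level init() side effects.
	if err := r.Register("gpu", "DCGM", func() (Device, error) {
		return fakeDevice{name: "DCGM"}, nil
	}); err != nil {
		fmt.Println("Error registering dcgm:", err)
	}
	if err := r.Register("gpu", "NVML", func() (Device, error) {
		return fakeDevice{name: "NVML"}, nil
	}); err != nil {
		fmt.Println("Error registering nvml:", err)
	}

	dev, err := r.Startup("gpu")
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println("Using", dev.Name())
}
```

Because registration is an ordinary function call, the order in which backends are tried is visible in one place instead of depending on package import order.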

Aug 19 '24 12:08 maryamtahhan

🤖 SeineSailor

Here's a concise summary of the pull request changes:

Summary: The "chore: gpu cleanup" pull request refactors and cleans up GPU-related code, centralizing GPU configuration and updating accelerator interfaces. Key changes include:

Modified init() functions in GPU devices
Removed custom GPU flags
Updated GetActiveAcceleratorByType function calls
Centralized GPU configuration in main() function
Refactored AcceleratorType constant and related functions
Changed Registry type to use string keys instead of AcceleratorType
Updated Device() method in Accelerator interface
Changed New() function to accept a string type for the accelerator
Added habanaCheck function for Habana GPU initialization error checking

Impact: These internal cleanups do not affect the external interface or behavior of the code. However, developers should be aware of the updated import path and altered function signatures, which may impact dependent code.

Observations/Suggestions:

The changes seem to improve code organization and consistency, making it easier to maintain and extend GPU-related functionality.
It's essential to thoroughly test the updated code to ensure it doesn't introduce any regressions or compatibility issues.
Consider adding more detailed comments or documentation to explain the reasoning behind these changes and how they affect the codebase.
If there are any dependent projects or code that rely on the old GPU flags or function signatures, ensure they are updated accordingly to avoid compatibility issues.

Aug 19 '24 12:08 github-actions[bot]

Will rebase once https://github.com/sustainable-computing-io/kepler/pull/1788 is merged.

Sep 27 '24 15:09 maryamtahhan

Tested with kubeadm:

 kubectl logs -n kepler kepler-exporter-wm9sd
WARNING: failed to read int from file: open /sys/devices/system/cpu/cpu0/online: no such file or directory
I0930 13:33:58.022593  779728 exporter.go:103] Kepler running on version: 0b12e73b-dirty
I0930 13:33:58.022740  779728 config.go:293] using gCgroup ID in the BPF program: true
I0930 13:33:58.022774  779728 config.go:295] kernel version: 5.14
I0930 13:33:58.022938  779728 config.go:322] The Idle power will be exposed. Are you running on Baremetal or using single VM per node?
I0930 13:33:58.022952  779728 config.go:256] ENABLE_EBPF_CGROUPID: true
I0930 13:33:58.022959  779728 config.go:257] ENABLE_GPU: true
I0930 13:33:58.022964  779728 config.go:258] ENABLE_PROCESS_METRICS: false
I0930 13:33:58.022970  779728 config.go:259] EXPOSE_HW_COUNTER_METRICS: true
I0930 13:33:58.022975  779728 config.go:260] EXPOSE_IRQ_COUNTER_METRICS: true
I0930 13:33:58.022981  779728 config.go:261] EXPOSE_BPF_METRICS: true
I0930 13:33:58.022988  779728 config.go:262] EXPOSE_COMPONENT_POWER: true
I0930 13:33:58.022994  779728 config.go:263] EXPOSE_ESTIMATED_IDLE_POWER_METRICS: true. This only impacts when the power is estimated using pre-prained models. Estimated idle power is meaningful only when Kepler is running on bare-metal or with a single virtual machine (VM) on the node.
I0930 13:33:58.023002  779728 config.go:264] EXPERIMENTAL_BPF_SAMPLE_RATE: 0
I0930 13:33:58.023046  779728 power.go:59] use sysfs to obtain power
I0930 13:33:58.023119  779728 csv_cred.go:68] failed to read csv file: node name u37-h31-000-r7425.rdu3.labs.perfscale.redhat.com not found in file /etc/redfish/redfish.csv
I0930 13:33:58.023130  779728 redfish.go:173] failed to initialize node credential: no supported node credential implementation
I0930 13:33:58.025843  779728 acpi.go:71] Could not find any ACPI power meter path. Is it a VM?
I0930 13:33:58.025851  779728 power.go:79] using none to obtain power
I0930 13:33:58.029368  779728 dcgm.go:66] Initializing dcgm Successful
E0930 13:33:58.029379  779728 device.go:135] Device with type NVML doesn't exist
I0930 13:33:58.029390  779728 device.go:163] Try to Register DCGM
I0930 13:33:58.029395  779728 device.go:121] Adding the device to the registry [gpu][DCGM]
I0930 13:33:58.029402  779728 device.go:170] Registered DCGM
I0930 13:33:58.029411  779728 dcgm.go:69] Using DCGM to obtain processor power
I0930 13:33:58.029415  779728 fakehabana.go:24] Error initializing habana: ERROR_LIBRARY_NOT_FOUND
I0930 13:33:58.033739  779728 nvml.go:50] Initializing nvml Successful
I0930 13:33:58.033779  779728 nvml.go:55] Error registering nvml: DCGM already registered. Skipping NVML
I0930 13:33:58.033792  779728 accelerator.go:139] Initializing the Accelerator of type gpu
I0930 13:33:58.033796  779728 device.go:183] Starting up DCGM
I0930 13:33:58.046281  779728 dcgm.go:455] Created device group "dev-grp-2024-09-30-13-33-58"
I0930 13:33:58.092921  779728 dcgm.go:123] DCGM initialized successfully
I0930 13:33:58.092944  779728 dcgm.go:88] Using DCGM to obtain gpu power
I0930 13:33:58.092952  779728 accelerator.go:151] Startup gpu Accelerator successful

Sep 30 '24 12:09 maryamtahhan

One possible improvement would be to avoid running init twice (once as a registration test, then again for the real initialisation). I'm open to suggestions.
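
One way this could look, as a minimal sketch rather than a proposal for the actual code: wrap the library init in a `sync.Once` so that the registration check and the real startup share a single call. `onceInit` and `newOnceInit` are hypothetical names, and the sketch assumes the library is not shut down between the two steps.

```go
package main

import (
	"fmt"
	"sync"
)

// onceInit runs an expensive init function at most once and caches its error.
// Hypothetical helper for illustration; not part of this PR.
type onceInit struct {
	once sync.Once
	init func() error
	err  error
}

func newOnceInit(init func() error) *onceInit {
	return &onceInit{init: init}
}

// Init may be called from both the registration check and the real startup.
// Only the first call runs the underlying init; later calls return the
// cached result, including a cached failure.
func (o *onceInit) Init() error {
	o.once.Do(func() {
		o.err = o.init()
	})
	return o.err
}

func main() {
	calls := 0
	lib := newOnceInit(func() error {
		calls++ // stands in for the real library init call
		return nil
	})

	// Registration test: is the library usable at all?
	if err := lib.Init(); err != nil {
		fmt.Println("skipping device:", err)
		return
	}

	// Real initialisation later on reuses the earlier result.
	if err := lib.Init(); err != nil {
		fmt.Println("startup failed:", err)
		return
	}

	fmt.Println("underlying init calls:", calls)
}
```

The trade-off is that a failed first attempt is cached too, so a retry path would need to create a fresh `onceInit`.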

Sep 30 '24 12:09 maryamtahhan

@dave-tucker thanks for the ping. I will go through the patch today. In the meantime, @maryamtahhan, would you mind rebasing this?

Oct 22 '24 23:10 sthaha