chore: gpu cleanup
- [x] remove init() functions in gpu devices - change to dynamic device registration.
- [x] remove custom gpu tags from the Makefile with the exception of habana (libhlml.go cgo implementation in vendor directory is problematic) so that all device implementations are built in.
- [x] add missing copyright tags
- [x] cleanup dcgm, nvml, and habana globals
🤖 SeineSailor
Here's a concise summary of the pull request changes:
Summary: The "chore: gpu cleanup" pull request refactors and cleans up GPU-related code, centralizing GPU configuration and updating accelerator interfaces. Key changes include:
- Modified
init()functions in GPU devices - Removed custom GPU flags
- Updated
GetActiveAcceleratorByTypefunction calls - Centralized GPU configuration in
main()function - Refactored
AcceleratorTypeconstant and related functions - Changed
Registrytype to use string keys instead ofAcceleratorType - Updated
Device()method inAcceleratorinterface - Changed
New()function to accept a string type for the accelerator - Added
habanaCheckfunction for Habana GPU initialization error checking
Impact: These internal cleanups do not affect the external interface or behavior of the code. However, developers should be aware of the updated import path and altered function signatures, which may impact dependent code.
Observations/Suggestions:
- The changes seem to improve code organization and consistency, making it easier to maintain and extend GPU-related functionality.
- It's essential to thoroughly test the updated code to ensure it doesn't introduce any regressions or compatibility issues.
- Consider adding more detailed comments or documentation to explain the reasoning behind these changes and how they affect the codebase.
- If there are any dependent projects or code that rely on the old GPU flags or function signatures, ensure they are updated accordingly to avoid compatibility issues.
will rebase once https://github.com/sustainable-computing-io/kepler/pull/1788 is merged
Tested with Kubeadm :
kubectl logs -n kepler kepler-exporter-wm9sd
WARNING: failed to read int from file: open /sys/devices/system/cpu/cpu0/online: no such file or directory
I0930 13:33:58.022593 779728 exporter.go:103] Kepler running on version: 0b12e73b-dirty
I0930 13:33:58.022740 779728 config.go:293] using gCgroup ID in the BPF program: true
I0930 13:33:58.022774 779728 config.go:295] kernel version: 5.14
I0930 13:33:58.022938 779728 config.go:322] The Idle power will be exposed. Are you running on Baremetal or using single VM per node?
I0930 13:33:58.022952 779728 config.go:256] ENABLE_EBPF_CGROUPID: true
I0930 13:33:58.022959 779728 config.go:257] ENABLE_GPU: true
I0930 13:33:58.022964 779728 config.go:258] ENABLE_PROCESS_METRICS: false
I0930 13:33:58.022970 779728 config.go:259] EXPOSE_HW_COUNTER_METRICS: true
I0930 13:33:58.022975 779728 config.go:260] EXPOSE_IRQ_COUNTER_METRICS: true
I0930 13:33:58.022981 779728 config.go:261] EXPOSE_BPF_METRICS: true
I0930 13:33:58.022988 779728 config.go:262] EXPOSE_COMPONENT_POWER: true
I0930 13:33:58.022994 779728 config.go:263] EXPOSE_ESTIMATED_IDLE_POWER_METRICS: true. This only impacts when the power is estimated using pre-prained models. Estimated idle power is meaningful only when Kepler is running on bare-metal or with a single virtual machine (VM) on the node.
I0930 13:33:58.023002 779728 config.go:264] EXPERIMENTAL_BPF_SAMPLE_RATE: 0
I0930 13:33:58.023046 779728 power.go:59] use sysfs to obtain power
I0930 13:33:58.023119 779728 csv_cred.go:68] failed to read csv file: node name u37-h31-000-r7425.rdu3.labs.perfscale.redhat.com not found in file /etc/redfish/redfish.csv
I0930 13:33:58.023130 779728 redfish.go:173] failed to initialize node credential: no supported node credential implementation
I0930 13:33:58.025843 779728 acpi.go:71] Could not find any ACPI power meter path. Is it a VM?
I0930 13:33:58.025851 779728 power.go:79] using none to obtain power
I0930 13:33:58.029368 779728 dcgm.go:66] Initializing dcgm Successful
E0930 13:33:58.029379 779728 device.go:135] Device with type NVML doesn't exist
I0930 13:33:58.029390 779728 device.go:163] Try to Register DCGM
I0930 13:33:58.029395 779728 device.go:121] Adding the device to the registry [gpu][DCGM]
I0930 13:33:58.029402 779728 device.go:170] Registered DCGM
I0930 13:33:58.029411 779728 dcgm.go:69] Using DCGM to obtain processor power
I0930 13:33:58.029415 779728 fakehabana.go:24] Error initializing habana: ERROR_LIBRARY_NOT_FOUND
I0930 13:33:58.033739 779728 nvml.go:50] Initializing nvml Successful
I0930 13:33:58.033779 779728 nvml.go:55] Error registering nvml: DCGM already registered. Skipping NVML
I0930 13:33:58.033792 779728 accelerator.go:139] Initializing the Accelerator of type gpu
I0930 13:33:58.033796 779728 device.go:183] Starting up DCGM
I0930 13:33:58.046281 779728 dcgm.go:455] Created device group "dev-grp-2024-09-30-13-33-58"
I0930 13:33:58.092921 779728 dcgm.go:123] DCGM initialized successfully
I0930 13:33:58.092944 779728 dcgm.go:88] Using DCGM to obtain gpu power
I0930 13:33:58.092952 779728 accelerator.go:151] Startup gpu Accelerator successful
Maybe one improvement would be to not do init twice (once as a registration test then again for the real initialisation), I'm open to suggestion.
@dave-tucker thanks for the ping. I will go through the patch today. Meantime, @maryamtahhan , would you mind rebasing this ?