Jae-Won Chung issues

Results: 58 issues of Jae-Won Chung

Test and verify `nvmlDeviceSetAPIRestriction`

[`nvmlDeviceSetAPIRestriction`](https://docs.nvidia.com/deploy/nvml-api/group__nvmlDeviceCommands.html#group__nvmlDeviceCommands_1gde319ab4fb254a7c42625d6eca2bf37d) seems to offer a way to reduce the permissions required for setting the GPU's frequency (but not the power limit). If this works, a workflow could be for an administrator...

enhancement

Support for AMD GPUs

AMD GPUs are seeing increased adoption. ROCm has nice compatibility layers with PyTorch, too. Plus, ROCm-SMI (apparently) has all the energy-related management APIs we need -- measuring power and energy,...

enhancement

Carbon-aware Zeus (Chase) as an optimizer

Currently, Chase is just an unmaintained branch. We should make it an actual optimizer in Zeus. Perhaps it's a good idea to generalize it to not just carbon, but time-varying...

enhancement

Power management server

NVML requires the Linux `SYS_ADMIN` capability for applications to set the GPU's power limit or frequency. In production environments, you can't just give your application containers `SYS_ADMIN` because it allows...

enhancement

CPU and DRAM energy measurement

CPU and DRAM typically account for only a small fraction of the energy consumed by deep learning workloads and, since most of the heavy computation is done by GPUs, it hasn't been that...

enhancement

Cluster-wide energy metric aggregation

With multiple jobs tracking their energy consumption with the `ZeusMonitor`, it would be nice to be able to aggregate time/energy metrics to Prometheus. The metric name should be derived from...

enhancement

`GlobalPowerLimitOptimizer` for distributed data parallel training

`GlobalPowerLimitOptimizer` works well for single-node data parallel training, but in distributed data parallel training, GPUs on different nodes should make the same final GPU power limit choice. Assuming...

enhancement

Add warning for too short window in `ZeusMonitor`

When the energy measurement window for `ZeusMonitor` is too short (less than the update interval of the NVML energy counter), energy will be measured as 0. That's why we have...

enhancement

good first issue
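The check described in this issue could look roughly like the sketch below. It is not Zeus code: `check_window` is a hypothetical helper, and the 0.1 s default for the counter update interval is a placeholder assumption, not a documented NVML value.

```python
import warnings


def check_window(elapsed_s: float, energy_j: float, update_interval_s: float = 0.1) -> bool:
    """Warn and return True when a measurement window looks too short to trust.

    `update_interval_s` is an assumed placeholder for the energy counter's
    update interval; the real value depends on the GPU and driver.
    """
    too_short = elapsed_s < update_interval_s or energy_j == 0.0
    if too_short:
        warnings.warn(
            f"Measurement window of {elapsed_s:.4f} s may be shorter than the "
            f"energy counter update interval ({update_interval_s} s); "
            f"measured energy was {energy_j} J.",
            RuntimeWarning,
            stacklevel=2,
        )
    return too_short
```

A caller would pass in the elapsed time and energy it obtained from its own measurement window and decide whether to discard or re-measure when the function returns `True`.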

Generalized CUDA synchronize

PyTorch has `torch.cuda.synchronize`, which syncs CPU and GPU code execution. This is essential for accurate measurement. But there isn't one for JAX, which we do hope to support as first...

enhancement

good first issue

Automatic multi-arch Docker images in CI

NVIDIA's Grace CPU is ARM, which means we can eventually expect people to benefit from a native ARM Docker image. Hence [multi-platform images](https://docs.docker.com/build/building/multi-platform/) are the way forward, but when I...

enhancement

good first issue