Jae-Won Chung
[`nvmlDeviceSetAPIRestriction`](https://docs.nvidia.com/deploy/nvml-api/group__nvmlDeviceCommands.html#group__nvmlDeviceCommands_1gde319ab4fb254a7c42625d6eca2bf37d) seems to offer a way to reduce the permissions required for setting the GPU's frequency (but not power limit). If this works, a workflow could be for an administrator...
AMD GPUs are seeing increased adoption. ROCm has nice compatibility layers with PyTorch, too. Plus, ROCm-SMI (apparently) has all the energy-related management APIs we need -- measuring power and energy,...
Currently, Chase is just an unmaintained branch. We should make it an actual optimizer in Zeus. Perhaps it's a good idea to generalize it to not just carbon, but time-varying...
NVML requires the Linux `SYS_ADMIN` capability for applications to set the GPU's power limit or frequency. In production environments, you can't just give your application containers `SYS_ADMIN` because it allows...
CPU and DRAM typically account for only a small fraction of the energy consumed by deep learning workloads and, since most of the heavy lifting is done by GPUs, it hasn't been that...
With multiple jobs tracking their energy consumption with the `ZeusMonitor`, it would be nice to be able to aggregate time/energy metrics to Prometheus. The metric name should be derived from...
`GlobalPowerLimitOptimizer` works well for single-node data parallel training, but in the case of distributed data parallel training, GPUs on different nodes should make the same final GPU power limit choice. Assuming...
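One possible way to make every node reach the same decision is to aggregate per-node profiling results and apply a deterministic selection rule. The sketch below is only an illustration under stated assumptions: `choose_global_power_limit` and its input format are hypothetical and are not part of Zeus, and in a real distributed job the aggregation would be done with a collective such as an all-reduce rather than on a local list.

```python
from typing import Dict, List


def choose_global_power_limit(per_node_costs: List[Dict[int, float]]) -> int:
    """Pick one power limit (in watts) that every node agrees on.

    Hypothetical helper, not a Zeus API. Each dict maps a candidate power
    limit to the cost that one node measured for it. Costs are summed across
    nodes (standing in for an all-reduce), and the limit with the lowest
    total cost wins. Ties go to the lower power limit so that the result is
    deterministic on every node.
    """
    if not per_node_costs:
        raise ValueError("need measurements from at least one node")

    # Only consider power limits that every node has profiled.
    common = set(per_node_costs[0])
    for costs in per_node_costs[1:]:
        common &= set(costs)
    if not common:
        raise ValueError("nodes share no common power limit candidates")

    # Sum each candidate's cost over all nodes.
    totals = {pl: sum(costs[pl] for costs in per_node_costs) for pl in common}

    # Sort key (total cost, power limit) gives a deterministic tie-break.
    return min(totals, key=lambda pl: (totals[pl], pl))


# Example: two nodes that would individually prefer different limits.
node_a = {100: 5.0, 150: 4.0, 200: 4.5}
node_b = {100: 5.5, 150: 4.2, 200: 3.9}
chosen = choose_global_power_limit([node_a, node_b])
```

Because every node evaluates the same aggregated totals with the same tie-break, all of them arrive at the same power limit even when their local measurements differ.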
When the energy measurement window for `ZeusMonitor` is too short (less than the update interval of the NVML energy counter), energy will be measured as 0. That's why we have...
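To illustrate the zero-energy problem, here is a minimal sketch of one common workaround: when the energy counter has not advanced during the window, estimate energy from sampled power multiplied by elapsed time. The function name and arguments are hypothetical and this is not necessarily how `ZeusMonitor` handles it. The only NVML fact assumed is that the total energy counter is reported in millijoules.

```python
from typing import Sequence


def window_energy_joules(
    counter_start_mj: int,
    counter_end_mj: int,
    power_samples_w: Sequence[float],
    elapsed_s: float,
) -> float:
    """Return the energy (in joules) consumed during a measurement window.

    Hypothetical helper, not a Zeus API. If the energy counter advanced,
    trust the counter difference. Otherwise the window was shorter than the
    counter's update interval, so fall back to average power * elapsed time.
    """
    delta_mj = counter_end_mj - counter_start_mj
    if delta_mj > 0:
        # Counter values are in millijoules; convert to joules.
        return delta_mj / 1000.0

    if not power_samples_w:
        # Nothing to approximate with; report zero as the counter did.
        return 0.0

    avg_power_w = sum(power_samples_w) / len(power_samples_w)
    return avg_power_w * elapsed_s


# A long window where the counter advanced by 2000 mJ.
long_window = window_energy_joules(1000, 3000, [200.0], 0.5)

# A 10 ms window where the counter did not move: approximate from power.
short_window = window_energy_joules(5000, 5000, [200.0, 300.0], 0.01)
```

The approximation is only as good as the power samples, so it is best treated as an estimate for very short windows rather than a replacement for the counter.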
PyTorch has `torch.cuda.synchronize`, which syncs CPU and GPU code execution. This is essential for accurate measurement. But there isn't one for JAX, which we do hope to support as first...
NVIDIA's Grace CPU is ARM, which means eventually we can expect people to benefit from a native ARM Docker image. Hence [multi-platform images](https://docs.docker.com/build/building/multi-platform/) are the way forward, but when I...