Jae-Won Chung

Results 58 issues of Jae-Won Chung

[`nvmlDeviceSetAPIRestriction`](https://docs.nvidia.com/deploy/nvml-api/group__nvmlDeviceCommands.html#group__nvmlDeviceCommands_1gde319ab4fb254a7c42625d6eca2bf37d) seems to offer a way to reduce the permissions required for setting the GPU's frequency (but not power limit). If this works, a workflow could be for an administrator...

enhancement

AMD GPUs are seeing increased adoption. ROCm has nice compatibility layers with PyTorch, too. Plus, ROCm-SMI (apparently) has all the energy-related management APIs we need -- measuring power and energy,...

enhancement

Currently, Chase is just an unmaintained branch. We should make it an actual optimizer in Zeus. Perhaps it's a good idea to generalize it to not just carbon, but time-varying...

enhancement

NVML requires the Linux `SYS_ADMIN` capability for applications to set the GPU's power limit or frequency. In production environments, you can't just give your application containers `SYS_ADMIN` because it allows...

enhancement

CPU and DRAM typically account for only a small fraction of the energy consumed by Deep Learning workloads and, since most of the heavy-lifting computation is done by GPUs, it hasn't been that...

enhancement

With multiple jobs tracking their energy consumption with the `ZeusMonitor`, it would be nice to be able to aggregate time/energy metrics to Prometheus. The metric name should be derived from...

enhancement

`GlobalPowerLimitOptimizer` works well for single-node data parallel training, but in the case of distributed data parallel, GPUs in different nodes should make the same final GPU power limit choice. Assuming...

enhancement

When the energy measurement window for `ZeusMonitor` is too short (less than the update interval of the NVML energy counter), energy will be measured as 0. That's why we have...

enhancement
good first issue
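
The zero-energy effect described in the `ZeusMonitor` issue above can be illustrated with a toy model. This is a minimal sketch, not the NVML or Zeus API: the function names, the 100 ms update interval, and the constant 200 W power draw are all hypothetical values chosen for illustration.

```python
# Toy model of an energy counter that only refreshes at fixed intervals.
# Everything here is hypothetical (names, 100 ms interval, 200 W draw);
# it does not call NVML or Zeus.

def counter_reading_mj(t_ms: int, update_interval_ms: int = 100, power_w: int = 200) -> int:
    """Counter value (millijoules) visible to a reader at time t_ms."""
    # The counter only reflects energy up to its most recent refresh.
    last_update_ms = (t_ms // update_interval_ms) * update_interval_ms
    # watts * milliseconds = millijoules
    return power_w * last_update_ms


def measured_energy_mj(start_ms: int, end_ms: int) -> int:
    """Energy reported by subtracting the counter readings at both ends of a window."""
    return counter_reading_mj(end_ms) - counter_reading_mj(start_ms)


# 40 ms window that falls entirely between two counter refreshes.
short_window = measured_energy_mj(110, 150)
# 340 ms window that spans several counter refreshes.
long_window = measured_energy_mj(110, 450)

print(short_window, long_window)
```

In this model the short window reads the same counter value at both ends, so the difference is 0 even though the device drew power the whole time. The longer window reports a nonzero value, though still quantized to the refresh interval.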

PyTorch has `torch.cuda.synchronize`, which syncs CPU and GPU code execution. This is essential for accurate measurement. But there isn't one for JAX, which we do hope to support as first...

enhancement
good first issue

NVIDIA's Grace CPU is ARM, which means eventually we can expect people to benefit from a native ARM Docker image. Hence [multi-platform images](https://docs.docker.com/build/building/multi-platform/) are the way forward, but when I...

enhancement
good first issue