Jae-Won Chung

Results 58 issues of Jae-Won Chung

Perseus is an energy scheduler for large model training (although we're looking into applying this for large model inference, too). Perseus requires the time and energy consumption profiling results of...

enhancement

I'm drawing a timeline of DCGM metrics (gathered with `DcgmReader` and update interval 10 ms) together with Python application-level metrics like the number of running requests at each moment. DCGM...

We want to check whether the ticket's `status` field was actually patched to `Status::InProgress`.

Image generation with diffusion models is a workload that many people would be interested in.

Coding has become an important task for LLMs. We have good publicly available models too: CodeLlama, StarCoder, SantaCoder, Salesforce CodeT5 & CodeGen, etc. There are also nice evaluation platforms like...

NVML is not supported on Jetson platforms, e.g. Jetson Nano. Supporting Jetson platforms can be very useful for people who do ML/MLSys research on embedded platforms. Related discussion entry: https://github.com/ml-energy/zeus/discussions/102...

enhancement

In a system, there can be multiple CPU sockets. Just like GPUs, after #36, Zeus will be able to measure the energy consumption of specific CPU packages. However, unlike GPUs,...

enhancement

Right now, `zeusd` assumes NVML operations will mostly succeed. However, for this to be more robust, we want to handle more failure cases. NVML might hang for some unknown reason,...

enhancement
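
One general way to guard against a call that may hang, as the `zeusd` issue above describes for NVML, is to run it on a separate thread and stop waiting after a deadline. The sketch below is only an illustration of that idea in Python and is not `zeusd`'s code; `call_with_timeout` and `OperationTimedOut` are hypothetical names, and the wrapped function stands in for any blocking operation.

```python
import queue
import threading
from typing import Any, Callable


class OperationTimedOut(Exception):
    """Raised when the wrapped call does not finish within the deadline."""


def call_with_timeout(fn: Callable[[], Any], timeout_s: float) -> Any:
    """Run fn on a daemon thread and stop waiting after timeout_s seconds."""
    result_box: queue.Queue = queue.Queue(maxsize=1)

    def worker() -> None:
        try:
            result_box.put(("ok", fn()))
        except Exception as exc:  # forward failures to the caller
            result_box.put(("err", exc))

    # A daemon thread does not block interpreter exit if fn never returns.
    threading.Thread(target=worker, daemon=True).start()
    try:
        kind, value = result_box.get(timeout=timeout_s)
    except queue.Empty:
        raise OperationTimedOut(f"no result after {timeout_s} s") from None
    if kind == "err":
        raise value
    return value
```

Note that this only abandons the hung call rather than cancelling it: the worker thread keeps running in the background, so a caller would still need its own policy for what to do with the affected device afterwards.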

Depends on #30 (Prometheus metric exporter integration). Energy metrics: How much energy is being consumed? How do users measure savings? - Grafana dashboard for cluster-wide energy usage and breakdowns to...

integration
roadmap