Better energy observability
Depends on #30 (Prometheus metric exporter integration).
Energy metrics: How much energy is being consumed? How do users measure savings?
- Grafana dashboard for cluster-wide energy usage, broken down by individual training jobs integrated with Zeus
- CPU and DRAM energy measurement (#36) will help distinguish Zeus from DCGM
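
To make the cluster-wide versus per-job breakdown concrete, here is a minimal sketch of how per-job, per-component energy samples might be rendered in the Prometheus text exposition format for a dashboard to scrape. The metric name `training_job_energy_joules_total`, the label names, and both helper functions are hypothetical and are not part of Zeus or of #30.

```python
from collections import defaultdict


def render_energy_metrics(samples):
    """Render energy samples as Prometheus text exposition lines.

    `samples` is a list of (job, component, joules) tuples.
    The metric and label names below are hypothetical, chosen for illustration.
    """
    # Sum energy per (job, component) pair so each series appears once.
    totals = defaultdict(float)
    for job, component, joules in samples:
        totals[(job, component)] += joules

    lines = [
        "# HELP training_job_energy_joules_total Energy consumed by a training job.",
        "# TYPE training_job_energy_joules_total counter",
    ]
    for (job, component), joules in sorted(totals.items()):
        lines.append(
            f'training_job_energy_joules_total{{job="{job}",component="{component}"}} {joules}'
        )
    return "\n".join(lines) + "\n"


def cluster_total(samples):
    """Cluster-wide energy is the sum over all jobs and components."""
    return sum(joules for _, _, joules in samples)


samples = [
    ("job-a", "gpu", 100.0),
    ("job-a", "gpu", 50.0),
    ("job-a", "cpu", 20.0),
    ("job-b", "gpu", 30.0),
]
print(render_energy_metrics(samples))
print("cluster total (J):", cluster_total(samples))
```

Keeping `component` as a label would let a dashboard separate GPU energy from the CPU and DRAM energy that #36 is meant to add.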
Experiment managers: Each training experiment can be associated with its energy consumption (aggregate and over time).
- Weights & Biases
- MLflow
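
One possible shape for such an integration is sketched below: a small tracker that turns a cumulative energy counter into per-step and aggregate metrics and hands them to whatever logger the experiment manager provides. `EnergyTracker`, the `read_energy_joules` callable, and the metric keys are all assumptions made for illustration; none of them is an existing Zeus API.

```python
from typing import Callable, Dict


class EnergyTracker:
    """Hypothetical helper: logs per-step and cumulative energy for a run."""

    def __init__(
        self,
        read_energy_joules: Callable[[], float],
        log_fn: Callable[[Dict[str, float], int], None],
    ) -> None:
        # `read_energy_joules` is assumed to return a monotonically
        # increasing energy counter in joules.
        self._read = read_energy_joules
        self._log = log_fn
        self._last = self._read()
        self._total = 0.0

    def step(self, step: int) -> None:
        """Record energy consumed since the previous call and log it."""
        now = self._read()
        delta = now - self._last
        self._last = now
        self._total += delta
        self._log(
            {"energy/step_joules": delta, "energy/total_joules": self._total},
            step,
        )


# Usage with a fake counter and an in-memory logger.
readings = iter([0.0, 10.0, 25.0])
logged = []
tracker = EnergyTracker(
    lambda: next(readings),
    lambda metrics, step: logged.append((step, metrics)),
)
tracker.step(0)
tracker.step(1)
print(logged)
```

With Weights & Biases, `log_fn` could be a thin wrapper such as `lambda m, s: wandb.log(m, step=s)`; with MLflow, `lambda m, s: mlflow.log_metrics(m, step=s)`. The per-step series gives the over-time view, and the running total gives the aggregate.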