
Writing a Prometheus exporter

Open SMesForoush opened this issue 3 years ago • 1 comments

Describe the feature

Application monitoring is essential for every production software system. Prometheus is an open-source monitoring system created at SoundCloud in 2012. The Prometheus server collects metrics from your servers and other monitoring targets by scraping their metrics endpoints over HTTP at a predefined interval. Ephemeral and batch jobs are too short-lived to be scraped periodically, so for them Prometheus offers the Pushgateway, an intermediate server that a job can push its metrics to before exiting. Adding a Prometheus exporter to the project would help monitor the health of the cluster, and custom metrics would make it easy to monitor the training process as well.
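To make the request more concrete, here is a minimal sketch of the pull model described above: a small HTTP endpoint that serves training metrics in the Prometheus text exposition format. It uses only the Python standard library so it is self-contained. The metric names (`training_steps_total`, `training_loss`), the `TrainingMetrics` class, the `start_exporter` helper, and the port are all hypothetical and not part of ColossalAI.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class TrainingMetrics:
    """Thread-safe holder for two hypothetical training metrics."""

    def __init__(self):
        self._lock = threading.Lock()
        self._steps_total = 0   # counter: only ever increases
        self._last_loss = 0.0   # gauge: may go up or down

    def record_step(self, loss):
        """Call once per training step (assumed hook, not a ColossalAI API)."""
        with self._lock:
            self._steps_total += 1
            self._last_loss = float(loss)

    def render(self):
        """Render the metrics in the Prometheus text exposition format."""
        with self._lock:
            steps, loss = self._steps_total, self._last_loss
        lines = [
            "# HELP training_steps_total Number of completed training steps.",
            "# TYPE training_steps_total counter",
            f"training_steps_total {steps}",
            "# HELP training_loss Loss of the most recent training step.",
            "# TYPE training_loss gauge",
            f"training_loss {loss}",
        ]
        return "\n".join(lines) + "\n"


def make_handler(metrics):
    """Build a request handler class bound to one TrainingMetrics instance."""

    class MetricsHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path != "/metrics":
                self.send_error(404)
                return
            body = metrics.render().encode("utf-8")
            self.send_response(200)
            self.send_header(
                "Content-Type", "text/plain; version=0.0.4; charset=utf-8"
            )
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, format, *args):
            pass  # keep scrape requests out of the training logs

    return MetricsHandler


def start_exporter(metrics, host="127.0.0.1", port=8000):
    """Serve /metrics from a daemon thread so training is not blocked."""
    server = HTTPServer((host, port), make_handler(metrics))
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

A training loop would call `start_exporter(metrics)` once and `metrics.record_step(loss)` after each step, and the Prometheus server would then be configured to scrape that host and port. In practice the official `prometheus_client` package would probably be a better base than hand-rolled rendering, since it provides `Counter`, `Gauge`, and `start_http_server`, and also supports pushing to a Pushgateway for short-lived jobs.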

SMesForoush avatar Mar 31 '22 14:03 SMesForoush

Hi @SMesForoush ! Thank you for your feature request!

We are thinking over your idea. Would you mind telling us a bit about your usage scenario? It would help a lot.

ofey404 avatar Jul 27 '22 04:07 ofey404

The project has been updated a lot since this was opened. This issue was closed due to inactivity. Thanks.

binmakeswell avatar Apr 13 '23 03:04 binmakeswell