envd
epic(observability): Observability in Machine Learning
Common requirements:
- [ ] Log recording (i.e. when running `python train.py`, store stdout/stderr in a log file or at some remote path)
- [ ] Error handling (i.e. when the Python program is killed by the system, why? OOM? Segfault?)
- [ ] Ideally, report which line was executing when the process was killed by the system (a better backtrace)
- [ ] System metrics (memory/CPU/GPU statistics)
- [ ] Notify the user when training finishes
- [ ] #669
Based on these metrics, the user can decide whether to retry the experiment.
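
To make the log recording and error handling items more concrete, here is a minimal sketch in plain Python. It is not how envd implements (or plans to implement) this; the helper names `run_and_log` and `describe_exit` are hypothetical, and the signal handling assumes a POSIX system. It relies on the fact that `subprocess` reports a child killed by signal N as return code -N.

```python
import signal
import subprocess


def run_and_log(cmd, log_path):
    """Run cmd, write its combined stdout/stderr to log_path, return the exit code.

    Hypothetical helper for illustration only.
    """
    with open(log_path, "wb") as log:
        # stderr is merged into stdout so both end up in the same log file.
        proc = subprocess.run(cmd, stdout=log, stderr=subprocess.STDOUT)
    return proc.returncode


def describe_exit(returncode):
    """Turn a subprocess return code into a short human-readable reason."""
    if returncode == 0:
        return "finished successfully"
    if returncode < 0:
        # On POSIX, a return code of -N means the child was terminated by signal N.
        try:
            sig = signal.Signals(-returncode)
        except ValueError:
            return f"killed by unknown signal {-returncode}"
        hint = ""
        if sig == signal.SIGKILL:
            # SIGKILL is what the kernel OOM killer sends, but it is only a hint:
            # anything else may have sent SIGKILL too.
            hint = " (possibly the OOM killer)"
        elif sig == signal.SIGSEGV:
            hint = " (segmentation fault)"
        return f"killed by {sig.name}{hint}"
    return f"exited with error code {returncode}"
```

For example, `describe_exit(run_and_log([sys.executable, "-X", "faulthandler", "train.py"], "train.log"))` (with `import sys`) would record the output in `train.log` and summarize why the run ended. The `-X faulthandler` interpreter option makes Python dump a traceback on fatal signals such as SIGSEGV, which helps with the backtrace item; it cannot help with SIGKILL, since that signal cannot be caught.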