envd icon indicating copy to clipboard operation
envd copied to clipboard

epic(observability): Observability in Machine Learning

Open VoVAllen opened this issue 2 years ago โ€ข 0 comments

Common requirements:

  • [ ] Log recording (i.e. when run python train.py store stdout/stderr in the log file (or some remote path) )
  • [ ] Error handling (i.e. when the python program killed by system, why? OOM? SegFault?)
  • [ ] Better to know which line is executed when killed by system (better back trace)
  • [ ] System metrics (memory/CPU/GPU statistics)
  • [ ] Notify user when the training is finished
  • [ ] #669

Based on the metrics, user can decide whether to retry the experiment.

VoVAllen avatar May 16 '22 04:05 VoVAllen