Megatron-DeepSpeed

Set up a basic MLflow setup

Open · sashavor opened this issue 2 years ago • 1 comment

Replicate all of the TensorBoard logging in Meg-DS, and also log hyperparameters of choice. At the code level:

  1. Mirror the TensorBoard logging 1:1, but log through the MLflow API
  2. Find new places to log additional data (e.g. hyperparameters)
  3. Working groups (WGs) that want to log specific events/data will add those logging calls directly to the Meg-DS codebase
  4. For now, the only configuration is the --mlflow-dir on/off toggle, which enables logging of all MLflow events/data
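The plan above could look roughly like the following minimal sketch. It is not Meg-DS code: add_mlflow_args and MirroredLogger are hypothetical names, the sketch assumes the value of --mlflow-dir is passed straight to mlflow.set_tracking_uri (so either a file:/ path or the tracking server's http:// URL), and it assumes the existing TensorBoard writer exposes add_scalar(tag, value, step) like torch.utils.tensorboard.SummaryWriter.

```python
import argparse
from typing import Any, Dict, Optional


def add_mlflow_args(parser: argparse.ArgumentParser) -> argparse.ArgumentParser:
    # Hypothetical flag modelled on the issue's --mlflow-dir toggle:
    # unset -> MLflow logging off; set -> all mirrored events are logged.
    parser.add_argument(
        "--mlflow-dir",
        type=str,
        default=None,
        help="MLflow tracking URI (e.g. file:/path/to/mlruns or an http:// server); enables MLflow logging",
    )
    return parser


class MirroredLogger:
    """Hypothetical wrapper: every TensorBoard scalar is also sent to MLflow."""

    def __init__(self, tb_writer: Any = None, mlflow_dir: Optional[str] = None,
                 mlflow_module: Any = None) -> None:
        self.tb_writer = tb_writer  # e.g. a SummaryWriter, or None if TensorBoard is off
        self.mlflow = None
        if mlflow_dir is not None:
            if mlflow_module is None:
                import mlflow as mlflow_module  # imported only when the toggle is on
            mlflow_module.set_tracking_uri(mlflow_dir)
            mlflow_module.start_run()
            self.mlflow = mlflow_module

    def add_scalar(self, tag: str, value: float, step: int) -> None:
        # Same call shape as SummaryWriter.add_scalar, so existing
        # TensorBoard call sites can be switched over 1:1 (item 1).
        if self.tb_writer is not None:
            self.tb_writer.add_scalar(tag, value, step)
        if self.mlflow is not None:
            self.mlflow.log_metric(tag, float(value), step=step)

    def log_hparams(self, params: Dict[str, Any]) -> None:
        # New logging point (item 2): hyperparameters become MLflow run params.
        if self.mlflow is not None:
            self.mlflow.log_params(params)

    def close(self) -> None:
        # Only the MLflow run is ended here; the TensorBoard writer stays
        # with whatever code created it.
        if self.mlflow is not None:
            self.mlflow.end_run()
            self.mlflow = None
```

Keeping the add_scalar signature identical to the TensorBoard writer means call sites only need to swap the object they write to, and importing mlflow lazily means runs without --mlflow-dir do not need MLflow installed.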

Example: https://gist.github.com/tsaoyu/14e39a6d246cb29b107a2cc62a12f7a3

Blockers:

  • [ ] @JetRunner to set up the MLflow server

sashavor · Aug 31 '21 18:08

The server is at http://deplo-mlflo-1s4xwzhh8tic4-97cf518635d8c72d.elb.us-east-2.amazonaws.com/

JetRunner · Sep 01 '21 10:09