Megatron-DeepSpeed
Set up basic MLflow logging
Replicate all the TensorBoard logging in Meg-DS, plus log hyperparams of choice. So on the code level:
- repeat the TensorBoard logging 1:1, but log using the MLflow API
- find new places to log new things (e.g. hyperparams)
- WGs that want to log specific events/data will add those directly to Meg-DS code base
- Currently the config is just `--mlflow-dir`, an on/off toggle which will log all MLflow events/data
example: https://gist.github.com/tsaoyu/14e39a6d246cb29b107a2cc62a12f7a3
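
A rough sketch of what the 1:1 mirroring could look like, assuming a thin wrapper that forwards every scalar to both backends. The names `DualLogger`, `build_logger` and `parse_args` are hypothetical and not part of Meg-DS, and treating the `--mlflow-dir` value as the MLflow tracking URI is also an assumption. Only `mlflow.set_tracking_uri`, `mlflow.start_run`, `mlflow.log_params` and `mlflow.log_metric` are real MLflow calls.

```python
import argparse


class DualLogger:
    """Hypothetical helper: forwards scalars to TensorBoard and, if enabled, MLflow."""

    def __init__(self, tb_writer=None, mlflow_module=None):
        # tb_writer: any object with add_scalar(name, value, step),
        #            e.g. torch.utils.tensorboard.SummaryWriter
        # mlflow_module: the imported mlflow module, or None when MLflow is off
        self.tb_writer = tb_writer
        self.mlflow = mlflow_module

    def log_params(self, params):
        # Hyperparams only go to MLflow; values are stringified for readability.
        if self.mlflow is not None:
            self.mlflow.log_params({k: str(v) for k, v in params.items()})

    def add_scalar(self, name, value, step):
        # Same call shape as the TensorBoard writer, so existing call sites
        # could switch over without other changes.
        if self.tb_writer is not None:
            self.tb_writer.add_scalar(name, value, step)
        if self.mlflow is not None:
            self.mlflow.log_metric(name, float(value), step=step)


def parse_args(argv=None):
    parser = argparse.ArgumentParser()
    # Acts as the on/off toggle: MLflow logging is enabled only when this is set.
    parser.add_argument("--mlflow-dir", type=str, default=None)
    return parser.parse_args(argv)


def build_logger(args, tb_writer=None):
    mlflow_module = None
    if args.mlflow_dir:
        import mlflow  # imported lazily so runs without the flag do not need it

        # Assumption: the flag value is used as the tracking URI
        # (a local directory or the address of a tracking server).
        mlflow.set_tracking_uri(args.mlflow_dir)
        mlflow.start_run()
        mlflow_module = mlflow
    return DualLogger(tb_writer, mlflow_module)
```

With something like this, a call such as `logger.add_scalar("lm loss", loss, iteration)` would reach both backends, and `logger.log_params(vars(args))` could cover the hyperparams once at startup.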
Blocking events:
- [ ] @JetRunner setting up the MLFlow server
The server's at http://deplo-mlflo-1s4xwzhh8tic4-97cf518635d8c72d.elb.us-east-2.amazonaws.com/