Realtime training visualization using wandb
Just an additional script to visualize and track metrics in realtime using wandb. This will be useful when we have longer training runs or multinode training lasting many hours.
Metric graphs from 2 training runs
System overview
Ok yes this is probably a good idea 😅 . I'll leave some comments.
One more thing to be careful with and think about 🤔. If the process crashes or hangs and gets restarted, it starts logging again from the last checkpoint, i.e. some steps in the log will be repeated because we just append to the end of the file. I don't know what wandb does in these cases. E.g. see run124M.sh for how I have a while loop over the job; you could CTRL+C it manually and see what happens.
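One possible way to sidestep the repeated-step question is to dedupe the log before sending anything to wandb. Below is a minimal sketch, not what the script in this PR does. It assumes log lines of the form `s:<step> <name>:<value>` (e.g. `s:20 trl:3.91`); that format, and the names `LINE_RE` and `dedupe_log`, are made up for illustration. When a step appears more than once, the later value wins, on the assumption that the values written after a restart are the ones worth keeping.

```python
import re

# Assumed (hypothetical) line format: "s:<step> <name>:<value>", e.g. "s:20 trl:3.91".
LINE_RE = re.compile(r"^s:(\d+)\s+(\w+):(\S+)$")


def dedupe_log(lines):
    """Return [(step, {name: value})] sorted by step, one entry per step.

    If a step is logged more than once (e.g. after a restart from a
    checkpoint), the value that appears later in the file overwrites
    the earlier one.
    """
    latest = {}
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # skip lines that do not look like metric lines
        try:
            value = float(match.group(3))
        except ValueError:
            continue  # skip lines whose value is not a number
        step = int(match.group(1))
        name = match.group(2)
        latest.setdefault(step, {})[name] = value
    return [(step, latest[step]) for step in sorted(latest)]


if __name__ == "__main__":
    # Steps 20 and 30 are repeated, as if the job restarted from a checkpoint at step 10.
    example = [
        "s:10 trl:4.0",
        "s:20 trl:3.5",
        "s:30 trl:3.2",
        "s:20 trl:3.6",
        "s:30 trl:3.1",
    ]
    for step, metrics in dedupe_log(example):
        print(step, metrics)
```

Each `(step, metrics)` pair could then be passed to `wandb.log(metrics, step=step)`, so every step is sent once and in increasing order. This only works if the log is read in full before uploading; a script that tails the file live would additionally need to remember the highest step it has already sent.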