Realtime training visualization using wandb

Open chinthysl opened this issue 1 year ago • 2 comments

Just an additional script to visualize and track metrics in realtime using wandb. This will be useful for longer training runs and for multinode training that lasts many hours.
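For readers who want a feel for what such a script involves, here is a minimal sketch, not the script from this PR. It assumes the training log is plain text with one record per line in a hypothetical `s:<step> key:value ...` layout, and it takes the logging call as a parameter (`log_fn`, a made-up name) so the example runs without wandb installed.

```python
from typing import Callable, Dict, Iterable, Optional, Tuple

Record = Tuple[int, Dict[str, float]]


def parse_line(line: str) -> Optional[Record]:
    """Parse 's:<step> key:value ...' into (step, metrics); None if malformed.

    The line layout is an assumption made for this sketch.
    """
    fields = dict(tok.split(":", 1) for tok in line.split() if ":" in tok)
    if "s" not in fields:
        return None
    try:
        step = int(fields.pop("s"))
        metrics = {key: float(value) for key, value in fields.items()}
    except ValueError:
        return None
    return step, metrics


def forward_metrics(lines: Iterable[str], log_fn: Callable[..., None]) -> int:
    """Send every parseable record to log_fn(metrics, step=step).

    Returns the number of records that were forwarded.
    """
    sent = 0
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue  # skip blank or unrecognized lines
        step, metrics = parsed
        if metrics:
            log_fn(metrics, step=step)
            sent += 1
    return sent
```

With wandb installed, one would call `wandb.init(project=...)` once and then pass `wandb.log` as `log_fn`, since `wandb.log` accepts a metrics dict and a `step` keyword. Feeding the function a file that is still being written (re-reading new lines in a loop) is what would make it realtime; that part is left out here.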

Metric graphs from 2 training runs: [image: metric_graphs]

System overview: [image: system_graphs]

May 29 '24 07:05 chinthysl

Ok yes this is probably a good idea 😅 . I'll leave some comments.

May 29 '24 14:05 karpathy

One more thing to be careful with and think about 🤔. If the process crashes or hangs and gets restarted, it starts logging again from the last checkpoint, i.e. some steps in the log will be repeated because we just append to the end. I don't know what wandb does in these cases. E.g. see run124M.sh for how I have a while loop over the job; you could CTRL+C it manually and see what happens.
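One way to handle the repeated steps before they reach any dashboard is to treat a step number that goes backwards as the sign of a restart, and to discard whatever was logged at or after the resumed step. The sketch below assumes records have already been parsed into `(step, metrics)` pairs; the function name `drop_replayed_steps` is made up for illustration and is not part of this PR.

```python
from typing import Dict, Iterable, List, Tuple

Record = Tuple[int, Dict[str, float]]


def drop_replayed_steps(records: Iterable[Record]) -> List[Record]:
    """Return records with pre-restart duplicates removed.

    When a record's step is lower than the previous record's step, the run
    is assumed to have restarted from a checkpoint, so everything already
    kept at or after that step is thrown away in favor of the replay.
    """
    kept: List[Record] = []
    for step, metrics in records:
        if kept and step < kept[-1][0]:
            # Step went backwards: roll back to just before the resumed step.
            while kept and kept[-1][0] >= step:
                kept.pop()
        kept.append((step, metrics))
    return kept
```

A restart that resumes at exactly the last logged step is not detected by this check, because equal steps are allowed (for example a train and a validation record at the same step). On the wandb side, `wandb.init` takes `id` and `resume` arguments for continuing an existing run, which is probably the place to start when testing the CTRL+C scenario.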

May 29 '24 15:05 karpathy