Stephen Roller
Edit: moved to top
- [x] have the slurm snapshot code in the directory with training logs so there is a 1:1
@KUNAL1612 provided this writeup of some errors and pain points (internal only): https://docs.google.com/document/d/1Kdiq0ef3IQvHWYQlnLOksAwUblT9lDbhcE5OpLwGfG0/edit
We have cython extensions that depend on numpy. I've found that accidental upgrades of numpy can break those extensions and cause very weird build errors on installation. So I would...
@punitkoura is actually working on this
Punit has a patch I believe.
I looked into this and it was trickier than I expected. We have to make a fair amount of changes to megatron.
Thanks, I have two people trying this on smaller hardware now to see if it works. I'll update with their findings (if you don't hear back further in a few...
Oh, `-n {model_parallel}` may need to become `-n {num_nodes * model_parallel}`
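For illustration only, the arithmetic behind that change can be sketched as below. The helper names `total_tasks` and `n_flag` are hypothetical (they are not part of the codebase), and the sketch assumes `model_parallel` counts processes per node, so the total passed to `-n` is the product. Whether the launcher actually needs the multiplied value is still the open question here.

```python
def total_tasks(num_nodes: int, model_parallel: int) -> int:
    """Total process count for `-n`.

    Assumption: `model_parallel` is the number of processes per node,
    so a multi-node launch needs num_nodes * model_parallel tasks.
    """
    if num_nodes < 1 or model_parallel < 1:
        raise ValueError("num_nodes and model_parallel must be positive")
    return num_nodes * model_parallel


def n_flag(num_nodes: int, model_parallel: int) -> str:
    """Render the `-n` argument as a string (hypothetical helper)."""
    return f"-n {total_tasks(num_nodes, model_parallel)}"


if __name__ == "__main__":
    # With a single node the flag is unchanged from `-n {model_parallel}`.
    print(n_flag(num_nodes=1, model_parallel=8))
    # With two nodes the task count doubles.
    print(n_flag(num_nodes=2, model_parallel=8))
```

With one node the two forms agree, which would explain why the single-node case works while the multi-node case does not.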
Did you try the change to the `-n` args? Can you paste a log for us?