Stephen Roller

Results 117 comments of Stephen Roller

Edit: moved to top

- [x] have the slurm snapshot code in the directory with training logs so there is a 1:1

@KUNAL1612 provided this writeup of some errors and pain points (internal only): https://docs.google.com/document/d/1Kdiq0ef3IQvHWYQlnLOksAwUblT9lDbhcE5OpLwGfG0/edit

We have cython extensions that depend on numpy. I've found that accidental upgrades of numpy can break those extensions and cause very weird build errors on installation. So I would...

@punitkoura is actually working on this

I looked into this and it was trickier than I expected. We have to make a fair amount of changes to megatron.

Thanks, I have two people trying this on smaller hardware now to see if it works. I'll update with their findings (if you don't hear back further in a few...

Oh, `-n {model_parallel}` may need to become `-n {num_nodes * model_parallel}`
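
To make the arithmetic in that suggestion concrete, here is a rough sketch. The helper name `task_count_flag` is hypothetical and not part of any launcher or of the codebase; it only formats the `-n` value and assumes `num_nodes` and `model_parallel` are positive integers.

```python
def task_count_flag(num_nodes: int, model_parallel: int) -> str:
    """Format the `-n` argument as num_nodes * model_parallel.

    Hypothetical helper for illustration only; which launcher consumes
    the flag is not specified here.
    """
    if num_nodes < 1 or model_parallel < 1:
        raise ValueError("num_nodes and model_parallel must be positive")
    return f"-n {num_nodes * model_parallel}"


# On one node this matches the old `-n {model_parallel}` form.
print(task_count_flag(num_nodes=1, model_parallel=8))
# On several nodes the count scales with the node count.
print(task_count_flag(num_nodes=2, model_parallel=8))
```

With a single node the two forms agree, which may be why the original `-n {model_parallel}` only misbehaves on multi-node runs.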

Did you try the change to the `-n` args? Can you paste a log for us?