Stas Bekman

664 comments by Stas Bekman

The CI won't have multi-node, and I don't think any of us do either, and JZ can't give us multi-node in real time to support a test suite, so perhaps there can...

Could you please flag in the diff where the actual new additions are? It's hard to review such changes when things are moved around - I'm sure I missed that,...

I think the main issue wasn't so much that Megatron didn't have the checks. It's that we wanted to get the constraints right **before** launching Megatron. This is due to...

Please don't remove any existing checks. The whole purpose of this exercise was to see if we could glean additional checks that were added in NeoX - not to re-arrange...

Using a good visual debugger is the best way to do this kind of work. PyCharm is free and excellent. But surely there are others. The grad norms are gathered...

@tjruwase, do you have any suggestions as to how we could change stage2 to optionally calculate grad norm per layer and not just the L2 average over all gradients? The...
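To make the distinction in this comment concrete, here is a minimal sketch in plain Python of a per-layer gradient L2 norm next to a single global norm over all gradients. The function names and the dict-of-lists gradient layout are hypothetical, chosen only for illustration; this is not DeepSpeed's stage2 code and makes no claim about how it stores gradients.

```python
import math

def per_layer_grad_norms(grads):
    """Return {layer_name: L2 norm} for a hypothetical {name: [grad values]} mapping."""
    return {
        name: math.sqrt(sum(g * g for g in values))
        for name, values in grads.items()
    }

def global_grad_norm(grads):
    """Return one L2 norm computed over every gradient value at once."""
    return math.sqrt(sum(g * g for values in grads.values() for g in values))

# Toy gradients (assumed values, not taken from any real run).
grads = {"layer.0.weight": [3.0, 4.0], "layer.1.weight": [0.0, 12.0]}
layer_norms = per_layer_grad_norms(grads)
total_norm = global_grad_norm(grads)
```

The global norm is the square root of the sum of the squared per-layer norms, so the per-layer view carries strictly more information than the single number.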

@jaketae, the other simpler approach is to not touch DeepSpeed and develop a generic tool that will work with any framework or model. Please have a look at this tool...

Re: doing it separately: Try the weights tool first. There is a detailed guide here: https://huggingface.co/transformers/master/debugging.html So we want the same but for gradients, the concept is identical - but...

I haven't tried your code but conceptually your code looks right, Jake. I think it should just work regardless of how it's used. We aren't dealing with ZeRO-2 where grads...

What you said and please add the layer index. If you remember we might not log all layers but only sample a few to keep things faster and not too...
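As a rough illustration of logging only a sample of layers together with their layer index, here is a minimal sketch in plain Python. The function name, the `every` parameter, and the list-of-lists gradient layout are assumptions made for this example, not part of any existing tool.

```python
import math

def sampled_layer_norms(layer_grads, every=2):
    """Return [(layer_index, l2_norm)] for every `every`-th layer only."""
    records = []
    for idx, values in enumerate(layer_grads):
        if idx % every != 0:
            continue  # skip unsampled layers to keep logging cheap
        records.append((idx, math.sqrt(sum(g * g for g in values))))
    return records

# Toy per-layer gradients (assumed values); index in the list is the layer index.
layer_grads = [[3.0, 4.0], [1.0, 1.0], [6.0, 8.0], [2.0, 2.0]]
records = sampled_layer_norms(layer_grads, every=2)
for idx, norm in records:
    print(f"layer={idx} grad_norm={norm:.4f}")
```

Keeping the index in each record means a sampled log can still be matched back to the exact layer it came from.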