Stas Bekman comments

Results: 664 comments



[testing] data size / dynamic downloads - test speed and repo bloat

The CI won't have multi-node, and I don't think any of us does either, and JZ can't give us multi-node in real time to support a test suite, so perhaps there can...

Add `_validate_args`

Could you please flag in the diff where the actual new additions are? It's hard to review such changes when things are moved around - I'm sure I missed that,...

Add `_validate_args`

I think the main issue wasn't so much that megatron didn't have the checks. It's that we wanted to get the constraints right **before** launching megatron. This is due to...

Add `_validate_args`

Please don't remove any existing checks. The whole purpose of this exercise was to see if we could glean additional checks that were added in NeoX - not to re-arrange...

extract and log grad norm for individual layers

Using a good visual debugger is the best way to do this kind of work. PyCharm is free and excellent. But surely there are others. The grad norms are gathered...

extract and log grad norm for individual layers

@tjruwase, do you have any suggestions as to how we could change stage2 to optionally calculate grad norm per layer and not just the L2 average over all gradients? The...

extract and log grad norm for individual layers

@jaketae, the other simpler approach is to not touch Deepspeed and develop a generic tool that will work with any framework or model. Please have a look at this tool...

extract and log grad norm for individual layers

Re: doing it separately: Try the weights tool first. There is a detailed guide here: https://huggingface.co/transformers/master/debugging.html So we want the same but for gradients, the concept is identical - but...

extract and log grad norm for individual layers

I haven't tried your code but conceptually your code looks right, Jake. I think it should just work regardless of how it's used. We aren't dealing with ZeRO-2 where grads...

extract and log grad norm for individual layers

What you said, and please add the layer index. If you remember, we might not log all layers but only sample a few to keep things faster and not too...
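
To make the idea in this thread concrete, here is a minimal, framework-agnostic sketch of per-layer grad norm logging with a layer index and optional sampling. It is not the code discussed in the thread. The function names (`layer_index`, `per_layer_grad_norms`), the `sample_every` parameter, and the convention that the layer index is the first purely numeric component of a dotted parameter name are all assumptions made for illustration.

```python
import math
import re
from typing import Dict, Iterable, Optional, Tuple


def layer_index(name: str) -> Optional[int]:
    """Return the first purely numeric component of a dotted parameter name.

    Assumed naming convention (hypothetical): "layers.3.attention.weight" -> 3.
    Names without a numeric component (e.g. "embed.weight") return None.
    """
    match = re.search(r"(?:^|\.)(\d+)(?:\.|$)", name)
    return int(match.group(1)) if match else None


def per_layer_grad_norms(
    grads: Dict[str, Iterable[float]], sample_every: int = 1
) -> Tuple[Dict[int, float], float]:
    """Compute the L2 grad norm per layer index, plus the global L2 norm.

    grads maps a parameter name to its flattened gradient values.
    Only layers whose index is a multiple of sample_every are reported,
    but the global norm always covers every gradient.
    """
    squared_sums: Dict[Optional[int], float] = {}
    for name, values in grads.items():
        idx = layer_index(name)
        squared = sum(v * v for v in values)
        squared_sums[idx] = squared_sums.get(idx, 0.0) + squared

    # The global norm is the square root of the sum of all squared entries.
    total_norm = math.sqrt(sum(squared_sums.values()))

    # Report only the sampled layers that actually have a layer index.
    sampled = {
        idx: math.sqrt(sq)
        for idx, sq in squared_sums.items()
        if idx is not None and idx % sample_every == 0
    }
    return sampled, total_norm


# Toy gradients standing in for real ones.
example_grads = {
    "layers.0.weight": [3.0, 4.0],
    "layers.1.weight": [0.0, 12.0],
    "layers.2.bias": [1.0],
    "embed.weight": [0.0],
}
norms, total = per_layer_grad_norms(example_grads, sample_every=2)
for idx in sorted(norms):
    print(f"layer {idx}: grad norm {norms[idx]:.4f}")
print(f"global grad norm: {total:.4f}")
```

With PyTorch, the input dictionary could be built from `model.named_parameters()` by reading each parameter's `.grad` after the backward pass. A real tool would likely compute the norms on-device instead of converting to Python lists, and with ZeRO-style sharding the locally visible gradients may be only a partition, so the sketch applies as-is only where full gradients are available.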