Stas Bekman

Results: 664 comments by Stas Bekman

@TevenLeScao, while we are at it, why do we print:

1. estimated model parameters:
2. estimated model parameters without embeddings:

for the whole model? What's the practical point of...

OK, then something is very wrong in the reporting. E.g., for the 104B model it prints:

```
estimated model parameters without embeddings: 103.368064
estimated model parameters: 125.22432
```

The formula...

@TevenLeScao, I looked deeper and we are counting wrong for PP, e.g. see: First, let's do some approximate math manually for this config: ``` NLAYERS=8 NHIDDEN=512 SEQ_LEN=1024 VOCAB_SIZE=50257 EMB_PARAMS=$((VOCAB_SIZE *...
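The formula in the comment above is cut off, so the following is only a rough sketch of what such a manual estimate can look like for the quoted config. It assumes a plain GPT-2-style layer (fused QKV, 4x MLP, two layer norms, all with biases), learned position embeddings, and an output layer tied to the input embeddings. The function name `estimate_params` and this breakdown are illustrative assumptions, not the formula used in the thread or in Megatron-DeepSpeed.

```python
def estimate_params(n_layers, hidden, seq_len, vocab_size):
    """Approximate parameter count for a GPT-2-style transformer.

    Assumed (hypothetical) breakdown per layer:
      attention QKV:   3*h*h + 3*h
      attention out:     h*h +   h
      MLP up (h->4h):  4*h*h + 4*h
      MLP down (4h->h):4*h*h +   h
      two layer norms:         4*h
    which sums to 12*h*h + 13*h.
    """
    h = hidden
    per_layer = 12 * h * h + 13 * h
    final_layernorm = 2 * h  # weight + bias

    # Word embeddings plus learned position embeddings; the output
    # projection is assumed tied, so it adds no extra parameters.
    embeddings = vocab_size * h + seq_len * h

    without_embeddings = n_layers * per_layer + final_layernorm
    total = without_embeddings + embeddings
    return total, without_embeddings


if __name__ == "__main__":
    # Values taken from the config quoted in the comment above.
    total, no_emb = estimate_params(
        n_layers=8, hidden=512, seq_len=1024, vocab_size=50257
    )
    print(f"estimated model parameters: {total / 1e9:.6f}B")
    print(f"estimated model parameters without embeddings: {no_emb / 1e9:.6f}B")
```

A hand estimate like this gives a reference number to compare against what the training log reports for different TP/PP layouts. Any padding of the vocabulary size would change the embedding term, which this sketch ignores.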

Yes, and I pointed to the code that produces the correct data! The current 20 to 50% over-reporting of the size gives results that are very different from what is really happening. Do...

And to debug while you're working on it, you may do the same as I did [here](https://github.com/bigscience-workshop/Megatron-DeepSpeed/pull/204#issuecomment-981242703), that is, tweaking:

```
N_GPUS=2
TP_SIZE=2
PP_SIZE=1
```

to 3 different setups...

Both outputs already show up in the same log file. Please see https://github.com/bigscience-workshop/Megatron-DeepSpeed/pull/204#issuecomment-981242703 - I just filtered that information out of the rest of the logged info. You just change...

Need to try the suggestion here https://github.com/machulav/ec2-github-runner/issues/69#issuecomment-913026079 to have separate workflow and test-job concurrency levels, so that `cancel-in-progress` applies only to the test job; that way the race condition of cancelling the start...

@thomasw21 also shared this: https://github.com/microsoft/DeepSpeed/blob/c7f3bc51c27884ad80dcafe4aa60f070c1dfa26e/deepspeed/runtime/pipe/engine.py#L117-L126 which seems to be related to this issue.

What are you after: the bf16 weights split across TPs, or the optim states (that's 2.3TB of data)? I don't know what: "had problem in resolving the layer files"...

OK, so you do want the original weights, got it. We have a script that converts from Meg-DS to HF, but not the other way around. I will...