Stas Bekman
I think it may be pretty essential should we run into instabilities during 200B training. It hasn't been resolved or fallen out of relevance. It's waiting for your magic touch,...
I didn't validate every clause you mentioned, but your diagnostics sound correct, Jaesung. Perhaps the issue stems from the fact that: (1) we decided to skip iterations instead of samples...
The problem is that the current implementation may lead to the same samples being fed through multiple times in a single epoch. I was thinking of sample id being...
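A rough sketch of what skipping by sample id (rather than by iteration) could look like — all names here are hypothetical and do not come from the actual Megatron-DeepSpeed code base:

```python
# Hypothetical sketch: track which sample ids were already consumed in the
# current epoch, so that resuming from a checkpoint never feeds the same
# sample through a second time.
class SkipBySampleIdSampler:
    def __init__(self, dataset_size, consumed_sample_ids=None):
        self.dataset_size = dataset_size
        # ids already fed through in this epoch (restored from a checkpoint)
        self.consumed = set(consumed_sample_ids or [])

    def __iter__(self):
        for sample_id in range(self.dataset_size):
            if sample_id in self.consumed:
                continue  # already seen this epoch: skip, don't re-feed
            self.consumed.add(sample_id)
            yield sample_id


# e.g. resuming after samples 0-2 were consumed: only 3..9 are yielded
sampler = SkipBySampleIdSampler(10, consumed_sample_ids=[0, 1, 2])
print(list(sampler))
```

The point is that the skip bookkeeping lives at the sample-id level, so duplicate feeding within an epoch becomes impossible by construction.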
Well, the current solution sort of works at the point where we want to skip - the main concern is that the same data may be fed through multiple times. So it's hard...
> Seeing as we used jsonl because original Megatron used jsonl, but we can now handle datasets, perhaps we never actually want to use jsonl anymore? This is to be...
I think we currently don't use this; we use ZeRO-DP from DeepSpeed (stage 1). I haven't verified which specific code paths it takes. It might help to step through with the...
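For reference, ZeRO stage 1 is selected via the DeepSpeed config file; a minimal fragment (all other fields omitted, and the surrounding config will of course contain more than this):

```json
{
  "zero_optimization": {
    "stage": 1
  }
}
```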
Here are the instructions added: https://github.com/bigscience-workshop/Megatron-DeepSpeed#deepspeed-pp-and-zero-dp
I'd say it's best to ask upstream at Meg-LM level, as surely they have benchmarked their code. Perhaps @jaredcasper could answer your question.
> Don't you plan to update this repo if the upstream isn't updated? This is not what I meant. I meant to first ask the original authors why they did...
And let's start including the actual benchmark code in these comments so that:
1. others can validate it - it's very easy to make subtle mistakes when writing benchmarks
2. ...
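To make the idea concrete, here is the kind of self-contained snippet one could paste into a comment so others can re-run and validate it. The functions compared are made up for illustration; the point is that the whole benchmark, including the inputs, is reproducible:

```python
# Hypothetical micro-benchmark: paste the whole thing, not just the numbers,
# so reviewers can re-run it and catch subtle benchmarking mistakes.
import timeit


def concat_plus(parts):
    # repeated += on strings: can be quadratic in the worst case
    out = ""
    for p in parts:
        out += p
    return out


def concat_join(parts):
    # single join: linear
    return "".join(parts)


parts = ["x"] * 10_000
# sanity check first: both variants must compute the same result
assert concat_plus(parts) == concat_join(parts)

for fn in (concat_plus, concat_join):
    t = timeit.timeit(lambda: fn(parts), number=100)
    print(f"{fn.__name__}: {t:.4f}s for 100 runs")
```

Including the sanity `assert` alongside the timing loop matters: a benchmark that compares functions computing different things is the most common subtle mistake.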