Stas Bekman

Results 664 comments of Stas Bekman

I think it may be pretty essential should we run into instabilities during 200B training. It hasn't been resolved or fallen out of relevance. It's waiting for your magic touch,...

I didn't validate every clause you mentioned, but your diagnosis sounds correct, Jaesung. Perhaps the issue stems from the fact that: (1) we decided to skip iterations instead of samples...

The problem is that the current implementation may potentially lead to the same samples being fed through multiple times in a single epoch. I was thinking of sample id being...

Well, the current solution is sort of working at the point of wanting to skip - the main concern is potentially multiple feedings of the same data. So it's hard...

> Seeing as we used jsonl because original Megatron used jsonl, but we can now handle datasets, perhaps we never actually want to use jsonl anymore? This is to be...

I think we currently don't use this; we use ZeRO-DP from DeepSpeed (stage 1) instead. I haven't verified which specific code paths it takes. It might help to step through with the...

Here are the instructions added: https://github.com/bigscience-workshop/Megatron-DeepSpeed#deepspeed-pp-and-zero-dp

I'd say it's best to ask upstream at the Megatron-LM level, as they have surely benchmarked their code. Perhaps @jaredcasper could answer your question.

> Don't you plan to update this repo if the upstream isn't updated?

This is not what I meant. I meant to first ask the original authors why they did...

And let's start including the actual benchmark code in these comments so that:

1. others can validate it - it's very easy to make subtle mistakes when writing benchmarks
2....
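As an illustration of the kind of self-contained benchmark snippet that is easy for others to re-run and check, here is a minimal sketch in plain Python. It is not taken from the thread: `benchmark` and `workload` are hypothetical names, and `workload` is only a CPU stand-in for the code under test. For GPU code the timed function would also need to synchronize the device before the clock is read, since CUDA kernels launch asynchronously.

```python
import statistics
import time


def benchmark(fn, *, warmup=3, repeats=10):
    """Call fn() `warmup` times untimed, then return `repeats` timings in seconds."""
    # Warmup runs absorb one-time costs (caches, lazy initialization).
    for _ in range(warmup):
        fn()

    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return timings


def workload():
    # Hypothetical stand-in for the code being measured.
    return sum(i * i for i in range(10_000))


if __name__ == "__main__":
    timings = benchmark(workload)
    # Report several statistics so readers can judge the run-to-run noise.
    print(
        f"runs={len(timings)} "
        f"median={statistics.median(timings) * 1e3:.3f} ms "
        f"min={min(timings) * 1e3:.3f} ms "
        f"max={max(timings) * 1e3:.3f} ms"
    )
```

Posting the warmup count, the repeat count, and the raw statistics together with the code lets a reader reproduce the measurement instead of trusting a single number.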