Stas Bekman
I think it may be pretty essential should we run into instabilities during 200B training. It hasn't been resolved or fallen out of relevance. It's waiting for your magic touch,...
I didn't validate every clause you mentioned, but your diagnostics sound correct, Jaesung. Perhaps the issue stems from the fact that: (1) we decided to skip iterations instead of samples...
The problem is that the current implementation may lead to the same samples being fed through multiple times in a single epoch. I was thinking of sample id being...
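A rough sketch of what skipping by sample id (rather than by iteration) could look like — all names here are hypothetical and do not come from the actual Megatron-DeepSpeed code base:

```python
# Hypothetical sketch: track which sample ids were already consumed in the
# current epoch, so that resuming from a checkpoint never feeds the same
# sample through a second time.
class SkipBySampleIdSampler:
    def __init__(self, dataset_size, consumed_sample_ids=None):
        self.dataset_size = dataset_size
        # ids already fed through in this epoch (restored from a checkpoint)
        self.consumed = set(consumed_sample_ids or [])

    def __iter__(self):
        for sample_id in range(self.dataset_size):
            if sample_id in self.consumed:
                continue  # already seen this epoch: skip, don't re-feed
            self.consumed.add(sample_id)
            yield sample_id


# e.g. resuming after samples 0-2 were consumed: only 3..9 are yielded
sampler = SkipBySampleIdSampler(10, consumed_sample_ids=[0, 1, 2])
print(list(sampler))
```

The point is that the skip bookkeeping lives at the sample-id level, so duplicate feeding within an epoch becomes impossible by construction.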
Well, the current solution sort of works at the point where we want to skip - the main concern is that the same data may be fed through multiple times. So it's hard...
> Seeing as we used jsonl because original Megatron used jsonl, but we can now handle datasets, perhaps we never actually want to use jsonl anymore? This is to be...
I think we currently don't use this; we use ZeRO-DP from DeepSpeed (stage 1). I haven't verified which specific code paths it takes. It might help to step through with the...
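For reference, ZeRO stage 1 is selected via the DeepSpeed config file; a minimal fragment (all other fields omitted, and the surrounding config will of course contain more than this):

```json
{
  "zero_optimization": {
    "stage": 1
  }
}
```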
Here are the instructions added: https://github.com/bigscience-workshop/Megatron-DeepSpeed#deepspeed-pp-and-zero-dp
I'd say it's best to ask upstream at Meg-LM level, as surely they have benchmarked their code. Perhaps @jaredcasper could answer your question.
> Don't you plan to update this repo if the upstream isn't updated? This is not what I meant. I meant to first ask the original authors why they did...
And let's start including the actual benchmark code in these comments so that:
1. others can validate it - it's very easy to make subtle mistakes when writing benchmarks
2. ...
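To make the idea concrete, here is the kind of self-contained snippet one could paste into a comment so others can re-run and validate it. The functions compared are made up for illustration; the point is that the whole benchmark, including the inputs, is reproducible:

```python
# Hypothetical micro-benchmark: paste the whole thing, not just the numbers,
# so reviewers can re-run it and catch subtle benchmarking mistakes.
import timeit


def concat_plus(parts):
    # repeated += on strings: can be quadratic in the worst case
    out = ""
    for p in parts:
        out += p
    return out


def concat_join(parts):
    # single join: linear
    return "".join(parts)


parts = ["x"] * 10_000
# sanity check first: both variants must compute the same result
assert concat_plus(parts) == concat_join(parts)

for fn in (concat_plus, concat_join):
    t = timeit.timeit(lambda: fn(parts), number=100)
    print(f"{fn.__name__}: {t:.4f}s for 100 runs")
```

Including the sanity `assert` alongside the timing loop matters: a benchmark that compares functions computing different things is the most common subtle mistake.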