
Add `_validate_args`

Open bhavitvyamalik opened this issue 3 years ago • 6 comments

Closes #124. I found that a few of the checks mentioned in sanity-checks.md were already being done in parse_args, like NHIDDEN % NHEADS == 0 and GLOBAL_BATCH_SIZE % (MICRO_BATCH_SIZE * DP_SIZE) == 0, so I've added the remaining ones (it's a WIP for the JSON DS config).
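For context, a minimal sketch of the two divisibility checks named above. This is a hypothetical illustration with made-up function and parameter names; the real checks live in megatron/arguments.py with different names and error messages:

```python
# Hypothetical sketch of the two divisibility checks discussed above.
# Names (validate_divisibility, hidden_size, dp_size, ...) are illustrative,
# not the actual identifiers used in megatron/arguments.py.
def validate_divisibility(hidden_size, num_heads,
                          global_batch_size, micro_batch_size, dp_size):
    """Raise AssertionError if the basic size constraints don't hold."""
    # NHIDDEN % NHEADS == 0: each attention head gets an equal slice
    # of the hidden dimension.
    assert hidden_size % num_heads == 0, (
        f"hidden_size ({hidden_size}) must be divisible by "
        f"num_heads ({num_heads})")
    # GLOBAL_BATCH_SIZE % (MICRO_BATCH_SIZE * DP_SIZE) == 0: the number of
    # gradient accumulation steps must come out to a whole number.
    assert global_batch_size % (micro_batch_size * dp_size) == 0, (
        f"global_batch_size ({global_batch_size}) must be divisible by "
        f"micro_batch_size * dp_size ({micro_batch_size * dp_size})")

# Example with plausible large-model sizes: 12288 / 96 = 128 per head,
# 2048 / (2 * 64) = 16 accumulation steps, so both checks pass.
validate_divisibility(hidden_size=12288, num_heads=96,
                      global_batch_size=2048, micro_batch_size=2, dp_size=64)
```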

bhavitvyamalik avatar Jan 18 '22 19:01 bhavitvyamalik

Also, a few lines like https://github.com/bigscience-workshop/Megatron-DeepSpeed/blob/8fea4ea68620d75f2406264408f67016ffe2be32/megatron/arguments.py#L196-L205 have asserts that are followed by assigning values to arguments. All such code will stay in parse_args.

bhavitvyamalik avatar Jan 18 '22 19:01 bhavitvyamalik

Could you please flag in the diff where the actual new additions are?

It's hard to review such changes when things are moved around - I'm sure I missed that, but I only see the same code that was just moved elsewhere.

Thank you!

(p.s. also, please tag some of us on PRs. I personally don't have notifications for all PRs, so I check only occasionally unless someone tags me explicitly.)

stas00 avatar Jan 18 '22 20:01 stas00

Tbh there's not much that I got done in this PR :/

Most of the checks were already implemented in parse_args. I even checked this and found that many of their checks are already implemented in the arguments file.

I think we should move all assert statements into _validate_args, and the code that assigns values to arguments can stay in parse_args.
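To illustrate the proposed split, here is a minimal, hypothetical sketch (all names and defaults are made up for illustration, not the actual Megatron-DeepSpeed code): derived-value assignments stay in parse_args, while pure consistency asserts move into _validate_args:

```python
# Hypothetical sketch of the proposed parse_args / _validate_args split.
# Argument names and defaults are illustrative only.
import argparse

def parse_args(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument("--hidden-size", type=int, default=1024)
    parser.add_argument("--num-attention-heads", type=int, default=16)
    parser.add_argument("--kv-channels", type=int, default=None)
    args = parser.parse_args(argv)
    # Assignments that derive argument values stay in parse_args,
    # like the kv_channels default computed from other arguments.
    if args.kv_channels is None:
        args.kv_channels = args.hidden_size // args.num_attention_heads
    return _validate_args(args)

def _validate_args(args):
    # Pure consistency checks (asserts with no side effects) live here.
    assert args.hidden_size % args.num_attention_heads == 0, (
        "hidden size must be divisible by the number of attention heads")
    return args

args = parse_args([])
print(args.kv_channels)
```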

bhavitvyamalik avatar Jan 19 '22 19:01 bhavitvyamalik

I think the main issue wasn't so much that Megatron didn't have the checks. It's that we wanted to get the constraints right before launching Megatron. This is because the SLURM environment is unforgiving when you make a mistake and the program doesn't start: off you go to the end of the queue. So the point of this doc https://github.com/bigscience-workshop/bigscience/blob/master/train/sanity-checks.md was to prevent that situation. And you confirmed that most of these have already been covered, so there's nothing to add there.

Now, the specific point of this issue/request, https://github.com/bigscience-workshop/Megatron-DeepSpeed/issues/124, was to see whether NeoX had added any checks beyond the already existing ones here: https://github.com/EleutherAI/gpt-neox/blob/main/megatron/neox_arguments/arguments.py

So I don't think this PR addresses that particular issue so far.

As I said earlier, perhaps this is just fine, since we haven't really been running into trouble recently. Perhaps we more or less know the specific needs of this framework by now.

stas00 avatar Jan 19 '22 19:01 stas00

I'll go through the file again and mention any additional checks not yet implemented here. Also, I realised that a few of the checks already covered are only performed after certain conditions are satisfied, which might (~1% chance) go wrong. Explicitly checking them shouldn't be a problem, I think? If we plan to exclude them, then out of the 4 checks, 2 will still be required, i.e. 3 and 5.

bhavitvyamalik avatar Jan 19 '22 20:01 bhavitvyamalik

Please don't remove any existing checks. The whole purpose of this exercise was to see if we could glean additional checks that were added in NeoX, not to re-arrange or remove the current ones.

Everything else that's already in place should remain.

Additionally, please remember that ideally we try not to re-org any code or even change formatting. This is because it makes it much easier to re-sync with the 2 upstreams when they add fixes and features.

So unless this is really needed for now let's not do any re-orgs.

If you feel that there is nothing to be gained from NeoX's additions let's just leave this as it is.

stas00 avatar Jan 19 '22 20:01 stas00