Megatron-DeepSpeed

Need model size dumped at init

Open stas00 opened this issue 2 years ago • 5 comments

We need a diagnostic dump of the total model size during framework init. We currently get a per-rank report, not the total:

 > number of parameters on (tensor, pipeline) model parallel rank (0, 1): 1745293312
 > number of parameters on (tensor, pipeline) model parallel rank (2, 1): 1745293312
 > number of parameters on (tensor, pipeline) model parallel rank (3, 0): 1986465792
 > number of parameters on (tensor, pipeline) model parallel rank (3, 7): 1986498560

Later on, the ZeRO engine does dump the right number, but it's buried among multiple other numbers and repeated on each rank:

[2021-10-02 16:08:53,028] [INFO] [engine.py:134:__init__] RANK=0 STAGE=0 LAYERS=7 [0, 7) STAGE_PARAMS=1986465792 (1986.466M) TOTAL_PARAMS=57778896896 (57778.897M) UNIQUE_PARAMS=56814206976 (56814.207M)

But ideally we just want a print like:

Model size: 57B (57778896896 params)

Just on rank 0.
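
For concreteness, here is a minimal sketch (not existing Meg-DS code) of the kind of formatting meant, with the total hard-coded purely for illustration:

# Hypothetical helper, just to show the desired output format.
def human_model_size(n_params: int) -> str:
    # Render a raw parameter count as a rough human-readable size,
    # e.g. 57778896896 -> "57B (57778896896 params)".
    return f"{n_params // 10**9}B ({n_params} params)"

print(f"Model size: {human_model_size(57778896896)}")
# -> Model size: 57B (57778896896 params)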

Thanks.

stas00 avatar Oct 04 '21 00:10 stas00

I think I can try to take this issue. However, I need to know: what do you do to get the diagnostics dump?

jtboing avatar Oct 05 '21 17:10 jtboing

Also, does the dump happen when starting the workflows?

jtboing avatar Oct 05 '21 17:10 jtboing

Thank you for offering to work on this, @jtboing

We, the BigScience group, haven't added anything for this functionality yet, so it's totally up to you how you do it. Please have a look at the various info logged during Meg-DS startup and add it where you feel is right. Probably the best place is where the model is created, since you can then easily query the params.

I don't think it really matters where, other than that we could easily grep for something like:

grep "Model size" log.txt

here is my cheatsheet if it helps:

# calculate the number of parameters:
#
# 1. count all params
sum(p.numel() for p in model.parameters())
#
# 2. avoid double counting shared parameters (only needed when params share the same storage(); normally tied params aren't an issue, since model.parameters() doesn't return duplicate Parameter objects)
sum(dict((p.data_ptr(), p.numel()) for p in model.parameters()).values())
#
# 3. count only the trainable parameters:
pytorch_total_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
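
And a hedged sketch of how those counts could feed the single rank-0 print requested above. It assumes `model` is the locally built model and `mp_group` is a process group covering one full copy of the model (the tensor- and pipeline-parallel ranks of a single data-parallel replica); the names are illustrative, not the actual Meg-DS API:

import torch
import torch.distributed as dist

def report_total_model_size(model, mp_group=None):
    # Params held by this rank, de-duplicated by storage pointer (case 2 above).
    local = sum(dict((p.data_ptr(), p.numel()) for p in model.parameters()).values())

    # Sum the shards across one model-parallel replica only; reducing over
    # all ranks would multiply the result by the data-parallel degree.
    # Assumes torch.distributed is initialized with NCCL and CUDA is available.
    total = torch.tensor(local, dtype=torch.long, device=torch.cuda.current_device())
    dist.all_reduce(total, op=dist.ReduceOp.SUM, group=mp_group)

    if dist.get_rank() == 0:
        n = total.item()
        print(f"Model size: {n // 10**9}B ({n} params)")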

stas00 avatar Oct 05 '21 19:10 stas00

Hello. Sorry that this hasn't been done sooner; I am trying to get through it now. I am looking for the Meg-DS startup script/process. Can you point me to which script/process initiates the framework init?

jtboing avatar Nov 25 '21 20:11 jtboing

We have already started sorting it out here: https://github.com/bigscience-workshop/Megatron-DeepSpeed/pull/204 (as a side effect of another need).

stas00 avatar Nov 25 '21 20:11 stas00