
'gamma', 'theta' not found in progressive layer drop

Open marchen00 opened this issue 3 years ago • 3 comments

Hi! Thank you guys for the tool and the example. I've been trying to reproduce 'progressive layer dropping' on RoBERTa and other pretraining methods, but I couldn't find where gamma and theta, which are stated in deepspeed_bsz4k_progressive_layer_drop_config_seq128.json, are used in the project.

For 'theta', I found the code at nvidia/modelingpreln_layerdrop.py line 1160, `theta = kwargs.get('pld_theta', 1.0)`, but the key there is 'pld_theta' rather than 'theta'.

For 'gamma', which should control the drop-rate scheduling, I couldn't find it used anywhere.

Please kindly let me know if I missed anything. Thank you very much.

marchen00 avatar Mar 14 '22 08:03 marchen00


I am also facing this problem. When running SQuAD, I get the error `unexpect scope output.LayerNorm name in transformer layer.` in `load_hf_weights_in_bert_kernel`. I then added the code `elif name_str.find("output.LayerNorm") > 0: logger.info("Ignore Huggingface weight {} with shape {}".format(name_str, array.shape)) continue`, but after that, NaN or Inf values appear during training. Have you solved this problem?

FlyingCat-fa avatar Apr 16 '22 05:04 FlyingCat-fa

@FatCockHu, can you please open a separate ticket for your error? Thanks!

tjruwase avatar Apr 18 '22 13:04 tjruwase

@marchen00, the PLD implementation is split between the DeepSpeed engine and the client. In particular, DeepSpeed maintains the theta and gamma values here, and with this logic makes them available for the client's forward pass, as you highlighted. So 'theta' and 'gamma' from the JSON config are consumed by the engine's drop-rate scheduler, which then passes the current keep probability to the model as the 'pld_theta' kwarg you found. If you have not done so already, it might be helpful to go through the associated tutorial.
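To make the split concrete, here is a minimal sketch of the drop-rate schedule the engine maintains. This is an illustrative reimplementation based on the Progressive Layer Dropping paper's schedule, not the exact DeepSpeed source; the function name `pld_keep_prob` is hypothetical. The keep probability starts at 1.0 (no layers dropped) and decays toward the configured `theta` at a rate set by `gamma` as the global step increases:

```python
import math

def pld_keep_prob(step, theta=0.5, gamma=0.001):
    # Hypothetical sketch of the PLD schedule: the keep probability
    # decays from 1.0 toward the floor `theta`, with `gamma` controlling
    # how quickly it decays over training steps.
    return (1.0 - theta) * math.exp(-gamma * step) + theta

# At step 0 nothing is dropped; later steps approach the floor theta.
print(pld_keep_prob(0))       # 1.0
print(pld_keep_prob(5000))    # between theta and 1.0, closer to theta
```

The engine would evaluate something like this each step and hand the result to the model, which is why the client-side code only ever sees `kwargs.get('pld_theta', 1.0)` rather than 'theta' or 'gamma' directly.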

@minjiaz, FYI

tjruwase avatar Apr 18 '22 13:04 tjruwase