:bug: fix bug in recovering shared parameters when not all parameters are trainable
This was observed during prompt tuning. However, I am not sure if the solution specified here is the best one.
@ShijieZZZZ fyi @tjruwase
global_step4000.zip This checkpoint fails to load. The current fix works. But not sure if there is a better solution :)
@mayank31398, thanks!
@tjruwase do you think there can be a better solution here though?
The variable `shared_params` contains many parameters that are not actually shared, because of the logic by which it is populated.
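For context, two parameter names refer to a truly shared (tied) parameter only when they resolve to the same underlying storage. A minimal PyTorch sketch of detecting genuinely tied parameters by data pointer (the function and module names here are illustrative, not DeepSpeed's actual population logic):

```python
import torch.nn as nn

def find_tied_parameters(model: nn.Module):
    """Group parameter names by the data pointer of their storage.

    Two names refer to a truly shared parameter only if they point at
    the same memory, so grouping by data_ptr() finds the tied ones and
    nothing else.
    """
    by_ptr = {}
    for module_name, module in model.named_modules():
        for pname, param in module.named_parameters(recurse=False):
            full_name = f"{module_name}.{pname}" if module_name else pname
            by_ptr.setdefault(param.data_ptr(), []).append(full_name)
    return [names for names in by_ptr.values() if len(names) > 1]

class Toy(nn.Module):
    """Toy model with tied input/output embeddings."""
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(10, 4)
        self.ln = nn.LayerNorm(4)            # independent parameter
        self.head = nn.Linear(4, 10, bias=False)
        self.head.weight = self.emb.weight   # weight tying

print(find_tied_parameters(Toy()))  # [['emb.weight', 'head.weight']]
```

Note that the layernorm weight never appears in the result, unlike in the failing checkpoint below.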
@mayank31398, I agree a better solution for handling shared params is needed. Unfortunately, I have not had much bandwidth to think about this. For now, I appreciate fixes such as yours to unblock folks.
> global_step4000.zip This checkpoint fails to load. The current fix works. But not sure if there is a better solution :)
Can you share the stack trace of this failure?
No worries, let's work on a better solution as soon as possible though. :)
> global_step4000.zip This checkpoint fails to load. The current fix works. But not sure if there is a better solution :)
>
> Can you share the stack trace of this failure?
```
Traceback (most recent call last):
  File "/dccstor/mayankgpfs/conda/envs/bloom/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/dccstor/mayankgpfs/conda/envs/bloom/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/dccstor/mayankgpfs/scratch/transformer-engine/src/generate.py", line 91, in <module>
    main()
  File "/dccstor/mayankgpfs/scratch/transformer-engine/src/generate.py", line 85, in main
    model.load_ds_checkpoint(args.load_path)
  File "/dccstor/mayankgpfs/scratch/transformer-engine/src/model.py", line 164, in load_ds_checkpoint
    state = get_fp32_state_dict_from_zero_checkpoint(checkpoint_dir, tag)
  File "/dccstor/mayankgpfs/conda/envs/bloom/lib/python3.8/site-packages/deepspeed/utils/zero_to_fp32.py", line 388, in get_fp32_state_dict_from_zero_checkpoint
    return _get_fp32_state_dict_from_zero_checkpoint(ds_checkpoint_dir)
  File "/dccstor/mayankgpfs/conda/envs/bloom/lib/python3.8/site-packages/deepspeed/utils/zero_to_fp32.py", line 180, in _get_fp32_state_dict_from_zero_checkpoint
    return _get_fp32_state_dict_from_zero3_checkpoint(world_size, param_shapes, fp32_flat_groups, buffers,
  File "/dccstor/mayankgpfs/conda/envs/bloom/lib/python3.8/site-packages/deepspeed/utils/zero_to_fp32.py", line 335, in _get_fp32_state_dict_from_zero3_checkpoint
    state_dict[pair[0]] = state_dict[pair[1]]
KeyError: 'model.base_model.transformer.word_embeddings_layernorm.weight'
1821 examples in test split
```
Like I said, ^^ this `shared_params` gets incorrectly populated here. This key should not be a part of `shared_params`.
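One defensive option at the load site (a sketch of the idea only, not necessarily the exact change in this PR) is to skip any pair whose source key never made it into the recovered state dict, instead of indexing blindly the way the failing line `state_dict[pair[0]] = state_dict[pair[1]]` does:

```python
def apply_shared_params(state_dict, shared_params):
    """Restore tied parameters, tolerating misclassified pairs.

    shared_params is assumed to be a list of (dest_key, src_key) pairs.
    A pair whose src_key is absent was never a real shared parameter in
    this checkpoint, so it is skipped rather than raising KeyError.
    """
    for dest_key, src_key in shared_params:
        if src_key in state_dict:
            state_dict[dest_key] = state_dict[src_key]
    return state_dict

# A missing source key no longer crashes the load:
sd = apply_shared_params(
    {"embed.weight": [1.0]},
    [("head.weight", "embed.weight"),          # genuinely tied
     ("layernorm.weight", "missing.weight")],  # misclassified; skipped
)
print(sorted(sd))  # ['embed.weight', 'head.weight']
```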
@mayank31398, thanks for sharing this stack trace. Are you able to share a repro of misclassified shared params?
@tjruwase I don't think I can share anything more than the checkpoints as zip: https://github.com/microsoft/DeepSpeed/pull/3295#issuecomment-1513800129
You should be able to easily reproduce with this.
Yes, you are correct. The checkpoint is sufficient. Thanks!
Hi @mayank31398, thanks for sharing this. I made one more change here when populating `shared_params`.
@mayank31398, it seems this was subsumed by another merge. Is it okay to close?
Yes, @tjruwase, thanks. Sorry for the late response. Closing this.