:bug: fix bug in recovering shared parameters when not all parameters are trainable
This was observed during prompt tuning. However, I am not sure if the solution specified here is the best one.
@ShijieZZZZ fyi @tjruwase
global_step4000.zip This checkpoint fails to load. The current fix works. But not sure if there is a better solution :)
@mayank31398, thanks!
@tjruwase do you think there can be a better solution here though?
The variable `shared_params` contains many parameters that are not actually shared, because of the logic by which it is populated.
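For context, two parameter names refer to a truly shared (tied) parameter only when they resolve to the same underlying storage. A minimal PyTorch sketch of detecting genuinely tied parameters by data pointer (the function and module names here are illustrative, not DeepSpeed's actual population logic):

```python
import torch.nn as nn

def find_tied_parameters(model: nn.Module):
    """Group parameter names by the data pointer of their storage.

    Two names refer to a truly shared parameter only if they point at
    the same memory, so grouping by data_ptr() finds the tied ones and
    nothing else.
    """
    by_ptr = {}
    for module_name, module in model.named_modules():
        for pname, param in module.named_parameters(recurse=False):
            full_name = f"{module_name}.{pname}" if module_name else pname
            by_ptr.setdefault(param.data_ptr(), []).append(full_name)
    return [names for names in by_ptr.values() if len(names) > 1]

class Toy(nn.Module):
    """Toy model with tied input/output embeddings."""
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(10, 4)
        self.ln = nn.LayerNorm(4)            # independent parameter
        self.head = nn.Linear(4, 10, bias=False)
        self.head.weight = self.emb.weight   # weight tying

print(find_tied_parameters(Toy()))  # [['emb.weight', 'head.weight']]
```

Note that the layernorm weight never appears in the result, unlike in the failing checkpoint below.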
@mayank31398, I agree a better solution for handling shared params is needed. Unfortunately, I have not had much bandwidth to think about this. For now, I appreciate fixes such as yours to unblock folks.
> global_step4000.zip This checkpoint fails to load. The current fix works. But not sure if there is a better solution :)
Can you share the stack trace of this failure?
No worries, let's work on a better solution as soon as possible though. :)
> global_step4000.zip This checkpoint fails to load. The current fix works. But not sure if there is a better solution :)
>
> Can you share the stack trace of this failure?
```
Traceback (most recent call last):
  File "/dccstor/mayankgpfs/conda/envs/bloom/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/dccstor/mayankgpfs/conda/envs/bloom/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/dccstor/mayankgpfs/scratch/transformer-engine/src/generate.py", line 91, in <module>
    main()
  File "/dccstor/mayankgpfs/scratch/transformer-engine/src/generate.py", line 85, in main
    model.load_ds_checkpoint(args.load_path)
  File "/dccstor/mayankgpfs/scratch/transformer-engine/src/model.py", line 164, in load_ds_checkpoint
    state = get_fp32_state_dict_from_zero_checkpoint(checkpoint_dir, tag)
  File "/dccstor/mayankgpfs/conda/envs/bloom/lib/python3.8/site-packages/deepspeed/utils/zero_to_fp32.py", line 388, in get_fp32_state_dict_from_zero_checkpoint
    return _get_fp32_state_dict_from_zero_checkpoint(ds_checkpoint_dir)
  File "/dccstor/mayankgpfs/conda/envs/bloom/lib/python3.8/site-packages/deepspeed/utils/zero_to_fp32.py", line 180, in _get_fp32_state_dict_from_zero_checkpoint
    return _get_fp32_state_dict_from_zero3_checkpoint(world_size, param_shapes, fp32_flat_groups, buffers,
  File "/dccstor/mayankgpfs/conda/envs/bloom/lib/python3.8/site-packages/deepspeed/utils/zero_to_fp32.py", line 335, in _get_fp32_state_dict_from_zero3_checkpoint
    state_dict[pair[0]] = state_dict[pair[1]]
KeyError: 'model.base_model.transformer.word_embeddings_layernorm.weight'
1821 examples in test split
```
Like I said, ^^ this `shared_params` gets incorrectly populated here. This key should not be a part of `shared_params`.
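One defensive option at the load site (a sketch of the idea only, not necessarily the exact change in this PR) is to skip any pair whose source key never made it into the recovered state dict, instead of indexing blindly the way the failing line `state_dict[pair[0]] = state_dict[pair[1]]` does:

```python
def apply_shared_params(state_dict, shared_params):
    """Restore tied parameters, tolerating misclassified pairs.

    shared_params is assumed to be a list of (dest_key, src_key) pairs.
    A pair whose src_key is absent was never a real shared parameter in
    this checkpoint, so it is skipped rather than raising KeyError.
    """
    for dest_key, src_key in shared_params:
        if src_key in state_dict:
            state_dict[dest_key] = state_dict[src_key]
    return state_dict

# A missing source key no longer crashes the load:
sd = apply_shared_params(
    {"embed.weight": [1.0]},
    [("head.weight", "embed.weight"),          # genuinely tied
     ("layernorm.weight", "missing.weight")],  # misclassified; skipped
)
print(sorted(sd))  # ['embed.weight', 'head.weight']
```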
@mayank31398, thanks for sharing this stack trace. Are you able to share a repro of misclassified shared params?
@tjruwase I don't think I can share anything more than the checkpoints as zip: https://github.com/microsoft/DeepSpeed/pull/3295#issuecomment-1513800129
You should be able to easily reproduce with this.
Yes, you are correct. The checkpoint is sufficient. Thanks!
Hi @mayank31398, thanks for sharing this. I made one more change here when populating `shared_params`.
@mayank31398, it seems this was subsumed by another merge. Is it okay to close?
Yes, @tjruwase, thanks. Sorry for the late response. Closing this.