Megatron-DeepSpeed
[chkpt conversion] handle the case where tp=0, should be 1
This PR is trying to fix:
Traceback (most recent call last):
  File "/gpfswork/rech/six/commun/code/Megatron-DeepSpeed/tools/convert_checkpoint/deepspeed_to_transformers.py", line 83, in <module>
    main()
  File "/gpfswork/rech/six/commun/code/Megatron-DeepSpeed/tools/convert_checkpoint/deepspeed_to_transformers.py", line 22, in main
    ds_checkpoint = DeepSpeedCheckpoint(args.input_folder, args.target_tp, args.target_pp)
  File "/gpfsssd/worksf/projects/rech/six/commun/code/Megatron-DeepSpeed/tools/convert_checkpoint/deepspeed_checkpoint.py", line 36, in __init__
    self.dp_degree = len(self.zero_files) // (self.original_pp_degree * self.original_tp_degree)
ZeroDivisionError: integer division or modulo by zero
It seems we have original_tp_degree = 0 rather than 1.
@thomasw21, please feel free to close this one or build on top of it, either way works.
I think we should also test that original_pp_degree != 0.
@stas00 perfect, I'll probably convert all of them to asserts. I've yet to rule out that the checkpoint file is corrupted ...
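For illustration, the assert-based approach could look roughly like the sketch below. This is not the code in deepspeed_checkpoint.py: the helper name `compute_dp_degree` and its parameters are hypothetical, and only the division itself is taken from the traceback above.

```python
def compute_dp_degree(num_zero_files: int, pp_degree: int, tp_degree: int) -> int:
    """Hypothetical helper mirroring the failing line in the traceback.

    It fails early with a readable message instead of a bare
    ZeroDivisionError when a parallelism degree was detected as 0.
    """
    # A degree of 0 means detection failed; 1 is the smallest valid value.
    assert pp_degree > 0, f"original_pp_degree must be >= 1, got {pp_degree}"
    assert tp_degree > 0, f"original_tp_degree must be >= 1, got {tp_degree}"
    # Same integer division as in the traceback, now safe from division by zero.
    return num_zero_files // (pp_degree * tp_degree)


if __name__ == "__main__":
    # Made-up example: 8 ZeRO files, pp=2, tp=2.
    print(compute_dp_degree(8, 2, 2))
```

Compared with silently replacing 0 by 1, an assert surfaces the bad value right away, which helps while a corrupted checkpoint has not yet been ruled out.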