[chkpt conversion] handle the case where tp=0, should be 1

Open stas00 opened this issue 4 years ago • 2 comments

This PR is trying to fix:

Traceback (most recent call last): 
 File "/gpfswork/rech/six/commun/code/Megatron-DeepSpeed/tools/convert_checkpoint/deepspeed_to_transformers.py", line 83, in <module> 
   main() 
 File "/gpfswork/rech/six/commun/code/Megatron-DeepSpeed/tools/convert_checkpoint/deepspeed_to_transformers.py", line 22, in main 
   ds_checkpoint = DeepSpeedCheckpoint(args.input_folder, args.target_tp, args.target_pp) 
 File "/gpfsssd/worksf/projects/rech/six/commun/code/Megatron-DeepSpeed/tools/convert_checkpoint/deepspeed_checkpoint.py", line 36, in __init__ 
   self.dp_degree = len(self.zero_files) // (self.original_pp_degree * self.original_tp_degree) 
ZeroDivisionError: integer division or modulo by zero

It seems we have original_pp_degree = 0 rather than 1.
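
A minimal sketch of the kind of guard the title describes, assuming the degrees are plain integers derived from the checkpoint's file listing. `compute_dp_degree` is a hypothetical helper for illustration, not a function in deepspeed_checkpoint.py:

```python
def compute_dp_degree(num_zero_files: int, pp_degree: int, tp_degree: int) -> int:
    """Hypothetical helper: derive the data-parallel degree from file counts."""
    # Assumption: a detected degree of 0 means "no parallelism along that
    # axis", so it is treated as 1. This keeps the divisor non-zero.
    pp_degree = max(pp_degree, 1)
    tp_degree = max(tp_degree, 1)
    return num_zero_files // (pp_degree * tp_degree)
```

With this guard, a call such as `compute_dp_degree(8, 0, 2)` divides by 2 instead of raising ZeroDivisionError. Whether silently clamping is the right behaviour depends on why the degree was detected as 0 in the first place.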

stas00, Oct 20 '21 15:10

@thomasw21, please feel free to close this one or build on top of it, either way works.

I think we should also test that original_pp_degree != 0.

stas00, Oct 20 '21 16:10

@stas00 perfect, I'll probably convert all of them to asserts. I've yet to rule out that the checkpoint file is corrupted ...
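
For reference, the assert-based variant could look roughly like this. It is only a sketch: `check_degrees` is a hypothetical name, and the messages are illustrative rather than taken from the repository:

```python
def check_degrees(pp_degree: int, tp_degree: int) -> None:
    """Hypothetical check: fail early with a clear message instead of a
    ZeroDivisionError further down."""
    assert pp_degree > 0, f"original_pp_degree must be > 0, got {pp_degree}"
    assert tp_degree > 0, f"original_tp_degree must be > 0, got {tp_degree}"
```

Unlike clamping to 1, this stops the conversion immediately, which seems preferable if a zero degree may indicate a corrupted or incomplete checkpoint.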

thomasw21, Oct 21 '21 13:10