yghong

Results 3 issues of yghong

**Your question:** It seems that B's timing includes W, while W merely accounts for the time of gradient accumulation. In the `megatron/core/pipeline_parallel/zb_schedules.py` file, the function `schedule_b` counts the duration of...

This PR fixes this [issue](https://github.com/stanford-futuredata/megablocks/issues/80): when using bf16 and attempting to load pretrained model weights to continue training, the weights in the optimizer will not be reset to...

While working with the `load_checkpoint` function in the file `third_party/Megatron-LM/megatron/checkpointing.py`, I noticed that the condition on [line 585](https://github.com/stanford-futuredata/Megatron-LM/blob/3a9e3d8de308e6f6398b59d16a8bd7177374f121/megatron/checkpointing.py#L585):

```
if args.fp16 and optimizer is not None:
```

should be modified...
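To make the concern concrete, here is a minimal, self-contained sketch of how the quoted condition behaves for fp16 and bf16 runs. The helper names `quoted_condition` and `hypothetical_condition` and the `SimpleNamespace` stand-ins are illustrative assumptions and not part of Megatron-LM. The issue text is truncated before it states the proposed change, so the bf16-aware variant below is only one plausible reading, not the confirmed fix.

```python
from types import SimpleNamespace


def quoted_condition(args, optimizer):
    """The condition exactly as quoted in the issue."""
    return bool(args.fp16 and optimizer is not None)


def hypothetical_condition(args, optimizer):
    """Assumed variant that also covers bf16 (not confirmed by the issue text)."""
    return bool((args.fp16 or args.bf16) and optimizer is not None)


# Stand-ins for the real argument namespace and optimizer object.
optimizer = object()
fp16_args = SimpleNamespace(fp16=True, bf16=False)
bf16_args = SimpleNamespace(fp16=False, bf16=True)

for name, args in (("fp16", fp16_args), ("bf16", bf16_args)):
    print(
        name,
        "quoted:", quoted_condition(args, optimizer),
        "hypothetical:", hypothetical_condition(args, optimizer),
    )
```

With the quoted condition, a bf16 run never enters the guarded branch, which is consistent with the symptom described in the linked megablocks issue.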

bug