Baibaifan

Results: 11 issues by Baibaifan

**Describe the bug** There is a problem with asynchronous communication in ZeRO stage 2 when using `overlap_comm` (a configuration sketch follows this entry). **To Reproduce** Steps to reproduce the behavior: use DeepSpeed ZeRO-2 on the Hugging Face...

bug
training
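
The report above concerns the gradient-reduction overlap that ZeRO stage 2 performs when `overlap_comm` is enabled. A minimal configuration sketch, where all concrete values are illustrative assumptions rather than taken from the issue:

```python
# Illustrative DeepSpeed ZeRO stage-2 configuration; values are assumptions.
ds_config = {
    "train_micro_batch_size_per_gpu": "auto",
    "zero_optimization": {
        "stage": 2,
        "overlap_comm": True,          # overlap gradient reduction with the backward pass
        "contiguous_gradients": True,
        "reduce_bucket_size": 5e8,
    },
    "bf16": {"enabled": True},
}

# With the Hugging Face Trainer, such a dict can be passed via
# TrainingArguments(deepspeed=ds_config, ...) or written out as a JSON file.
```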

Hi! I tried to use a PEFT model with Trainer and got the following error; **I used gradient_checkpointing**: ``` RuntimeError: Expected to mark a variable ready only once. This error is caused...
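
This "mark a variable ready only once" failure typically appears when reentrant gradient checkpointing re-runs the forward pass under DistributedDataParallel, so the same parameter's autograd hook fires twice. A minimal sketch of a commonly suggested workaround, assuming transformers >= 4.35; the model name, LoRA settings, and output directory are placeholders:

```python
from transformers import AutoModelForCausalLM, TrainingArguments
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("gpt2")              # placeholder model
model = get_peft_model(model, LoraConfig(task_type="CAUSAL_LM"))  # placeholder LoRA config

args = TrainingArguments(
    output_dir="out",                                   # placeholder
    gradient_checkpointing=True,
    # Non-reentrant checkpointing does not replay the forward pass through the
    # autograd hooks, which is what usually triggers the error under DDP.
    gradient_checkpointing_kwargs={"use_reentrant": False},
)
```

Whether this workaround applies to the reported setup depends on the model and the DDP configuration; it is one common suggestion for this error, not a confirmed fix for this issue.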

Add pipeline tutorials and group_sharded tutorials.

**Describe the bug** With `--use-mcore-models` and `--use-flash-attn` enabled and `--transformer-impl local` set, flash-attention is not actually used. **To Reproduce** N/A **Expected behavior** N/A **Stack trace/logs** N/A **Environment (please complete the following information):**...

**Describe the bug** ![image](https://github.com/NVIDIA/Megatron-LM/assets/39549453/c1e3ea24-e371-4818-9d9f-b916bb34e0fe) As shown in the figure above, `shared_embedding` and the other parameters are treated separately when building the `bucket` (a schematic alignment example follows this entry). When the `data_end_index` of the parameter before `shared_embedding` is not...

stale
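
A schematic sketch of the alignment problem being described, not Megatron-LM's actual bucketing code: if the running `data_end_index` is not rounded up to the alignment unit before the bucket holding `shared_embedding` is started, that bucket begins at a misaligned offset. The 128-element alignment unit below is an assumption for illustration.

```python
def pad_to_multiple(index: int, alignment: int = 128) -> int:
    """Round `index` up to the next multiple of `alignment`."""
    return ((index + alignment - 1) // alignment) * alignment

# Illustration only: if the parameter placed just before shared_embedding ends
# at offset 1000, starting the next bucket at 1000 rather than at
# pad_to_multiple(1000) == 1024 is the kind of misalignment the issue describes.
```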

**Describe the bug** As shown in the figure above, when computing `w1` in this part, using `view` scrambles the element order (a generic illustration follows this entry). ![image](https://github.com/NVIDIA/Megatron-LM/assets/39549453/de68effb-5c77-498e-a656-ec99a45ca5b3) As shown in the figure above, it is...

stale
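
A generic PyTorch illustration of the class of bug being described here, not the Megatron-LM code itself: `view` only reinterprets the existing contiguous memory, so using it where a transpose or permute is required produces a tensor of the right shape but with elements in the wrong positions.

```python
import torch

w = torch.arange(6).reshape(2, 3)   # [[0, 1, 2], [3, 4, 5]]

print(w.view(3, 2))                 # [[0, 1], [2, 3], [4, 5]] -- memory order kept
print(w.t().contiguous())           # [[0, 3], [1, 4], [2, 5]] -- true transpose

# Both results have shape (3, 2), but the element layout differs; substituting
# view for a transpose silently yields the "element confusion" the issue reports.
```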

**Describe the bug** The file format output by `python examples/multimodal/clip_converter.py` does not match the file format required by `examples/multimodal/combine_mistral_clip.sh`: the converter writes `xxx\state_dict_tp_x.pt`, not the expected `xxx/iter_0000001/mp_rank_00/model_optim_rng.pt`. **To Reproduce** - **Expected behavior** File format...

stale

# Problem description The file format output by `python examples/multimodal/clip_converter.py` does not match the file format required by `examples/multimodal/combine_mistral_clip.sh`. [bug issue](https://github.com/NVIDIA/Megatron-LM/issues/949) # After fix Under the original configuration, the conversion...

Support Packed_seq_params in Megatron-LM, just for testing.

### Problem: In Megatron-LM, there is a memory bottleneck when using the reset attention mask to construct long sequences. The relevant code (`_get_ltor_masks_and_position_ids`): ![image](https://github.com/user-attachments/assets/42aa2748-7b20-4c00-a565-f66f780919a8) When a seq_len consists of multiple...
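
A sketch of the packed-sequence idea referred to above: instead of materializing a dense `(seq_len, seq_len)` attention mask after every document reset, the document boundaries can be encoded as cumulative sequence lengths, which is what variable-length flash-attention kernels (and Megatron-core's `PackedSeqParams`) consume. The helper and the example lengths below are illustrative, not the PR's code.

```python
import torch

def build_cu_seqlens(doc_lengths):
    """Cumulative sequence lengths for a packed sequence, e.g. [3, 5, 4] -> [0, 3, 8, 12]."""
    return torch.cumsum(torch.tensor([0] + list(doc_lengths)), dim=0, dtype=torch.int32)

print(build_cu_seqlens([3, 5, 4]))  # tensor([ 0,  3,  8, 12], dtype=torch.int32)

# Memory comparison for a 32k-token packed sequence: a boolean (32768, 32768)
# mask takes 32768**2 bytes (1 GiB) per sample, while cu_seqlens takes only
# num_docs + 1 int32 values -- the kind of bottleneck the report describes.
```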