inkcherry
Add `MPICHRunner` class. This PR allows users to run DeepSpeed with the MPICH launcher. We verified that it works with Megatron-DeepSpeed in multi-node training.
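As an illustration only (this is not the PR's actual implementation, and the class layout, method names, and constructor arguments below are assumptions), a runner of this kind typically just composes an `mpiexec` command line for the user script:

```python
# Minimal sketch of an MPICH-style runner. Names and signatures are
# illustrative assumptions, not DeepSpeed's real MPICHRunner.
import shlex


class MPICHRunner:
    """Builds an `mpiexec` command line for multi-node launches."""

    def __init__(self, user_script, user_args, hosts, procs_per_node):
        self.user_script = user_script
        self.user_args = user_args
        self.hosts = hosts                  # e.g. ["node1", "node2"]
        self.procs_per_node = procs_per_node

    def get_cmd(self):
        total_procs = len(self.hosts) * self.procs_per_node
        return [
            "mpiexec",
            "-n", str(total_procs),
            "-ppn", str(self.procs_per_node),   # MPICH: processes per node
            "-hosts", ",".join(self.hosts),
            "python", self.user_script,
        ] + self.user_args


runner = MPICHRunner("train.py", ["--deepspeed"], ["node1", "node2"], 8)
print(shlex.join(runner.get_cmd()))
```

On the user side this would presumably be selected through DeepSpeed's existing `--launcher` option (e.g. `deepspeed --launcher MPICH ...`), alongside the other supported multi-node launchers.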
This is an experimental demo of autoTP training, not intended for review. Apologies for the somewhat rough draft; I hope it clarifies the process. Currently, I have tested pure TP (DP=1...
Save the time and memory overhead of maintaining flattened buffers.
Use dp_world_size for grad reduction instead of seq_dp_world_size. Currently, for ZeRO-0, only sparse tensors use the correct world size. Tiny-model grad-norm test with sp=4: grad_norm | step1 | step2...
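A minimal arithmetic sketch of the scaling issue, under my reading of the PR (all names and numbers below are illustrative assumptions): ranks in the same sequence-parallel group hold shards of the same samples rather than independent replicas, so averaging reduced gradients should divide by the data-parallel size only.

```python
# Illustrative only: why dividing by seq_dp_world_size over-scales.
world_size = 16
sp_size = 4                          # sequence-parallel group size
dp_size = world_size // sp_size      # 4 data-parallel replicas

grad_sum = 32.0                      # one grad entry summed across all ranks

wrong = grad_sum / (dp_size * sp_size)   # seq_dp_world_size: too small
right = grad_sum / dp_size               # dp_world_size: correct average
print(wrong, right)                      # 2.0 8.0
```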
Running the script scripts/pretrain.sh directly may raise ModuleNotFoundError: No module named 'llava'. Implement automatic configuration of the Python search path to include the necessary directory. Related issue: https://github.com/haotian-liu/LLaVA/issues/1571
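Whether the fix lands in the shell script (via PYTHONPATH) or in the Python entry point, the effect is the same. A minimal Python-side sketch, assuming the standard LLaVA repo layout where the entry script sits one directory below the repo root that contains the `llava/` package:

```python
# Sketch: make `import llava` work when the script is run directly,
# without requiring `pip install -e .`. Assumes this file lives one
# level below the repository root.
import os
import sys

repo_root = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
if repo_root not in sys.path:
    sys.path.insert(0, repo_root)

import llava  # resolves against the repo root added above
```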
In sequence_parallel (Ulysses), the sequence-parallel size is constrained to evenly divide the number of attention heads, which prevents some models/workloads from setting a specific sequence-parallel...
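A short sketch of the constraint being relaxed here (illustrative names, not the actual Ulysses code): the Ulysses all-to-all re-shards activations from sequence-split to head-split, so each rank must receive a whole number of heads.

```python
# Illustrative check of the head-divisibility constraint in Ulysses-style
# sequence parallelism. Function name is an assumption.
def check_ulysses_sp_size(num_heads: int, sp_size: int) -> int:
    """Return heads per rank, or raise if sp_size doesn't divide heads."""
    if num_heads % sp_size != 0:
        raise ValueError(
            f"num_heads={num_heads} must be divisible by "
            f"sequence-parallel size {sp_size}"
        )
    return num_heads // sp_size


print(check_ulysses_sp_size(32, 8))   # ok: 4 heads per rank
try:
    check_ulysses_sp_size(32, 6)      # 6 does not divide 32
except ValueError as e:
    print(e)
```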
FYI, @hwchen2017
Fix CI hang; improve the unit tests.
WIP; pending perf test.