inkcherry

Results 11 issues of inkcherry

Add `MPICHRunner` class. This PR allows users to run DeepSpeed with the MPICH launcher. We verified that it works with Megatron-DeepSpeed in multi-node training.

This is an experimental demo of autoTP training, not for review. Apologies for the somewhat rudimentary draft; I hope to elucidate this process. Currently, I tested pure TP (DP=1...

Save the time and memory overhead of maintaining flattened buffers.

Use dp_world_size for grad reduction instead of seq_dp_world_size. Currently, for zero0, only sparse tensors use the correct world_size. Tiny model with sp=4, grad norm test: grad_norm | step1 | step2...

When running the script scripts/pretrain.sh directly, a `ModuleNotFoundError: No module named 'llava'` may occur. This change automatically configures the Python search path to include the necessary directory. Related issue: https://github.com/haotian-liu/LLaVA/issues/1571
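A minimal sketch of the general technique, not the code from this PR: the entry script prepends an ancestor directory to `sys.path` before importing the package. The helper name `add_repo_root_to_path` and the `levels_up` parameter are hypothetical, and how many levels to climb depends on where the entry script sits in the repository, which the description above does not state.

```python
import sys
from pathlib import Path


def add_repo_root_to_path(script_file: str, levels_up: int = 1) -> str:
    """Prepend an ancestor directory of `script_file` to sys.path.

    `levels_up=0` is the directory containing the file, `levels_up=1`
    is its parent, and so on. (Hypothetical helper for illustration.)
    """
    root = str(Path(script_file).resolve().parents[levels_up])
    # Avoid stacking duplicate entries when the helper is called twice.
    if root not in sys.path:
        sys.path.insert(0, root)
    return root
```

Such a helper would typically be called as `add_repo_root_to_path(__file__, levels_up=...)` at the top of the entry script, before the first `import llava`. Installing the package in editable mode or setting `PYTHONPATH` are alternatives that avoid touching `sys.path` at runtime.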

In sequence_parallel (Ulysses), the number of attention heads must be divisible by the sequence parallel size, which prevents some models/workloads from setting a specific sequence parallel...
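To illustrate the divisibility constraint and one way it could be relaxed, here is a small sketch that assigns attention heads to sequence-parallel ranks even when the split is uneven. The function name `split_heads` and the "give the remainder to the lowest ranks" policy are assumptions for illustration; the truncated description above does not say which scheme the PR actually uses.

```python
def split_heads(num_heads: int, sp_size: int) -> list[int]:
    """Return the number of attention heads assigned to each rank.

    Hypothetical policy: every rank gets num_heads // sp_size heads,
    and the first (num_heads % sp_size) ranks get one extra head.
    """
    if sp_size <= 0:
        raise ValueError("sp_size must be positive")
    if num_heads < sp_size:
        raise ValueError("need at least one head per rank")
    base, extra = divmod(num_heads, sp_size)
    return [base + 1 if rank < extra else base for rank in range(sp_size)]


# Even case: 32 heads divide cleanly across 4 ranks.
print(split_heads(32, 4))  # [8, 8, 8, 8]
# Uneven case: 10 heads across 4 ranks.
print(split_heads(10, 4))  # [3, 3, 2, 2]
```

With an uneven split, ranks hold different numbers of heads, so any all-to-all exchange over the head dimension would need per-rank sizes rather than a single uniform chunk size.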

FYI, @hwchen2017

Fix CI hang. Improve the unit tests.