inkcherry
> @inkcherry Is there a link to the demo code? I'm interested in the potential use case of this feature proposal. Hi @delock, FYI: https://github.com/inkcherry/stanford_alpaca/tree/tp_demo; see the latest commit msg. Due...
> @inkcherry, thanks for this PR. Are you able to provide some observed memory and latency benefits? Hi @tjruwase, I used a setup of 4xA800 80G with PyTorch version...
> I'm unable to get this to work. > > First I run: `bash run.sh zero2` (all of the options fail with the same error) > > ``` > Time...
@hwchen2017 just a reminder in case you missed this~ thanks.
> @inkcherry, thanks for the quick PR. I have a few questions > > 1. It seems this PR is a workaround using `reuse_dist_env=False` rather than fixing autotp itself. Is...
FYI @delock @Yejing-Lai
Could you try with ```replace_with_kernel_inject=False```?
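A minimal sketch of where that flag would go, assuming a DeepSpeed inference setup; the model name and tp size below are placeholders, not taken from the issue:

```python
import torch
import deepspeed
from transformers import AutoModelForCausalLM

# Hypothetical checkpoint -- substitute the model you are actually loading.
model = AutoModelForCausalLM.from_pretrained("facebook/opt-1.3b", torch_dtype=torch.float16)

# Keep the original module structure instead of swapping in fused inference kernels.
engine = deepspeed.init_inference(
    model,
    tensor_parallel={"tp_size": 2},    # assumption: match your launch world size
    dtype=torch.float16,
    replace_with_kernel_inject=False,  # the flag suggested above
)
```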
Hi @Peter-Chou, I gave it a try and it works correctly. Here's my list of checkpoint files; it looks like yours is missing some content compared to mine. It...
Hi @cynricfu, thanks for the report. This is likely due to the Transformer display logic using ```total_batch_size``` without accounting for ```dp_world_size != world_size```. You can ignore it for now —...
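A rough sketch of the arithmetic behind that remark, assuming a TP+DP layout where only the data-parallel replicas multiply the batch; all numbers below are illustrative, not from the report:

```python
# Illustrative numbers only.
world_size = 8                           # total ranks in the launch
tp_size = 2                              # ranks sharing one model replica
dp_world_size = world_size // tp_size    # 4 data-parallel replicas

micro_batch_size = 4
grad_accum_steps = 2

# What the display logic reportedly computes (scales by world_size):
displayed_total = micro_batch_size * grad_accum_steps * world_size      # 64

# What is actually consumed per optimizer step (scales by dp_world_size):
effective_total = micro_batch_size * grad_accum_steps * dp_world_size   # 32
```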
Hi, @hijkzzz, glad to see you're interested in this. This setup is TP-first, for example, with 4 ranks (0,1,2,3) and tp_size=2. So: [0,1] and [2,3] are TP groups, [0,2] and...
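A small sketch of how such a TP-first layout can be enumerated, purely as an illustration of the grouping described above (the actual group creation in the PR may differ):

```python
def build_tp_first_groups(world_size: int, tp_size: int):
    """Contiguous ranks share a TP group; ranks at the same offset
    across TP groups form the DP groups."""
    assert world_size % tp_size == 0
    dp_size = world_size // tp_size

    # TP groups: consecutive ranks, e.g. [0,1], [2,3] for world_size=4, tp_size=2
    tp_groups = [list(range(i * tp_size, (i + 1) * tp_size)) for i in range(dp_size)]

    # DP groups: ranks with the same position inside their TP group
    dp_groups = [list(range(j, world_size, tp_size)) for j in range(tp_size)]
    return tp_groups, dp_groups

print(build_tp_first_groups(4, 2))
# ([[0, 1], [2, 3]], [[0, 2], [1, 3]])
```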