Ma, Guokai
@inkcherry is this PR still active? There are merge conflicts.
> @delock, @inkcherry, can you please help investigate the failing xpu-max1100 CI? Thanks!

@tjruwase thanks! Our engineer is looking into it.
I have the same question: I came through this link, https://www.deepspeed.ai/tutorials/mixture-of-experts-inference/?utm_source=chatgpt.com#initializing-for-inference, which has this code snippet. However, it is not clear where `get_model` comes from.

```
import deepspeed
import torch.distributed...
```
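In case it helps, a minimal sketch of how that snippet can be completed, assuming `get_model` is simply a user-defined helper that constructs the model (it is not part of the DeepSpeed API); the model name, dtype, and `tp_size` below are illustrative:

```python
# Sketch only: `get_model` is assumed to be a user-defined helper, not a DeepSpeed API.
import torch
import deepspeed
from transformers import AutoModelForCausalLM


def get_model():
    # Build any torch.nn.Module here; the tutorial leaves this step to the user.
    return AutoModelForCausalLM.from_pretrained(
        "meta-llama/Llama-3.2-3B", torch_dtype=torch.bfloat16
    )


model = get_model()
# DeepSpeed shards the loaded model across ranks for tensor-parallel inference.
engine = deepspeed.init_inference(
    model, tensor_parallel={"tp_size": 4}, dtype=torch.bfloat16
)
model = engine.module
```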
@inkcherry Do you know what might cause this inconsistency?
Hi @cynricfu, can we mark this issue as completed?
Does this error occur during the training stage? I tested TP=4 finetuning with llama3.2-3B and did not seem to run into this problem.
I tested DeepSpeed's AutoTP training feature on its own. I extracted the environment I ran it in; please try whether it runs in your environment: https://github.com/delock/deepspeed_finetune_demo

$ ./run.sh 4 meta-llama/Llama-3.2-3B tp_config.json

If the config you are using differs from the one here, you can also paste it and I will try it in my environment.
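For reference, a sketch of the kind of settings such a tp_config.json could contain; the values here are illustrative and may differ from the actual file in the demo repo. The `"tensor_parallel"` / `"autotp_size"` entry is the switch described in the AutoTP training blog:

```python
# Illustrative only -- not the demo's actual tp_config.json.
# Written as a Python dict; deepspeed.initialize() also accepts a dict in place
# of a JSON file path, so the same content can be passed either way.
tp_config = {
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 1,
    "bf16": {"enabled": True},
    "zero_optimization": {"stage": 1},
    # Enables AutoTP training: weights are split across 4 ranks.
    "tensor_parallel": {"autotp_size": 4},
}
```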
@zzhdbw From your error message, the model architecture on your worker has already been sharded, which is why you see one-quarter model sizes such as torch.Size([768, 3072]). However, when loading the model, the checkpoint also needs to be sharded to match before it is loaded, and the model.load_checkpoint you are calling does not seem to do that. I ran the following command under DeepSpeedExamples/inference/huggingface/text-generation to exercise AutoTP:

deepspeed --num_gpus 4 --bind_cores_to_rank inference-test.py --dtype bfloat16 --model meta-llama/Llama-3.2-3B

and did not hit any problem, so AutoTP's support for the llama3.2 3B model should be fine; the issue is likely in the model loading stage. @inkcherry I would like to hear your suggestions.

```
AutoTP: [(, ['self_attn.o_proj', 'mlp.down_proj'])]
AutoTP: AutoTP: [(, ['self_attn.o_proj', 'mlp.down_proj'])][(, ['mlp.down_proj',...
```
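To illustrate the load order I mean, a minimal sketch of the pattern the inference example follows (illustrative, not the exact inference-test.py code): the full, unsharded checkpoint is loaded into the model first, and DeepSpeed then splits the already-loaded weights, so no per-rank checkpoint sharding step is needed.

```python
# Sketch of the load-then-shard order (illustrative, not the exact example script).
import torch
import deepspeed
from transformers import AutoModelForCausalLM

model_name = "meta-llama/Llama-3.2-3B"

# 1) Each rank loads the full, unsharded Hugging Face checkpoint.
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

# 2) AutoTP then splits the already-loaded weights across the tensor-parallel group,
#    so there is no separate sharded-checkpoint loading step.
model = deepspeed.init_inference(
    model, tensor_parallel={"tp_size": 4}, dtype=torch.bfloat16
).module
```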
@zzhdbw ZenFlow is an improvement to ZeRO offload in DeepSpeed, aimed at reducing the impact of CPU offloading on training performance; see [this tutorial](https://www.deepspeed.ai/tutorials/zenflow/) for details. The finetune demo I provided was adapted from the ZenFlow demo, which probably caused this misunderstanding. Although it started from the ZenFlow demo, it can also use TP when paired with a different config file; needing only to change the config file is a distinctive feature of DeepSpeed. Questions about OpenRLHF's TP implementation may need to be answered by the OpenRLHF authors; I am also still learning OpenRLHF. Some of DeepSpeed's documentation is a bit outdated: AutoTP now supports training as well, see this post (https://github.com/deepspeedai/DeepSpeed/tree/master/blogs/huggingface-tp).
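For readers unfamiliar with the ZeRO offload mentioned above, here is a sketch of a plain CPU-offload configuration, i.e. the baseline whose CPU-side cost ZenFlow is designed to hide; the values are illustrative, and the ZenFlow-specific options are documented in the tutorial linked above.

```python
# Illustrative baseline ZeRO-2 CPU-offload settings (not the demo's actual config).
# ZenFlow extends this kind of setup to overlap CPU-side optimizer work with GPU compute.
offload_config = {
    "train_micro_batch_size_per_gpu": 1,
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": True,
        },
    },
}
```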
> [@hijkzzz](https://github.com/hijkzzz) - I haven't had time to work on this more unfortunately. [@delock](https://github.com/delock) - [@wenbinc-Bin](https://github.com/wenbinc-Bin)'s PR seems to maybe be the culprit, but could you help take a look...