Ma, Guokai
@inkcherry is this PR still active? There are merge conflicts.
> @delock, @inkcherry, can you please help investigate the failing xpu-max1100 CI? Thanks!

@tjruwase thanks! Our engineer is looking into it.
I have the same question: I came through this link, https://www.deepspeed.ai/tutorials/mixture-of-experts-inference/?utm_source=chatgpt.com#initializing-for-inference, which has this code snippet. However, it is not clear where `get_model` comes from.

```
import deepspeed
import torch.distributed...
```
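In case it helps, a minimal sketch of how that snippet can be completed, assuming `get_model` is simply a user-defined helper that constructs the model (it is not part of the DeepSpeed API); the model name, dtype, and `tp_size` below are illustrative:

```python
# Sketch only: `get_model` is assumed to be a user-defined helper, not a DeepSpeed API.
import torch
import deepspeed
from transformers import AutoModelForCausalLM


def get_model():
    # Build any torch.nn.Module here; the tutorial leaves this step to the user.
    return AutoModelForCausalLM.from_pretrained(
        "meta-llama/Llama-3.2-3B", torch_dtype=torch.bfloat16
    )


model = get_model()
# DeepSpeed shards the loaded model across ranks for tensor-parallel inference.
engine = deepspeed.init_inference(
    model, tensor_parallel={"tp_size": 4}, dtype=torch.bfloat16
)
model = engine.module
```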
@inkcherry Do you know what might cause this inconsistency?
Hi @cynricfu, can we mark this issue as completed?
Does this error occur during the training stage? I tested TP=4 finetuning with llama3.2-3B and did not seem to run into this problem.
I tested DeepSpeed's AutoTP training feature on its own. I extracted the environment I ran it in; please try whether it runs in your environment: https://github.com/delock/deepspeed_finetune_demo

$ ./run.sh 4 meta-llama/Llama-3.2-3B tp_config.json

If the config you are using differs from the one here, you can also paste it and I will try it in my environment.
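For reference, a sketch of the kind of settings such a tp_config.json could contain; the values here are illustrative and may differ from the actual file in the demo repo. The `"tensor_parallel"` / `"autotp_size"` entry is the switch described in the AutoTP training blog:

```python
# Illustrative only -- not the demo's actual tp_config.json.
# Written as a Python dict; deepspeed.initialize() also accepts a dict in place
# of a JSON file path, so the same content can be passed either way.
tp_config = {
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 1,
    "bf16": {"enabled": True},
    "zero_optimization": {"stage": 1},
    # Enables AutoTP training: weights are split across 4 ranks.
    "tensor_parallel": {"autotp_size": 4},
}
```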
@zzhdbw From your error message, the model architecture on your worker has already been sharded, which is why you see one-quarter model sizes such as torch.Size([768, 3072]). However, when loading the model, the checkpoint also needs to be sharded to match before it is loaded, and the model.load_checkpoint you are calling does not seem to do that. I ran the following command under DeepSpeedExamples/inference/huggingface/text-generation to exercise AutoTP:

deepspeed --num_gpus 4 --bind_cores_to_rank inference-test.py --dtype bfloat16 --model meta-llama/Llama-3.2-3B

and did not hit any problem, so AutoTP's support for the llama3.2 3B model should be fine; the issue is likely in the model loading stage. @inkcherry I would like to hear your suggestions.

```
AutoTP: [(, ['self_attn.o_proj', 'mlp.down_proj'])]
AutoTP: AutoTP: [(, ['self_attn.o_proj', 'mlp.down_proj'])][(, ['mlp.down_proj',...
```
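To illustrate the load order I mean, a minimal sketch of the pattern the inference example follows (illustrative, not the exact inference-test.py code): the full, unsharded checkpoint is loaded into the model first, and DeepSpeed then splits the already-loaded weights, so no per-rank checkpoint sharding step is needed.

```python
# Sketch of the load-then-shard order (illustrative, not the exact example script).
import torch
import deepspeed
from transformers import AutoModelForCausalLM

model_name = "meta-llama/Llama-3.2-3B"

# 1) Each rank loads the full, unsharded Hugging Face checkpoint.
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

# 2) AutoTP then splits the already-loaded weights across the tensor-parallel group,
#    so there is no separate sharded-checkpoint loading step.
model = deepspeed.init_inference(
    model, tensor_parallel={"tp_size": 4}, dtype=torch.bfloat16
).module
```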
@zzhdbw ZenFlow is an improvement to ZeRO offload in DeepSpeed, aimed at reducing the impact of CPU offloading on training performance; see [this tutorial](https://www.deepspeed.ai/tutorials/zenflow/) for details. The finetune demo I provided was adapted from the ZenFlow demo, which probably caused this misunderstanding. Although it started from the ZenFlow demo, it can also use TP when paired with a different config file; needing only to change the config file is a distinctive feature of DeepSpeed. Questions about OpenRLHF's TP implementation may need to be answered by the OpenRLHF authors; I am also still learning OpenRLHF. Some of DeepSpeed's documentation is a bit outdated: AutoTP now supports training as well, see this post (https://github.com/deepspeedai/DeepSpeed/tree/master/blogs/huggingface-tp).
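For readers unfamiliar with the ZeRO offload mentioned above, here is a sketch of a plain CPU-offload configuration, i.e. the baseline whose CPU-side cost ZenFlow is designed to hide; the values are illustrative, and the ZenFlow-specific options are documented in the tutorial linked above.

```python
# Illustrative baseline ZeRO-2 CPU-offload settings (not the demo's actual config).
# ZenFlow extends this kind of setup to overlap CPU-side optimizer work with GPU compute.
offload_config = {
    "train_micro_batch_size_per_gpu": 1,
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": True,
        },
    },
}
```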
> [@hijkzzz](https://github.com/hijkzzz) - I haven't had time to work on this more unfortunately. [@delock](https://github.com/delock) - [@wenbinc-Bin](https://github.com/wenbinc-Bin)'s PR seems to maybe be the culprit, but could you help take a look...