jiaqiw09
I ran into the same problem: ZeRO-3 + SFT hangs at the very first step that would print the training loss.
> I suspect a communication problem. After upgrading NCCL, some people running DeepSpeed ZeRO-3 + LoRA hit `Invalidate trace cache @ step 738: expected module 752, but got module 784` partway through training, and then it keeps hanging. I don't have NVLink here; could someone try `export NCCL_P2P_LEVEL=NVL` and ping me if DeepSpeed ZeRO-3 + LoRA runs through after setting it?

I tried that on my side with no effect; it still hangs at the first step. My nvcc version is cuda_12.1.r12.1/compiler.32415258_0.
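For anyone who wants to set this from inside a launcher script rather than the shell, here is a minimal sketch. Only the variable name and value come from the quoted suggestion; everything else (the `torch.distributed` setup) is illustrative and assumes a standard `torchrun`-style environment:

```python
import os

# NCCL_P2P_LEVEL=NVL tells NCCL to use peer-to-peer transfers only between
# GPUs connected via NVLink. It must be set before NCCL initializes.
os.environ["NCCL_P2P_LEVEL"] = "NVL"

import torch.distributed as dist

# Illustrative only: any DeepSpeed/torchrun entry point that initializes the
# process group after this point will pick the variable up.
dist.init_process_group(backend="nccl")
```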
> > Mixtral may not support ZeRO-3
>
> Then was ZeRO-3 not used to train this model either? How was it trained?

@hiyouga, was the fine-tuning done with ZeRO-2 plus quantization?
> A temporary workaround is, if the error is
>
> > RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:3 and...
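The quoted workaround is cut off above, so the following is not necessarily what it said. As a generic illustration only (an assumption on my part, with a placeholder model name), one common way to avoid this class of error is to pin the whole model to a single device instead of letting it shard across GPUs:

```python
from transformers import AutoModelForCausalLM

# Hypothetical illustration: the empty-string key in device_map places every
# module on the given device, so no tensors end up split across cuda:0/cuda:3.
model = AutoModelForCausalLM.from_pretrained("some_model", device_map={"": 0})
```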
@haileyschoelkopf @lintangsutawika would you mind taking a look? Best
@haileyschoelkopf @lintangsutawika thanks for your suggestion, and thanks to @statelesshz for helping fix the code. I have just tested the code on both NPU and GPU, and all three methods work. Would you mind having...
I have already made a PR (https://github.com/EleutherAI/lm-evaluation-harness/pull/1787); it should be a better way to show how to adapt an NPU to huggingface.py.
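For context on what "adapting an NPU" can look like, here is a minimal sketch of the device-selection pattern involved. The helper name `resolve_device` is hypothetical, and this is not the PR's actual diff:

```python
import torch

def resolve_device() -> str:
    """Hypothetical helper: prefer CUDA, then an Ascend NPU, then CPU."""
    if torch.cuda.is_available():
        return "cuda"
    try:
        # Importing torch_npu registers the "npu" backend on torch.
        import torch_npu  # noqa: F401
        if torch.npu.is_available():
            return "npu"
    except ImportError:
        pass
    return "cpu"
```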
@haileyschoelkopf Hi, would you mind taking a look?
@gspeter-max thanks for your PR. I just checked it on Ascend NPUs with torch 2.1.0 and torch_npu 2.1.0, and it works. Here is my test code:

```python
from transformers import AutoModelForCausalLM

# device_map="auto" dispatches the weights onto the available NPU(s)
model = AutoModelForCausalLM.from_pretrained("qwen2_7b", device_map="auto")
```

cc @SunMarc
> Thanks a lot for verifying @jiaqiw09! Great to hear it works well on Ascend NPUs with torch 2.1.0 and torch_npu 2.1.0. Let me know if there’s anything else needed...