matrixssy

15 comments from matrixssy

Encountered the same issue! Training ALPACA_LORA-13B.

> Mixtral may not support ZeRO-3

I've noticed the same thing with the DeepSpeed-Chat architecture: with ZeRO-3 enabled, GPU memory doesn't seem to be partitioned correctly, which eventually leads to an OOM.
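
For reference, a minimal ZeRO-3 config sketch (the numeric values are illustrative, not tuned for Mixtral); whether parameters actually get sharded also depends on how the model is instantiated, e.g. calling `from_pretrained` while the ZeRO-3 config is active so weights are partitioned at load time:

```python
# Sketch of a DeepSpeed ZeRO-3 config as a Python dict (can also be written as a JSON file).
# The keys are standard DeepSpeed options; the specific values are assumptions.
ds_config = {
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,  # shard parameters, gradients and optimizer states
        "overlap_comm": True,
        "contiguous_gradients": True,
        "stage3_max_live_parameters": 1e9,
        "stage3_prefetch_bucket_size": 5e7,
        "stage3_param_persistence_threshold": 1e4,
        "stage3_gather_16bit_weights_on_model_save": True,
    },
}
```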

On my side it doesn't hang; instead I get a stranger error:

```
[INFO|trainer.py:1709] 2023-12-29 09:19:48,906 >> ***** Running training *****
[INFO|trainer.py:1710] 2023-12-29 09:19:48,906 >> Num examples = 64,000
[INFO|trainer.py:1711] 2023-12-29 09:19:48,906 >> Num Epochs = 9,223,372,036,854,775,807
[INFO|trainer.py:1712] 2023-12-29 09:19:48,906 >> ...
```
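
As an aside, the `Num Epochs = 9,223,372,036,854,775,807` line itself is not the error: that value is just `sys.maxsize`, which the HF Trainer falls back to when the train dataset has no length (e.g. a streaming `IterableDataset`) and the run is bounded by `max_steps` instead. A minimal sketch (paths and step counts are placeholders):

```python
import sys
from transformers import TrainingArguments

print(sys.maxsize)  # 9223372036854775807, the value shown as "Num Epochs" above

# With an IterableDataset the epoch count is meaningless, so training length
# is controlled by max_steps rather than num_train_epochs.
args = TrainingArguments(
    output_dir="out",  # placeholder
    max_steps=1000,
)
```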

> Great work! Could you provide a script to convert Megatron Mixtral to HF?

Still working on it.
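
In the meantime, a rough outline of what such a converter would need. The name mapping below is only a placeholder; the exact Megatron/MCore parameter names depend on the checkpoint layout, so treat this as a skeleton rather than a working script:

```python
import torch
from transformers import MixtralConfig, MixtralForCausalLM

# Placeholder mapping from Megatron parameter names to HF Mixtral names; this has to be
# filled in (and fused QKV / MoE expert weights usually need splitting or transposing,
# not just renaming).
MEGATRON_TO_HF = {
    # "language_model.embedding.word_embeddings.weight": "model.embed_tokens.weight",
    # ...
}

def convert(megatron_ckpt_path: str, hf_config_path: str, out_dir: str) -> None:
    config = MixtralConfig.from_pretrained(hf_config_path)
    model = MixtralForCausalLM(config)

    ckpt = torch.load(megatron_ckpt_path, map_location="cpu")
    mg_state = ckpt.get("model", ckpt)  # Megatron checkpoints usually nest weights under "model"

    hf_state = {}
    for mg_name, tensor in mg_state.items():
        hf_name = MEGATRON_TO_HF.get(mg_name)
        if hf_name is None:
            print(f"unmapped parameter, skipping: {mg_name}")
            continue
        hf_state[hf_name] = tensor

    missing, unexpected = model.load_state_dict(hf_state, strict=False)
    print(f"missing: {len(missing)}, unexpected: {len(unexpected)}")
    model.save_pretrained(out_dir, safe_serialization=True)
```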

> Hi, I wonder if the loss is normal after converting and training Mixtral with Megatron on your machine. I applied this PR and the initial loss is quite high,...

> Hi, I fixed a bug in my script and now the initial loss is normal (around 2.3 on the arXiv dataset). Thanks for your contribution!
>
> Also, I have...
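
For context, a loss of about 2.3 is plausible for a model that already predicts well: it corresponds to a perplexity of roughly e^2.3 ≈ 10, whereas a randomly initialised model with the 32k Mistral vocabulary would start near ln(32000) ≈ 10.4. A quick check:

```python
import math

print(math.exp(2.3))    # ≈ 9.97  -> perplexity implied by a loss of 2.3
print(math.log(32000))  # ≈ 10.37 -> expected initial loss for a uniform 32k-vocab model
```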

> Hi, @matrixssy. Thanks for your contribution. There are some ongoing efforts internally at NVIDIA on the Mixtral 8x7B example. We will support converting the HF checkpoint to an MCore checkpoint...

> Hi, when I set target-tensor-parallel-size > 1, I get the following errors; only setting target-tensor-parallel-size = 1 works. Is it possible that it is related to the following...
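
Without the full traceback this is only a guess, but one common failure with target-tensor-parallel-size > 1 is a weight whose partition dimension is not divisible by the TP size, e.g. when fused or padded MoE expert weights differ from what the converter expects. A minimal sketch of the divisibility constraint (shapes are Mixtral-8x7B defaults, assumed):

```python
import torch

def split_for_tensor_parallel(weight: torch.Tensor, tp_size: int, dim: int = 0):
    """Shard one weight along its partition dimension across TP ranks."""
    if weight.size(dim) % tp_size != 0:
        raise ValueError(
            f"size {weight.size(dim)} along dim {dim} is not divisible by tp_size={tp_size}"
        )
    return torch.chunk(weight, tp_size, dim=dim)

# An expert w1 of shape (14336, 4096) splits cleanly for tp_size = 2, 4 or 8;
# a fused or padded tensor that does not divide evenly would raise here.
shards = split_for_tensor_parallel(torch.empty(14336, 4096), tp_size=2)
print([tuple(s.shape) for s in shards])
```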