fzppp comments

Results: 10 comments by fzppp

[CLI]: GPUs Hanging when distributed training caused by `wandb.watch`

Same issue. The whole training process stalls for several minutes at a fixed step interval (the rank 0 GPU drops to 0% utilization). Everything returns to normal once I delete the `wandb_watch["all"]` setting.
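A minimal sketch of how the watch call could be made easy to switch off while debugging a stall like this one. The helper names and the `ENABLE_WANDB_WATCH` environment variable are hypothetical, not part of wandb; the watch function is passed in, so nothing here assumes how the training script configures it.

```python
import os


def watch_enabled(env=None):
    """Return True only when the hypothetical ENABLE_WANDB_WATCH flag is "1"."""
    env = os.environ if env is None else env
    return env.get("ENABLE_WANDB_WATCH", "0") == "1"


def maybe_watch(model, watch_fn, env=None, **watch_kwargs):
    """Call watch_fn(model, **watch_kwargs) only when watching is enabled.

    Returns True if watch_fn was called, False if it was skipped.
    """
    if not watch_enabled(env):
        return False
    watch_fn(model, **watch_kwargs)
    return True
```

In a training script this might be called as `maybe_watch(model, wandb.watch, log="all")`, so that leaving the flag unset skips the watch hook entirely.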

During multi-image SFT fine-tuning: ValueError: Image features and image tokens do not match: tokens: 1207, features 1476

> > Folks, it seems training crashes at around epoch 0.96 every time. Is there any workaround? I'm training with single images.

Has this been solved? 😭

During multi-image SFT fine-tuning: ValueError: Image features and image tokens do not match: tokens: 1207, features 1476

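> > Folks, it seems training crashes at around epoch 0.96 every time. Is there any workaround? I'm training with single images.

Could it be caused by one specific image?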

During multi-image SFT fine-tuning: ValueError: Image features and image tokens do not match: tokens: 1207, features 1476

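[Resolved] The image was too large: after padding, the prompt exceeded `max_prompt_length` and was truncated. Either increase `max_prompt_length` or filter out the larger images.

> > > > Folks, it seems training crashes at around epoch 0.96 every time. Is there any workaround? I'm training with single images.
>
> > Could it be caused by one specific image?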
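The second option, filtering out large images, can be sketched as follows. This is only an illustration under stated assumptions: the sample fields `image_sizes` and `num_text_tokens`, the helper names, and the rule of one token per 28x28 pixel tile are all hypothetical. The real image token count depends on the model's processor and should be taken from it where possible.

```python
import math


def estimate_image_tokens(width, height, patch=28):
    """Rough estimate (assumption): one token per patch x patch pixel tile."""
    return math.ceil(width / patch) * math.ceil(height / patch)


def fits_prompt(sample, max_prompt_length, patch=28):
    """True if estimated image tokens plus text tokens fit within the limit."""
    image_tokens = sum(
        estimate_image_tokens(w, h, patch) for w, h in sample["image_sizes"]
    )
    return image_tokens + sample["num_text_tokens"] <= max_prompt_length


def filter_oversized(samples, max_prompt_length, patch=28):
    """Keep only the samples whose prompt would not need truncation."""
    return [s for s in samples if fits_prompt(s, max_prompt_length, patch)]
```

Running the filter once before training also shows how many samples would have been truncated, which helps in deciding whether raising `max_prompt_length` is the better choice.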

[Bug]: Invalid Device Ordinal on ROCm

> @ehartford Hello Eric, I just encountered the same issue as you. Have you resolved it?

[Bug]: Invalid Device Ordinal on ROCm

> did you solve this?

Did you solve this? Several of my machines encountered this issue at the same time.

[Bug]: Invalid Device Ordinal on ROCm

> > did you solve this?
>
> Did you solve this? Several of my machines encountered this issue at the same time.

Reinstalling the conda env can solve this.

Failed to save the gptq quantized weight on Qwen2 72B.

Additional info: auto_gptq 0.7.1, transformers 4.42.2. GPTQ works fine for Yi-34B, but not for Qwen2 72B.

Failed to save the gptq quantized weight on Qwen2 72B.

I changed the setting to `init_kwargs["max_memory"] = {0: "20GIB", 1: "20GIB", 2: "20GIB", 3:"20GIB", 'cpu': "250GIB"}`. The machine has 251 GB of RAM and 4x 48 GB A40 GPUs.
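For reference, a mapping of this shape (integer GPU indices plus a `"cpu"` entry) can be built programmatically instead of typed by hand. The helper `build_max_memory` is hypothetical and only mirrors the layout of the dict in the comment above; it writes the unit as `GiB`.

```python
def build_max_memory(num_gpus, per_gpu_gib, cpu_gib):
    """Build a max_memory mapping: one entry per GPU index plus a "cpu" entry."""
    max_memory = {i: f"{per_gpu_gib}GiB" for i in range(num_gpus)}
    max_memory["cpu"] = f"{cpu_gib}GiB"
    return max_memory


# Same budget as in the comment: 4 GPUs capped at 20 GiB each, 250 GiB for the CPU.
max_memory = build_max_memory(num_gpus=4, per_gpu_gib=20, cpu_gib=250)
```

The result could then be assigned to `init_kwargs["max_memory"]` as in the comment.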

Failed to save the gptq quantized weight on Qwen2 72B.

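> @fzp0424 Hello, has this problem been solved?

Switch to a machine with more RAM and move the entire quantization process onto the CPU (this will be slower), or use 4x A100 GPUs.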