fzppp
Same issue. The whole training process stalls for several minutes at fixed step intervals (the rank 0 GPU drops to 0% utilization). Everything returns to normal once I delete the "wandb_watch["all"]" setting.
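> > Folks, it seems training always breaks at around 0.96 epoch. Is there any way to fix this? I'm using single-image samples. Has this been solved? 😭
> > Folks, it seems training always breaks at around 0.96 epoch. Is there any way to fix this? I'm using single-image samples. Could it be caused by one specific image?
[Solved] The image was too large: after padding it exceeded max_prompt_length and was truncated. Either increase max_prompt_length or filter out the larger images. > > > > Folks, it seems training always breaks at around 0.96 epoch. Is there any way to fix this? I'm using single-image samples. > > Could it be caused by one specific image?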
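A minimal sketch of the "filter out the larger images" workaround, in case it helps. It assumes you can compute the token count of each sample's prompt (including image tokens) yourself; `split_by_prompt_length`, `count_tokens`, and the `num_tokens` field are hypothetical names for illustration, not part of any library.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_by_prompt_length(
    samples: Sequence[T],
    count_tokens: Callable[[T], int],
    max_prompt_length: int,
) -> Tuple[List[T], List[T]]:
    """Split samples into (kept, dropped) by prompt token count.

    A sample is kept only if its prompt fits within max_prompt_length,
    so nothing that reaches training would need to be truncated.
    """
    kept: List[T] = []
    dropped: List[T] = []
    for sample in samples:
        if count_tokens(sample) <= max_prompt_length:
            kept.append(sample)
        else:
            dropped.append(sample)
    return kept, dropped


# Toy data: "num_tokens" is a hypothetical, precomputed prompt length
# (text tokens plus image tokens) for each sample.
samples = [
    {"id": "a", "num_tokens": 900},
    {"id": "b", "num_tokens": 5000},  # oversized image, would be truncated
    {"id": "c", "num_tokens": 2048},
]

kept, dropped = split_by_prompt_length(
    samples, lambda s: s["num_tokens"], max_prompt_length=2048
)
print("kept:", [s["id"] for s in kept])
print("dropped:", [s["id"] for s in dropped])
```

Logging the dropped samples also makes it easy to confirm whether a single oversized image is what breaks the run at the same point each epoch.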
> [ehartford](/ehartford) Hello Eric, I just encountered the same issue as you. Have you resolved it?
> did you solve this? Did you solve this? Several of my machines encountered this issue at the same time.
> > did you solve this? > > Did you solve this? Several of my machines encountered this issue at the same time. Reinstalling the conda env solves this.
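Additional info: auto_gptq 0.7.1, transformers 4.42.2. GPTQ quantization of Yi-34B works fine; Qwen2 72B does not.
I changed init_kwargs["max_memory"] = {0: "20GIB", 1: "20GIB", 2: "20GIB", 3:"20GIB", 'cpu': "250GIB"}. The machine has 251 GB of RAM and 4x 48 GB A40 GPUs.
> @fzp0424 Hello, has this problem been solved? Switch to a machine with more RAM and move the entire quantization process onto the CPU (it will be slower), or use 4x A100.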
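For anyone adapting the setting above to a different GPU count, here is a small sketch that builds the same kind of mapping: integer GPU indices plus a 'cpu' key, each with a size string. `build_max_memory` is a hypothetical helper, and the values simply mirror the ones in this comment; the thread does not establish that this setting alone fixes the Qwen2 72B case.

```python
from typing import Dict, Union


def build_max_memory(
    num_gpus: int, per_gpu: str, cpu: str
) -> Dict[Union[int, str], str]:
    """Build a max_memory mapping: one cap per GPU index, plus a CPU RAM cap."""
    max_memory: Dict[Union[int, str], str] = {i: per_gpu for i in range(num_gpus)}
    max_memory["cpu"] = cpu
    return max_memory


# Same values as in the comment above: 4 GPUs capped at 20GIB each,
# with up to 250GIB of CPU RAM available for offloading.
max_memory = build_max_memory(num_gpus=4, per_gpu="20GIB", cpu="250GIB")
print(max_memory)
```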