Zengjie Hu(胡增杰) comments

Results: 16 comments of Zengjie Hu(胡增杰)

Why does it take such a long time to perform SFT using LoRA?

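But I still hit OOM during training, and the batch size can only be set to 1. My data is very long, so cutoff len has to be set to 65536 for it to work. In this situation, is it still unnecessary to enable z3? > Why enable z3 for LoRA? It doesn't look like you're short on GPU memory. @Kuangdd01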

Why does it take such a long time to perform SFT using LoRA?

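> Oh, for sequences that long you do need to enable it. If you suspect something is wrong with the processes, record the PIDs of the problematic ones and use py-spy to see what they are actually executing.

Hello, I followed your suggestion and used py-spy to check what these processes are doing. The result is as follows: they are not executing any task, yet nvidia-smi shows 100% GPU utilization on every GPU. After I kill these processes and run nvidia-smi again, the GPUs are released and no processes are running, so I am sure these processes are what is occupying my GPUs. The current situation is this: when I start training for the first time, many extra processes appear, and after I stop my training process with Ctrl+C, many of them still remain, which keeps GPU utilization at 100%. I suspect that the library's code spawns many extra processes, or that the worker processes are not shut down properly after data preprocessing. Could you confirm whether this is an actual issue? Thanks @Kuangdd01 @hiyouga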

Why does it take such a long time to perform SFT using LoRA?

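Also, after I ended the first training run, killed all the leftover processes, and restarted training from a checkpoint, there were no longer that many processes. Stopping the training process with Ctrl+C at that point no longer left any processes behind.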
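The check described in the comments above (find the processes still holding the GPUs, then inspect each one with py-spy) could be scripted roughly as follows. This is only a sketch, not something taken from the thread: the helper names `parse_compute_apps` and `list_gpu_pids` are made up for illustration, and it assumes `nvidia-smi` is on the PATH and supports the `--query-compute-apps` CSV query.

```python
import subprocess


def parse_compute_apps(csv_text):
    """Parse `pid, used_memory` CSV rows into (pid, used_mib) tuples.

    Hypothetical helper; used_mib is None when nvidia-smi reports a
    non-numeric value such as [N/A].
    """
    apps = []
    for line in csv_text.strip().splitlines():
        parts = [p.strip() for p in line.split(",")]
        if len(parts) != 2 or not parts[0].isdigit():
            continue  # skip blank or unexpected rows
        used = int(parts[1]) if parts[1].isdigit() else None
        apps.append((int(parts[0]), used))
    return apps


def list_gpu_pids():
    """Ask nvidia-smi which PIDs currently hold GPU memory."""
    result = subprocess.run(
        [
            "nvidia-smi",
            "--query-compute-apps=pid,used_memory",
            "--format=csv,noheader,nounits",
        ],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_compute_apps(result.stdout)
```

Each PID returned by `list_gpu_pids()` can then be inspected with `py-spy dump --pid <PID>` before deciding whether to kill it.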

Uneven GPU memory distribution in DPO

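Same problem here. I am running SFT with LoRA on Qwen2.5VL-7B-Instruct, and after a certain number of steps it suddenly runs out of memory. batch size and gradient_accumulation_steps can both only be set to 1. I am using 8 A800 GPUs with 80 GB of memory each, but during training the memory usage of a single GPU suddenly spikes and it runs out of memory: GPU 5 Memory Allocated (%) 99.98500021922756 GPU 6 Memory Allocated (%) 78.07762809620475 GPU 2 Memory Allocated (%) 62.247705610456464 GPU 3 Memory...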

Error when excuting run.py with attack_strategy=escape/naive/ignore and defense=no

And I want to ask another question: Regarding the gigaword dataset, are there alternative download links available? My cluster proxy cannot access Google Drive, so I need to use a...

Error when excuting run.py with attack_strategy=escape/naive/ignore and defense=no

> Hi [@FloSophorae](https://github.com/FloSophorae) > > First issue: When attack_strategy=escape/naive/ignore, the attacker does not need to have background information of the target task. Thus, you need to remove "target_task" from the...