Guo Qipeng
Thanks for the information, we will look into it. In-place updating is a classical engineering trick, and our goal is to provide a solution for low-resource training. Our paper also discusses...
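For readers unfamiliar with the idea, here is a minimal sketch of what "in-place updating" means in practice: the SGD step is fused into the backward pass so each gradient is applied and freed immediately instead of being stored for a later `optimizer.step()`. The helper name `attach_fused_sgd` and the use of `register_post_accumulate_grad_hook` (PyTorch >= 2.1) are illustrative assumptions, not the actual LOMO implementation, which uses its own hook bookkeeping.

```python
import torch

def attach_fused_sgd(model, lr=1e-3):
    # Hypothetical helper (not part of LOMO): register a per-parameter hook so
    # the SGD step runs inside backward(); each gradient is applied in place
    # and released immediately, so full-model gradients never sit in GPU memory.
    # register_post_accumulate_grad_hook requires PyTorch >= 2.1.
    def step(param):
        with torch.no_grad():
            param.add_(param.grad, alpha=-lr)  # in-place parameter update
        param.grad = None                      # free the gradient right away

    for p in model.parameters():
        if p.requires_grad:
            p.register_post_accumulate_grad_hook(step)
```

After attaching the hooks, a training step is just `loss.backward()`; there is no separate `optimizer.step()`.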
No, the gradients are no longer kept in GPU memory. If you instead offload the gradient tensors to CPU memory or NVMe, there is a large data-transfer cost.
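For context, an offloading alternative would look roughly like the sketch below (a hypothetical helper, not something LOMO does); every gradient then has to cross the PCIe bus on every step, which is much slower than staying in device memory.

```python
import torch

def offload_grad_to_cpu(param: torch.nn.Parameter) -> torch.Tensor:
    # Hypothetical alternative to the in-place update: keep the gradient, but
    # move it to host memory and free it on the GPU. The device-to-host copy
    # itself is the transfer cost that the in-place update avoids.
    grad_cpu = param.grad.detach().cpu()  # copy over PCIe
    param.grad = None                     # free the GPU copy
    return grad_cpu
```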
The method itself is model-agnostic, so in principle it supports quantized models (as long as you are still using PyTorch and have not modified the backward pass), but we have not tested this. Also, GPTQ appears to be a post-training quantization method, so training quality is not necessarily guaranteed.
Basically, yes. The main contribution is integrating various techniques for reducing GPU memory usage, together with some tricks for keeping training stable.
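As one generic example of such a stability trick (an illustrative sketch, not necessarily the exact recipe used in LOMO), the gradient can be clipped element-wise before the in-place step so that a single noisy batch cannot derail the parameters:

```python
import torch

def clipped_inplace_step(param, lr=1e-3, clip_value=1.0):
    # Illustrative stability trick: clamp the gradient before the in-place
    # update. lr and clip_value are placeholder values, not LOMO defaults.
    with torch.no_grad():
        grad = param.grad.clamp(-clip_value, clip_value)
        param.add_(grad, alpha=-lr)
    param.grad = None
```

This function could be plugged in as the per-parameter hook from the earlier sketch.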
Thanks for the question; it is a fair point. However, since our compute resources are limited, we have run relatively few comparison experiments against Adam (they require a large amount of GPU memory). We did run some small-scale tests on the 7B model, but they were not very thorough. The results are basically comparable, though the gap varies across tasks and datasets.
A comparison with Adam is in our follow-up plans, but we cannot commit to a specific timeline.
Good question; we do not know how LOMO will perform in the pre-training stage. The major concern is that SGD is sensitive to the optimization settings. My guess is that...
Hi, can you provide more training details, for example the training loss curve? If the loss jitters a lot, the best choice may be to use a lower learning rate, ...
1e-5 is common for Adam. The learning-rate scale for Adam is often much smaller than for SGD, since Adam rescales its update before it has enough...
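For reference, the standard Adam update (Kingma & Ba, 2015) normalizes each coordinate of the gradient by a running estimate of its magnitude, which is why its effective step is roughly the learning rate itself and a much smaller value works than for plain SGD:

```latex
% Standard Adam update; g_t is the gradient at step t.
m_t = \beta_1 m_{t-1} + (1-\beta_1)\, g_t, \qquad
v_t = \beta_2 v_{t-1} + (1-\beta_2)\, g_t^2
\hat{m}_t = \frac{m_t}{1-\beta_1^t}, \qquad
\hat{v}_t = \frac{v_t}{1-\beta_2^t}, \qquad
\theta_t = \theta_{t-1} - \eta \, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}
```

Because each coordinate of \hat{m}_t / \sqrt{\hat{v}_t} is roughly of unit scale, Adam's step size is about \eta regardless of the raw gradient magnitude, whereas SGD's step scales with the gradient itself and therefore usually needs a larger \eta.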
Please check this line, https://github.com/OpenLMLab/LOMO/blob/main/src/train_lomo.py#L107