Guo Qipeng
Thanks for the information, we will look into it. In-place updating is a classical engineering trick, and our goal is to provide a solution for low-resource training. Our paper also discusses...
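For readers unfamiliar with the idea, here is a minimal sketch of what "in-place updating" means in practice: the SGD step is fused into the backward pass so each gradient is applied and freed immediately instead of being stored for a later `optimizer.step()`. The helper name `attach_fused_sgd` and the use of `register_post_accumulate_grad_hook` (PyTorch >= 2.1) are illustrative assumptions, not the actual LOMO implementation, which uses its own hook bookkeeping.

```python
import torch

def attach_fused_sgd(model, lr=1e-3):
    # Hypothetical helper (not part of LOMO): register a per-parameter hook so
    # the SGD step runs inside backward(); each gradient is applied in place
    # and released immediately, so full-model gradients never sit in GPU memory.
    # register_post_accumulate_grad_hook requires PyTorch >= 2.1.
    def step(param):
        with torch.no_grad():
            param.add_(param.grad, alpha=-lr)  # in-place parameter update
        param.grad = None                      # free the gradient right away

    for p in model.parameters():
        if p.requires_grad:
            p.register_post_accumulate_grad_hook(step)
```

After attaching the hooks, a training step is just `loss.backward()`; there is no separate `optimizer.step()`.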
No, the gradients are no longer kept in GPU memory. If you instead offload the gradient tensors to CPU memory or NVMe, there is a large data-transfer cost.
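For context, an offloading alternative would look roughly like the sketch below (a hypothetical helper, not something LOMO does); every gradient then has to cross the PCIe bus on every step, which is much slower than staying in device memory.

```python
import torch

def offload_grad_to_cpu(param: torch.nn.Parameter) -> torch.Tensor:
    # Hypothetical alternative to the in-place update: keep the gradient, but
    # move it to host memory and free it on the GPU. The device-to-host copy
    # itself is the transfer cost that the in-place update avoids.
    grad_cpu = param.grad.detach().cpu()  # copy over PCIe
    param.grad = None                     # free the GPU copy
    return grad_cpu
```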
The method itself is model-agnostic, so in principle it supports quantized models (as long as you are still using PyTorch and have not modified the backward pass), but we have not tested this. Also, GPTQ appears to be a post-training quantization method, so training quality is not necessarily guaranteed.
Basically, yes. The main contribution is integrating various techniques for reducing GPU memory usage, together with some tricks for keeping training stable.
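As one generic example of such a stability trick (an illustrative sketch, not necessarily the exact recipe used in LOMO), the gradient can be clipped element-wise before the in-place step so that a single noisy batch cannot derail the parameters:

```python
import torch

def clipped_inplace_step(param, lr=1e-3, clip_value=1.0):
    # Illustrative stability trick: clamp the gradient before the in-place
    # update. lr and clip_value are placeholder values, not LOMO defaults.
    with torch.no_grad():
        grad = param.grad.clamp(-clip_value, clip_value)
        param.add_(grad, alpha=-lr)
    param.grad = None
```

This function could be plugged in as the per-parameter hook from the earlier sketch.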
Thanks for the question; it is a fair point. However, since our compute resources are limited, we have run relatively few comparison experiments against Adam (they require a large amount of GPU memory). We did run some small-scale tests on the 7B model, but they were not very thorough. The results are basically comparable, though the gap varies across tasks and datasets.
A comparison with Adam is in our follow-up plans, but we cannot commit to a specific timeline.
Good question; we do not know how LOMO will perform in the pre-training stage. The major concern is that SGD is sensitive to the optimization settings. My guess is that...
Hi, can you provide more training details, for example the training loss curve? If the loss jitters a lot, the best choice may be to use a lower learning rate, ...
1e-5 is common for Adam. The learning-rate scale for Adam is often much smaller than for SGD, since Adam rescales its update before it has enough...
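For reference, the standard Adam update (Kingma & Ba, 2015) normalizes each coordinate of the gradient by a running estimate of its magnitude, which is why its effective step is roughly the learning rate itself and a much smaller value works than for plain SGD:

```latex
% Standard Adam update; g_t is the gradient at step t.
m_t = \beta_1 m_{t-1} + (1-\beta_1)\, g_t, \qquad
v_t = \beta_2 v_{t-1} + (1-\beta_2)\, g_t^2
\hat{m}_t = \frac{m_t}{1-\beta_1^t}, \qquad
\hat{v}_t = \frac{v_t}{1-\beta_2^t}, \qquad
\theta_t = \theta_{t-1} - \eta \, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}
```

Because each coordinate of \hat{m}_t / \sqrt{\hat{v}_t} is roughly of unit scale, Adam's step size is about \eta regardless of the raw gradient magnitude, whereas SGD's step scales with the gradient itself and therefore usually needs a larger \eta.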
Please check this line, https://github.com/OpenLMLab/LOMO/blob/main/src/train_lomo.py#L107