LOMO
LOMO: LOw-Memory Optimization
Hello, does this support quantized models, e.g. GPTQ? If so, scaling up proportionally, with 8x 24GB GPUs and pipeline parallelism, would it be possible to LoRA-tune a quantized 175B model? Thanks!
Hi, I'd like to run a 65B LLaMA with LOMO. What config should I use to run the training on an 8*RTX 3090 machine? It would be very nice if...
Hello friend! First of all, thanks for your amazing work!! If I use another data collator/dataset loader, would I still be able to train using the LOMO trainer class?
Overlapping the backward pass with the optimization step is a classical idea, and PyTorch already supports this overlap in DDP and FSDP. For example, here are the communication hooks in DDP: https://github.com/pytorch/pytorch/tree/main/torch/distributed/algorithms/ddp_comm_hooks...
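For illustration, here is a minimal sketch of fusing the update into the backward pass with per-parameter hooks. This is an assumption-laden toy, not LOMO's actual implementation: the plain SGD rule, the learning rate, and the `attach_fused_sgd` helper are all made up for the example, and it relies on `register_post_accumulate_grad_hook` from PyTorch 2.1+.

```python
import torch
import torch.nn as nn

# Sketch: apply the update as soon as each parameter's gradient is ready,
# then drop the gradient, so a full set of gradients is never stored and no
# separate optimizer.step() is needed. Requires PyTorch >= 2.1.
def attach_fused_sgd(model: nn.Module, lr: float = 1e-3) -> None:
    def hook(param: torch.Tensor) -> None:
        # Fires right after this parameter's gradient has been accumulated.
        with torch.no_grad():
            param.add_(param.grad, alpha=-lr)  # in-place SGD step (illustrative)
        param.grad = None  # free the gradient immediately

    for p in model.parameters():
        if p.requires_grad:
            p.register_post_accumulate_grad_hook(hook)

# Usage: attach the hooks once, then a plain backward() both computes
# gradients and updates the weights.
model = nn.Linear(16, 4)
attach_fused_sgd(model, lr=1e-2)
loss = model(torch.randn(8, 16)).pow(2).mean()
loss.backward()
```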
Is gradient accumulation still mathematically possible in this scheme? Because if so, we could train 65B on a single 3090 in a day and a half.
Does this mean LOMO is 11 times faster than AdamW?
Through comparative experiments, we found that what really reduces GPU memory is torch.set_default_dtype(torch.float16) and DeepSpeed. We ran the experiments with LLaMA-7B, using { "zero_optimization": { "stage": 0 }, "gradient_accumulation_steps": 1, "steps_per_print":...
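As a rough illustration of that kind of A/B comparison, one could measure how much the default-dtype change alone affects peak GPU memory. The toy model, batch size, and `peak_memory_mb` helper below are assumptions standing in for the original LLaMA-7B setup, and the snippet needs a CUDA device.

```python
import torch
import torch.nn as nn

# Sketch of a memory A/B test where the only variable is the default dtype.
# The stack of Linear layers is a stand-in for the real model, not the
# original experiment. Requires a CUDA device.
def peak_memory_mb(use_fp16_default: bool) -> float:
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()
    torch.set_default_dtype(torch.float16 if use_fp16_default else torch.float32)
    model = nn.Sequential(*[nn.Linear(4096, 4096) for _ in range(8)]).cuda()
    x = torch.randn(4, 4096, device="cuda")
    model(x).sum().backward()  # forward + backward to include activations/grads
    del model, x
    return torch.cuda.max_memory_allocated() / 2**20

if __name__ == "__main__":
    print(f"fp32 default: {peak_memory_mb(False):.0f} MiB peak")
    print(f"fp16 default: {peak_memory_mb(True):.0f} MiB peak")
    torch.set_default_dtype(torch.float32)  # restore the global default
```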
Personally, I feel that full-parameter fine-tuning still gives better results than adapter methods like LoRA, so why hasn't LOMO taken off? I've already fine-tuned a 7B BLOOM with LOMO on two 24GB GPUs, and the whole workflow felt quite smooth. Yet I can hardly find anyone using LOMO on any platform, which is strange.