LOMO
LOMO: LOw-Memory Optimization
Hello, does this support quantized models, e.g. GPTQ? If so, scaling up proportionally, with 8x 24GB GPUs and pipeline parallelism, would it be possible to LoRA-tune a quantized 175B model? Thanks!
Hi, I'd like to run a 65B LLaMA with LOMO. What config should I use to run the training on an 8*RTX 3090 machine? It would be very nice if...
Hello friend! First of all, thanks for your amazing work!! If I use another data collator/dataset loader, would I still be able to train using the LOMO trainer class?
Overlapping the backward pass with the optimization step is a classical idea, and PyTorch already supports this overlap in DDP and FSDP. For example, here are the communication hooks in DDP: https://github.com/pytorch/pytorch/tree/main/torch/distributed/algorithms/ddp_comm_hooks...
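For illustration, here is a minimal sketch of fusing the update into the backward pass with per-parameter hooks. This is an assumption-laden toy, not LOMO's actual implementation: the plain SGD rule, the learning rate, and the `attach_fused_sgd` helper are all made up for the example, and it relies on `register_post_accumulate_grad_hook` from PyTorch 2.1+.

```python
import torch
import torch.nn as nn

# Sketch: apply the update as soon as each parameter's gradient is ready,
# then drop the gradient, so a full set of gradients is never stored and no
# separate optimizer.step() is needed. Requires PyTorch >= 2.1.
def attach_fused_sgd(model: nn.Module, lr: float = 1e-3) -> None:
    def hook(param: torch.Tensor) -> None:
        # Fires right after this parameter's gradient has been accumulated.
        with torch.no_grad():
            param.add_(param.grad, alpha=-lr)  # in-place SGD step (illustrative)
        param.grad = None  # free the gradient immediately

    for p in model.parameters():
        if p.requires_grad:
            p.register_post_accumulate_grad_hook(hook)

# Usage: attach the hooks once, then a plain backward() both computes
# gradients and updates the weights.
model = nn.Linear(16, 4)
attach_fused_sgd(model, lr=1e-2)
loss = model(torch.randn(8, 16)).pow(2).mean()
loss.backward()
```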
Is gradient accumulation still mathematically possible in this scheme? Because if so, we could train 65B on a single 3090 in a day and a half.
Does this mean LOMO is 11 times faster than AdamW?
Through comparative experiments, we found that what really reduces GPU memory is torch.set_default_dtype(torch.float16) and DeepSpeed. We ran the experiments with LLaMA-7B, using { "zero_optimization": { "stage": 0 }, "gradient_accumulation_steps": 1, "steps_per_print":...
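As a rough illustration of that kind of A/B comparison, one could measure how much the default-dtype change alone affects peak GPU memory. The toy model, batch size, and `peak_memory_mb` helper below are assumptions standing in for the original LLaMA-7B setup, and the snippet needs a CUDA device.

```python
import torch
import torch.nn as nn

# Sketch of a memory A/B test where the only variable is the default dtype.
# The stack of Linear layers is a stand-in for the real model, not the
# original experiment. Requires a CUDA device.
def peak_memory_mb(use_fp16_default: bool) -> float:
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()
    torch.set_default_dtype(torch.float16 if use_fp16_default else torch.float32)
    model = nn.Sequential(*[nn.Linear(4096, 4096) for _ in range(8)]).cuda()
    x = torch.randn(4, 4096, device="cuda")
    model(x).sum().backward()  # forward + backward to include activations/grads
    del model, x
    return torch.cuda.max_memory_allocated() / 2**20

if __name__ == "__main__":
    print(f"fp32 default: {peak_memory_mb(False):.0f} MiB peak")
    print(f"fp16 default: {peak_memory_mb(True):.0f} MiB peak")
    torch.set_default_dtype(torch.float32)  # restore the global default
```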
Personally, I feel that full-parameter fine-tuning still gives better results than adapter methods like LoRA, so why hasn't LOMO taken off? I've already fine-tuned a 7B BLOOM with LOMO on two 24GB GPUs, and the whole workflow felt quite smooth. Yet I can hardly find anyone using LOMO on any platform, which is strange.