training on fewer gpus

Open AmeenAli opened this issue 1 year ago • 6 comments

Hi!

I am trying to train with 4 GPUs (24 GB each) instead of the suggested 8, but the code always fails at this line: https://github.com/dvlab-research/LISA/blob/main/train_ds.py#L305

The error is CUDA OOM. How can I reconfigure the settings to allow slower training on fewer GPUs?

Thanks!

AmeenAli, Sep 15 '23 07:09

Can you lower batch_size and increase grad_accumulation_steps so that their product stays the same?
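
To make the arithmetic concrete, here is a minimal sketch. The starting values (batch_size 2, grad_accumulation_steps 10, 8 GPUs) are illustrative assumptions rather than values confirmed in this thread, and both function names are hypothetical helpers. It also assumes the goal is to keep the global batch across all GPUs unchanged, which means accounting for the GPU count as well as the per-GPU product.

```python
def effective_batch_size(per_gpu_batch, grad_accum_steps, num_gpus):
    """Samples contributing to one optimizer step across all GPUs."""
    return per_gpu_batch * grad_accum_steps * num_gpus


def rescale_grad_accum(target_batch, new_per_gpu_batch, new_num_gpus):
    """Accumulation steps needed to reach target_batch on the new setup."""
    per_step = new_per_gpu_batch * new_num_gpus
    if target_batch % per_step != 0:
        raise ValueError("target batch is not divisible by per-step batch")
    return target_batch // per_step


# Assumed baseline: batch_size=2, grad_accumulation_steps=10, 8 GPUs.
baseline = effective_batch_size(per_gpu_batch=2, grad_accum_steps=10, num_gpus=8)

# New setup: 4 GPUs with batch_size lowered to 1.
new_accum = rescale_grad_accum(baseline, new_per_gpu_batch=1, new_num_gpus=4)
print(baseline, new_accum)
```

Note that a smaller per-GPU batch mainly reduces activation memory. The memory taken by model weights and optimizer state is unaffected, so an OOM caused by those will persist.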

X-Lai, Sep 16 '23 15:09

I have tried that, but the code seems to crash at the same place every time:

https://github.com/dvlab-research/LISA/blob/main/train_ds.py#L305

AmeenAli, Sep 16 '23 16:09

> I have tried that, but the code seems to crash at the same place every time:
>
> https://github.com/dvlab-research/LISA/blob/main/train_ds.py#L305

I had the same problem. After changing the SAM huge model to the big model, the OOM was fixed.

xbkaishui, Sep 19 '23 02:09

I added offloading to CPU and it helped.

AmrinKareem, Nov 01 '23 11:11

I have the same problem. Has anyone solved it?

Wbzb, Jan 10 '24 08:01

> I added offloading to CPU and it helped.

Could you explain how you offloaded to CPU? I tried the following config, but I still get OOM.

    "offload_optimizer": {
        "device": "cpu",
        "pin_memory": True
    },
    "offload_param": {
        "device": "cpu",
        "pin_memory": True
    },

Wbzb, Jan 10 '24 08:01
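
One thing that may be worth checking with the fragment above: the DeepSpeed documentation describes `offload_param` as available only with ZeRO stage 3, while `offload_optimizer` also works at stage 2. If the surrounding `zero_optimization` block is still at stage 2, the parameter offload would not take effect. Below is a minimal sketch of how the pieces could fit together. `build_ds_config` is a hypothetical helper, not part of LISA or DeepSpeed, the batch values are placeholders, and whether LISA's training script runs correctly under stage 3 is an untested assumption.

```python
def build_ds_config(micro_batch_size, grad_accum_steps, offload_params=True):
    """Return a DeepSpeed-style config dict (hypothetical helper, not from LISA)."""
    cpu_offload = {"device": "cpu", "pin_memory": True}

    zero = {
        # Parameter offload needs stage 3; optimizer offload alone can stay at 2.
        "stage": 3 if offload_params else 2,
        "offload_optimizer": dict(cpu_offload),
    }
    if offload_params:
        zero["offload_param"] = dict(cpu_offload)

    return {
        "train_micro_batch_size_per_gpu": micro_batch_size,
        "gradient_accumulation_steps": grad_accum_steps,
        "zero_optimization": zero,
    }


# Placeholder values; adjust to your own setup.
ds_config = build_ds_config(micro_batch_size=1, grad_accum_steps=40)
print(ds_config["zero_optimization"])
```

Offloading trades GPU memory for host RAM and slower steps, and it does not shrink activation memory, so an OOM raised during the forward or backward pass may still need a smaller batch or a smaller SAM backbone.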