
Zero123 not working on A40 GPU 46GB ram

Open Bhavay-2001 opened this issue 10 months ago • 3 comments

Hello authors,

I'm running the training script, main.py, on 2 A40 GPUs with 46 GB of memory each. I have reduced the batch size all the way down to 1. When I set num_workers to 0, the run stops abruptly at Epoch 0. When I set it to any other value, it throws the following error, raised at "RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str)) from e": "RuntimeError: DataLoader worker (pid(s) 32823, 32919) exited unexpectedly".

The solution I found online was to set num_workers to 0, but as described above, that doesn't solve the problem. Can you please tell me what the issue is?

Edit: accumulate_grad_batches is set to 1 in my case. Should I change it to 4?
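
Note for anyone trying to narrow this down: DataLoader workers run on the CPU side, so one commonly reported cause of "exited unexpectedly" is the host running short of RAM or shared memory, not the GPU. This has not been confirmed as the cause in this thread. The sketch below is only a diagnostic, assuming a Linux host; the helper name host_memory_report and the default /dev/shm path are illustrative and not part of the Zero123 code. Running it before and during training shows whether host memory is the bottleneck.

```python
import os
import shutil


def gib(n_bytes):
    """Convert a byte count to GiB."""
    return n_bytes / 2**30


def host_memory_report(shm_path="/dev/shm"):
    """Return host RAM and shared-memory figures in GiB (hypothetical helper).

    Keys are only present when the platform exposes the information.
    """
    report = {}
    try:
        # Physical RAM via sysconf; these names are available on Linux.
        page = os.sysconf("SC_PAGE_SIZE")
        report["ram_total_gib"] = gib(page * os.sysconf("SC_PHYS_PAGES"))
        report["ram_avail_gib"] = gib(page * os.sysconf("SC_AVPHYS_PAGES"))
    except (AttributeError, ValueError, OSError):
        pass  # not supported on this platform; skip the RAM figures
    if os.path.isdir(shm_path):
        # Shared memory is what worker processes use to hand batches back.
        usage = shutil.disk_usage(shm_path)
        report["shm_total_gib"] = gib(usage.total)
        report["shm_free_gib"] = gib(usage.free)
    return report


if __name__ == "__main__":
    for key, value in host_memory_report().items():
        print(f"{key}: {value:.2f}")
```

If the available RAM or free shared memory drops close to zero just before the crash, the process is probably being killed by the operating system, which would also be consistent with the run stopping abruptly when num_workers is 0.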

Bhavay-2001 • Aug 30 '23 19:08

I am getting this error too -- any idea why this happens?

kalyani7195 • Oct 04 '23 07:10

Me too, I am also getting this error -- does anyone know how to fix it?

cooperrfeng • Oct 09 '23 08:10

Have you fixed it?

gitsawww • Oct 31 '23 04:10