
Training CUDA Out Of Memory error

Kev1MSL opened this issue 1 year ago • 10 comments

Hi! I am trying to train the InstantMesh model, but I am hitting a CUDA out of memory error just before backpropagation. Did you face a similar issue when training, and how did you solve it? I am training on 8 GPUs with the same memory as H800s, as described in the paper. Thanks!

Kev1MSL avatar May 30 '24 16:05 Kev1MSL

Please check your CUDA devices.

sumanttyagi avatar May 31 '24 03:05 sumanttyagi

I am using a single A800 (80 GB), but I can only train with batch_size=1; if I set batch_size=2, I also get a CUDA out of memory error. [screenshots of the error]

gaodalii avatar May 31 '24 04:05 gaodalii

Yes, same thing: when I set batch_size=1 it works, but batch_size=2 does not. However, I am only short a few GB (~2 GB), so I was wondering if there is a way to optimize this. Also, what happens if I distribute training across multiple GPUs with batch_size=1: is it one batch per GPU, or is the single batch split across the GPUs?

Because if the effective batch size is 1, wouldn't we have convergence issues?

Kev1MSL avatar May 31 '24 11:05 Kev1MSL
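On the per-GPU batch question: with PyTorch DDP each process loads its own batch, so batch_size=1 on 8 GPUs already gives an effective batch of 8, and gradient accumulation can raise it further without extra memory. A minimal sketch of why accumulation is equivalent to a larger batch, using a hypothetical NumPy linear model (not InstantMesh's training code), assuming mean-reduced loss:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))      # a "full batch" of 8 samples
y = rng.normal(size=8)
w = rng.normal(size=3)

def grad(Xb, yb):
    """Mean-squared-error gradient of a linear model, mean-reduced over the batch."""
    return 2.0 / len(yb) * Xb.T @ (Xb @ w - yb)

# Full-batch gradient (what batch_size=8 would compute in one step).
g_full = grad(X, y)

# Gradient accumulation: 4 micro-batches of size 2, averaged before the optimizer step.
micro = [grad(X[i:i + 2], y[i:i + 2]) for i in range(0, 8, 2)]
g_accum = np.mean(micro, axis=0)

print(np.allclose(g_full, g_accum))  # True: accumulation matches the full batch
```

So batch_size=1 per GPU need not hurt convergence, as long as gradients are averaged across processes (which DDP does) or across accumulation steps.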

@Kev1MSL Hello, I encountered several problems during training. The structure of my dataset is as shown in the screenshots, but I cannot get my training config written correctly. I would like to ask for your help; thank you very much for your reply.

[WeChat screenshots of the dataset structure]

Mrguanglei avatar May 31 '24 13:05 Mrguanglei

@Kev1MSL Hello, I am trying to run the training process, but I don't know how to structure the dataset. Could I have a look at your dataset structure? Thank you very much for your reply.

throb081 avatar Jun 04 '24 07:06 throb081

@Kev1MSL Hello, may I ask whether you made any changes to the code? I am training the model on an A100 GPU and cannot train even with batch_size=1. [screenshot of the error]

fffh1 avatar Jul 05 '24 11:07 fffh1
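When the shortfall is only a couple of GB, PyTorch's caching-allocator settings are sometimes enough to squeeze under the limit. This is a generic PyTorch knob rather than anything InstantMesh-specific, and the training invocation below is only a placeholder:

```shell
# Reduce CUDA memory fragmentation in PyTorch's caching allocator
# (available in recent PyTorch releases).
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True

# Placeholder for the actual training command.
# python train.py --base <your-training-config>.yaml

echo "$PYTORCH_CUDA_ALLOC_CONF"
```

Fragmentation shows up in the OOM message as a large "reserved but unallocated" figure; if that number is small, the model genuinely does not fit and you need gradient accumulation, mixed precision, or a smaller rendering resolution instead.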

@fffh1 Hello, did you solve it? I am hitting the same problem.

ustbzgn avatar Jul 13 '24 08:07 ustbzgn

Hi, check your depth image dimension: it should be a single channel rather than RGB or RGBA. Regards, Feng



fffh1 avatar Jul 13 '24 10:07 fffh1
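Following up on the depth-channel point: a quick way to sanity-check and fix a depth map that was accidentally saved as RGB/RGBA is to keep only one channel before it enters the dataloader. A hedged sketch with NumPy (`to_single_channel` is a hypothetical helper, not part of the InstantMesh codebase, and it assumes the depth value was replicated across the color channels):

```python
import numpy as np

def to_single_channel(depth):
    """Collapse an (H, W, 3) or (H, W, 4) depth image to (H, W).

    Assumes depth was duplicated across the color channels,
    so taking the first channel loses no information.
    """
    if depth.ndim == 3:
        return depth[..., 0]
    return depth  # already single-channel

# Simulate a 3x4 depth map mistakenly saved as RGBA.
rgba_depth = np.repeat(np.arange(12.0).reshape(3, 4, 1), 4, axis=2)
fixed = to_single_channel(rgba_depth)
print(fixed.shape)  # (3, 4)
```

A 4-channel depth map also quadruples the memory that tensor occupies, which may be why it surfaces as an OOM rather than a shape error.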

Thank you very much!


ustbzgn avatar Jul 14 '24 14:07 ustbzgn

> Yes, same thing: when I set batch_size=1 it works, but batch_size=2 does not. However, I am only short a few GB (~2 GB), so I was wondering if there is a way to optimize this. Also, what happens if I distribute training across multiple GPUs with batch_size=1: is it one batch per GPU, or is the single batch split across the GPUs?
>
> Because if the effective batch size is 1, wouldn't we have convergence issues?

Hello, may I ask how you fixed this 'CUDA out of memory' issue? I hit the same problem during the validation stage and am not sure how to solve it.

Jinyiyi3 avatar Mar 27 '25 05:03 Jinyiyi3