GroupViT

As the training progresses, the memory required increases

Open QianyiLiu22 opened this issue 4 years ago • 6 comments

Hi, thanks for your great work. During training, I found that memory usage gradually increased until the machine ran out of memory. My server has 252 GB of memory. The training dataset is gcc12m and the validation dataset is VOC2012. The traceback message is: RuntimeError: DataLoader worker (pid xxxxx) is killed by signal: Killed. Could you please help me solve this?

QianyiLiu22, Apr 04 '22 11:04

Hi @acver1, I haven't encountered this issue before. Maybe you could try reducing the number of workers to see if the error still occurs.
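When comparing runs with different worker counts, it can help to log memory every few hundred iterations and look at the trend. The following is only a minimal sketch using the Python standard library: `peak_rss_mib` and `growth_per_step` are hypothetical helper names, not part of GroupViT, and the measurement covers only the process that calls it. DataLoader workers are separate processes, so the same call would have to be made inside them to see their usage.

```python
import resource
import sys


def peak_rss_mib() -> float:
    """Return the peak resident set size of the current process in MiB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


def growth_per_step(samples):
    """Least-squares slope of (step, MiB) samples, in MiB per step."""
    n = len(samples)
    if n < 2:
        return 0.0
    mean_x = sum(step for step, _ in samples) / n
    mean_y = sum(mib for _, mib in samples) / n
    var_x = sum((step - mean_x) ** 2 for step, _ in samples)
    if var_x == 0:
        return 0.0
    cov = sum((step - mean_x) * (mib - mean_y) for step, mib in samples)
    return cov / var_x


# Hypothetical usage inside a training loop (loader and train_step are
# placeholders for whatever the training script defines):
# samples = []
# for step, batch in enumerate(loader):
#     train_step(batch)
#     if step % 500 == 0:
#         samples.append((step, peak_rss_mib()))
#         print(step, samples[-1][1], growth_per_step(samples))
```

A slope that stays clearly positive over thousands of iterations would point to steady growth rather than a one-time allocation.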

xvjiarui, Apr 05 '22 01:04

I'm encountering the same issue. Training was killed by a CPU OOM after ~7000 iterations. Our machine has 512 GB of memory. @acver1 Were you able to solve this issue?

Thanks!

ZeWang95, May 02 '22 07:05

Hello @ZeWang95! I haven't solved it yet. Someone told me that it might be caused by the version of WebDataset, but I haven't tried changing it yet.

QianyiLiu22, May 02 '22 07:05

I was thinking the same and am trying webdataset==0.2.5. It looks like a lot of work is needed to support the latest version of webdataset. @xvjiarui Would it be possible to add support for webdataset==0.2.5? That would be super helpful.

Thanks!

ZeWang95, May 02 '22 07:05

Hi @ZeWang95, webdataset 0.1 and 0.2 may not be compatible in some APIs, and we haven't verified the accuracy with webdataset 0.2.
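Since the two series may behave differently, it can be worth confirming which one is actually installed before debugging further. Below is a small sketch using only the standard library. The helper names `major_minor` and `describe_webdataset` are hypothetical, and treating the 0.1 series as the expected one is an assumption drawn from this comment, not a documented requirement.

```python
import re
from importlib.metadata import PackageNotFoundError, version


def major_minor(version_string):
    """Extract (major, minor) from a version string such as '0.2.5'."""
    match = re.match(r"(\d+)\.(\d+)", version_string)
    if match is None:
        raise ValueError(f"unrecognized version string: {version_string!r}")
    return int(match.group(1)), int(match.group(2))


def describe_webdataset(expected=(0, 1)):
    """Report whether the installed webdataset is in the expected series."""
    try:
        installed = version("webdataset")
    except PackageNotFoundError:
        return "webdataset is not installed"
    series = f"{expected[0]}.{expected[1]}"
    if major_minor(installed) == expected:
        return f"webdataset {installed} is in the expected {series} series"
    # Assumption: anything outside the expected series is unverified.
    return f"webdataset {installed} is outside the expected {series} series"


if __name__ == "__main__":
    print(describe_webdataset())
```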

xvjiarui, May 02 '22 16:05

@ZeWang95 Have you tested webdataset==0.2.5? Or could you share some suggestions for this work? Thanks a lot!

slyviacassell, Jun 15 '22 10:06