As the training progresses, the memory required increases
Hi, thanks for your great work. During training, I found that memory usage gradually increased until the process ran out of memory. My server has 252 GB of memory. The training dataset is gcc12m and the validation dataset is VOC2012. The traceback message is: RuntimeError: DataLoader worker (pid xxxxx) is killed by signal: Killed. Could you please help me solve it?
Hi @acver1, I haven't encountered this issue before. You could try reducing the number of DataLoader workers to see whether the error persists.
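To check whether host memory really grows steadily while experimenting with the worker count, it can help to log the resident memory of the training process every few hundred iterations. The following is only a minimal sketch, not part of this repository: the helper names `current_rss_mib` and `log_memory` are hypothetical, it reads `/proc/self/statm` and therefore assumes Linux, and it measures only the process that calls it. With the worker count set to 0, PyTorch loads data in the main process, so the number then covers data loading as well.

```python
import os


def current_rss_mib() -> float:
    """Return the current resident set size of this process in MiB (Linux only)."""
    # The second field of /proc/self/statm is the number of resident pages.
    with open("/proc/self/statm") as f:
        resident_pages = int(f.read().split()[1])
    return resident_pages * os.sysconf("SC_PAGE_SIZE") / (1024 ** 2)


def log_memory(iteration: int, every: int = 500) -> None:
    """Print the current RSS once every `every` iterations."""
    if iteration % every == 0:
        print(f"iter {iteration}: RSS = {current_rss_mib():.1f} MiB", flush=True)
```

Calling `log_memory(step)` once per step inside the training loop should show whether the value climbs without bound well before the ~7000-iteration mark mentioned in this thread.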
I'm encountering the same issue. Training was killed because of a CPU OOM after ~7000 iterations. Our machine has 512 GB of memory. @acver1 Were you able to solve this issue?
Thanks!
Hello @ZeWang95! I haven't solved it yet. Someone else told me that it might be caused by the version of WebDataset, but I haven't tried that yet.
I was thinking the same and am trying webdataset==0.2.5. It looks like a lot of work is needed to support the latest version of webdataset. @xvjiarui Would it be possible to add support for webdataset==0.2.5? That would be super helpful.
Thanks!
Hi @ZeWang95, webdataset 0.1 and 0.2 may not be compatible in some APIs, and we haven't verified the accuracy with webdataset 0.2.
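Since the behaviour may depend on whether the 0.1 or 0.2 series of webdataset is installed, it can be useful to report the exact version when comparing results. The sketch below uses only the standard library; the helper names `installed_version` and `major_minor` are hypothetical and not part of webdataset or this repository.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional, Tuple


def installed_version(package: str) -> Optional[str]:
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def major_minor(version_string: str) -> Tuple[int, int]:
    """Extract (major, minor) from a version string such as '0.2.5'."""
    major, minor = version_string.split(".")[:2]
    return int(major), int(minor)


if __name__ == "__main__":
    found = installed_version("webdataset")
    if found is None:
        print("webdataset is not installed")
    else:
        print(f"webdataset {found}, series {major_minor(found)}")
```

Including this output alongside an out-of-memory report makes it easier to tell whether the problem is tied to one release series.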
@ZeWang95 Have you tested webdataset==0.2.5? Or could you offer some suggestions for this work? Thanks a lot!