mmagic
Memory Leak in BasicVSR++ (?)
Using distributed training on an 8xV100 machine with all configs from the REDS dataset and samples_per_gpu set to two, the training crashed after four hours because it ran out of memory. As you can see in the image, it uses around 100GB most of the time and then slowly creeps up to 420GB until one of the workers crashes with an out-of-memory error.
Hello, I have never encountered this problem. Could you tell us more about the command you used for training?
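If it helps narrow this down, here is a minimal sketch for logging each process's resident memory during training, so the growth can be tied to a specific worker and iteration. It assumes Linux (it reads /proc) and uses only the standard library. The helper names (parse_rss_kib, current_rss_gib, log_rss) are hypothetical and not part of mmagic, and where you call log_rss from is up to you.

```python
import os
import re


def parse_rss_kib(status_text):
    """Return the VmRSS value in KiB from /proc/<pid>/status text, or None."""
    match = re.search(r"^VmRSS:\s+(\d+)\s+kB", status_text, flags=re.MULTILINE)
    return int(match.group(1)) if match else None


def current_rss_gib(pid=None):
    """Read the resident set size of a process in GiB (Linux only)."""
    pid = os.getpid() if pid is None else pid
    with open(f"/proc/{pid}/status") as f:
        rss_kib = parse_rss_kib(f.read())
    return None if rss_kib is None else rss_kib / (1024 ** 2)


def log_rss(iteration, every=100):
    """Print this process's RSS every `every` iterations (hypothetical helper)."""
    if iteration % every == 0:
        print(f"pid={os.getpid()} iter={iteration} rss_gib={current_rss_gib()}")
```

Calling log_rss(i) once per training iteration in each process would show whether memory grows steadily in one worker or in all of them.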
Closing this Issue due to no more feedback. Please feel free to reopen it if needed.