
RuntimeError: CUDA out of memory after training 1 epoch

Open balag59 opened this issue 4 years ago • 10 comments

@mattiadg I'm currently training on a very large dataset with 4 GPUs, and I get a CUDA out of memory error after the completion of one training epoch. Training itself runs fine; the error occurs when validation starts. Here is the exact message: `Tried to allocate 7.93 GiB (GPU 2; 22.38 GiB total capacity; 11.55 GiB already allocated; 3.53 GiB free; 6.75 GiB cached)`. Is this a memory leak? Is there an issue with emptying the cache, or do I just need to reduce the batch size / max tokens? (I already tried halving the batch size and the same error occurs.) Thanks!
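For reference, the figures in that error message can be checked against each other. A minimal sketch (the numbers are copied from the message above; the interpretation of each field follows PyTorch's caching-allocator reporting):

```python
# Figures from the OOM message above, all in GiB.
total = 22.38      # GPU 2 total capacity
allocated = 11.55  # memory held by live tensors
cached = 6.75      # memory held by PyTorch's caching allocator (reusable)
free = 3.53        # memory the driver still has available
requested = 7.93   # size of the single allocation that failed

# The request exceeds the free pool on its own...
print(requested > free)                 # True
# ...but not free plus the reusable cache, so the failure points at one
# oversized allocation (e.g. a very long batch) rather than exhaustion.
print(round(free + cached, 2))          # 10.28
print(requested < free + cached)        # True
```

In other words, a single 7.93 GiB request failing while roughly 10 GiB is nominally reclaimable is more consistent with one oversized batch or sample than with a slow leak, which matches where the thread below ends up.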

balag59 avatar Jun 10 '20 22:06 balag59

Hi, try using the validation set for both training and validation. Do you get the same error during training this way?


mattiadg avatar Jun 10 '20 22:06 mattiadg


@mattiadg I haven't tried using the same set yet. Training runs fine for the entire epoch. The issue begins only after one epoch of training completes and validation starts, which leads me to believe that memory is not being released correctly (though I'm not sure).

balag59 avatar Jun 11 '20 04:06 balag59

I think it is possible that the validation set contains samples that are too large. Let's rule out the easy explanations before considering a memory leak, which is harder to detect. I have trained on datasets with a few million samples and never had this problem.
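One easy check along these lines is to scan the validation manifest for outlier segment lengths before suspecting a leak. A framework-free sketch (function name and threshold are illustrative, not part of FBK-Fairseq-ST):

```python
def find_oversized(sample_lengths, max_frames):
    """Return indices of samples whose source length exceeds max_frames,
    i.e. the ones most likely to blow up GPU memory at validation time."""
    return [i for i, n in enumerate(sample_lengths) if n > max_frames]

# Toy data: frames per validation segment; one pathological outlier.
lengths = [800, 1200, 45000, 950]
print(find_oversized(lengths, max_frames=6000))  # -> [2]
```

If this turns up a handful of extreme segments, filtering or splitting them is usually cheaper than debugging allocator behavior.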


mattiadg avatar Jun 11 '20 08:06 mattiadg


Thanks! I restored the batch size to 512 but reduced max-tokens from 12k to 6k, and it seems to be working fine now. Does the max-tokens parameter affect time to convergence or performance (if it affects them at all)?

balag59 avatar Jun 11 '20 09:06 balag59

--max-tokens sets a maximum length for the (source) segments. Segments longer than this value are removed from the sets. A lower value means fewer and shorter samples, so it speeds up an epoch a bit. I have never noticed significant differences in convergence.


mattiadg avatar Jun 11 '20 09:06 mattiadg


Thank you so much! This helps!

balag59 avatar Jun 11 '20 10:06 balag59


I'm sorry, but doesn't max-tokens stand for the maximum number of audio frames that can be loaded onto a single GPU in each iteration? I thought it did.

balag59 avatar Jun 11 '20 10:06 balag59

Oops, yes, my mistake. Sorry, I haven't used this code in a while. The problem is that it can load more segments than are actually used in a single iteration: if it loads more segments than --max-sentences allows, the extras just occupy GPU memory, so setting max-tokens too high wastes memory.
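To illustrate the interaction described above, here is a rough, framework-free sketch of token-capped batching in the fairseq style, where `max_tokens` caps the total source frames per batch and `max_sentences` caps the number of segments. Names and simplifications (no padding accounting) are illustrative, not the actual fairseq implementation:

```python
def make_batches(lengths, max_tokens, max_sentences=None):
    """Group sample lengths into batches holding at most max_tokens total
    frames and at most max_sentences samples each. Padding to the longest
    sample in a batch is ignored for simplicity."""
    batches, current, current_tokens = [], [], 0
    for length in lengths:
        over_tokens = current_tokens + length > max_tokens
        over_sents = max_sentences is not None and len(current) >= max_sentences
        if current and (over_tokens or over_sents):
            batches.append(current)
            current, current_tokens = [], 0
        current.append(length)
        current_tokens += length
    if current:
        batches.append(current)
    return batches

# With max_tokens=6000, 2000-frame segments pack three per batch; halving
# max_tokens would roughly halve the peak activation memory per step.
print(make_batches([2000, 2000, 2000, 2000], max_tokens=6000))
# -> [[2000, 2000, 2000], [2000]]
```

The point of the exchange above is visible here: `max_tokens` controls how much data is grouped per iteration, so memory loaded beyond what `max_sentences` admits into a step is simply held without being used.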


mattiadg avatar Jun 11 '20 12:06 mattiadg


Thanks, that makes sense! Speaking of the code, is there any chance you will release the code from your latest paper, with improvements like knowledge distillation?

balag59 avatar Jun 11 '20 12:06 balag59

@mattiadg Any updates on the possibility of releasing code from the latest paper?

balag59 avatar Jun 20 '20 21:06 balag59