Debug possible memory leaks
We need to debug possible memory leaks in k2, in both master and the arc_scores branch from my repo (see a PR on k2 master). (I'm pretty sure the arc_scores branch leaks; not 100% sure about master.)
(1) Monitor memory usage from nvidia-smi and/or by adding statements to the training script, and verify whether memory seems to increase.
(2) Look for available diagnostics from torch; I believe it can print out info about blocks allocated (see the sketch after this list).
or
(3) [somewhat an alternative to (2)] Add print statements to the constructor and destructor (or their equivalents) of the Fsa object to check that the same number are constructed and destroyed on each iteration. If those are not the same: add similar print statements to the Function objects in autograd.py and see which of them are not destroyed. I suspect it's a version of a previous issue where we had reference cycles between Fsa objects and the Function objects used in backprop.
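For (1) and (2), a minimal sketch of the kind of per-batch logging meant here; the helper name and where it gets called from are illustrative, not part of the existing scripts:

import torch

def log_gpu_memory(batch_idx: int, device: int = 0, every: int = 10) -> None:
    """Hypothetical helper: print GPU memory stats every `every` batches.

    If the allocated figure grows steadily across batches rather than
    fluctuating around a plateau, that suggests a leak.
    """
    if batch_idx % every != 0:
        return
    allocated = torch.cuda.memory_allocated(device) / 1024 ** 2  # MB
    reserved = torch.cuda.memory_reserved(device) / 1024 ** 2    # MB
    total = torch.cuda.get_device_properties(device).total_memory / 1024 ** 2
    print(f'batch {batch_idx}: allocated {allocated:.1f} MB, '
          f'reserved {reserved:.1f} MB, total {total:.1f} MB')

Calling this once per minibatch in the training loop shows the same trend as watching nvidia-smi, except that it only counts memory held by PyTorch's caching allocator.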
BTW, these leaks may actually be in the decoding script. We should probably check both train and decode (i.e. that memory usage doesn't systematically increase with iteration).
I am running mmi_bigram_train.py. There is no OOM after processing 1850 batches. nvidia-smi shows the used GPU memory is about 19785 MB.
The key thing is whether it systematically increases over time.
torch.cuda.get_device_properties(0).total_memory may help
torch.cuda.get_device_properties(0).total_memory returns a property of the device, i.e. its total capacity, which is a constant.
From https://pytorch.org/docs/stable/cuda.html#torch.cuda.memory_allocated, I am using
torch.cuda.memory_allocated(0) / 1024. # KB
The result is shown below. The allocated memory increases monotonically, by about 500 KB every 10 batches.
The arc_scores branch is used.

[figure: allocated memory (KB) vs. batch count]

The figure matches the "Allocated memory" value reported by torch.cuda.memory_summary(0).
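For reference, a minimal sketch of that cross-check; report_cuda_memory is an illustrative name, not a function from the training script:

import torch

def report_cuda_memory(device: int = 0) -> None:
    # The scalar below should match the "Allocated memory" row in the
    # table printed by torch.cuda.memory_summary().
    allocated_kb = torch.cuda.memory_allocated(device) / 1024.0
    print(f'allocated: {allocated_kb:.1f} KB')
    print(torch.cuda.memory_summary(device=device, abbreviated=True))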
OK. Try adding print statements to the Fsa object's init/destructor (there is a way, I forget the details) to see if they are properly released.
def __del__(self):
    print('inside Fsa destructor')
I think the above destructor will work.
I confirm that den and num are freed, as their destructors are called. I printed id(self) in the destructor and the output matches id(den).
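For completeness, the same check can be done without editing k2's source; this is only a sketch, and it assumes k2.Fsa is a plain Python class that does not already define __del__:

import k2

def _fsa_del(self):
    # Runs when the Fsa is garbage-collected; if this is not printed once
    # for every Fsa constructed in an iteration, something (e.g. a reference
    # cycle with an autograd Function object) is keeping the object alive.
    print(f'inside Fsa destructor, id={id(self)}')

k2.Fsa.__del__ = _fsa_del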
Look at torch.cuda.memory_summary to see how many memory regions are allocated; that might give us a hint.
.. I mean the delta per minibatch.
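A sketch of what measuring the per-minibatch delta could look like; delta_report is a hypothetical helper and the stat keys are the ones documented for torch.cuda.memory_stats:

import torch

_prev_stats: dict = {}

def delta_report(device: int = 0) -> None:
    """Print how selected CUDA allocator stats changed since the last call,
    i.e. the per-minibatch delta if this is called once per batch."""
    global _prev_stats
    stats = torch.cuda.memory_stats(device)
    keys = ['allocated_bytes.all.current',  # bytes currently allocated
            'active.all.current',           # active memory blocks
            'segment.all.current']          # segments obtained from cudaMalloc
    for key in keys:
        cur = stats.get(key, 0)
        prev = _prev_stats.get(key, cur)
        print(f'{key}: {cur} (delta {cur - prev:+d})')
    _prev_stats = dict(stats)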
The arc_scores branch is able to train and decode without OOM after 10 epochs:
2021-01-18 14:56:43,771 INFO [mmi_bigram_decode.py:296] %WER 10.45% [5493 / 52576, 801 ins, 487 del, 4205 sub ]
Great!!
When I train aishell1 with mmi_att_transformer_train.py, the allocated memory increases gradually, and I get an OOM after 2 epochs.

[image: https://user-images.githubusercontent.com/8347017/109487121-c0aefb00-7abe-11eb-9512-42128021f3bb.png]
Which script are you running?
Also let us know how recent your k2, lhotse and snowfall versions are; e.g. the date of the last commit, or possibly the k2 release number, would help. At some point we had some memory leaks in k2.
I updated the three projects last week.
And I modified mmi_att_transformer_train.py to fit aishell1.
Please show your changes via a PR.
The PR is #114.