
Debug possible memory leaks

Open · danpovey opened this issue on Jan 17, 2021 · 19 comments

We need to debug possible memory leaks in k2, in master and also the arc_scores branch from my repo (see a PR on k2 master). (I'm pretty sure the arc_scores branch leaks; not 100% sure about master.)

(1) Monitor memory usage via nvidia-smi and/or by adding statements to the training script, and verify whether memory seems to increase.

(2) Look for available diagnostics from torch; I believe it can print out info about blocks allocated.

or

(3) [somewhat an alternative to (2)] Add print statements to the constructor and destructor (or their equivalents) of the Fsa object to check that the same number are constructed and destroyed on each iteration (see the sketch below). If those numbers differ, add similar print statements to the Function objects in autograd.py and see which of them are not destroyed. I suspect it's a version of a previous issue where we had reference cycles between Fsa objects and Function objects used in backprop.
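
For (3), a minimal sketch of this kind of check that can be done from the training script without touching the Fsa class (the loop wiring below is hypothetical): count how many Fsa objects are alive every N batches via the gc module; if the count grows without bound, something is holding references.

import gc

def count_live(cls_name):
    # Force a collection first so objects kept alive only by reference
    # cycles are actually freed before we count.
    gc.collect()
    return sum(1 for obj in gc.get_objects()
               if type(obj).__name__ == cls_name)

# Hypothetical use inside the training loop, e.g. every 10 batches:
# if batch_idx % 10 == 0:
#     print(batch_idx, 'live Fsa objects:', count_live('Fsa'))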

danpovey avatar Jan 17 '21 05:01 danpovey

BTW, these leaks may actually be in the decoding script. We should probably check both train and decode (i.e., that memory usage doesn't systematically increase with iteration).

danpovey avatar Jan 17 '21 08:01 danpovey

I am running mmi_bigram_train.py. There is no OOM after processing 1850 batches; nvidia-smi shows about 19785 MB of GPU memory in use.

csukuangfj avatar Jan 17 '21 08:01 csukuangfj

The key thing is whether it systematically increases over time.

torch.cuda.get_device_properties(0).total_memory may help

danpovey avatar Jan 17 '21 09:01 danpovey

torch.cuda.get_device_properties(0).total_memory returns a property of the device, namely its total capacity, which is a constant.


From https://pytorch.org/docs/stable/cuda.html#torch.cuda.memory_allocated, I am using

torch.cuda.memory_allocated(0) / 1024.  # KB
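
For reference, a sketch of how such a measurement can be wired into a training loop to also show the per-batch delta (the loop structure and names below are placeholders, not the actual script):

import torch

prev_kb = 0.0
num_batches = 100  # placeholder
for batch_idx in range(num_batches):
    # ... forward / backward / optimizer step for this batch goes here ...
    if batch_idx % 10 == 0:
        kb = torch.cuda.memory_allocated(0) / 1024.  # KB
        print(f'batch {batch_idx}: allocated {kb:.1f} KB '
              f'(delta {kb - prev_kb:+.1f} KB)')
        prev_kb = kb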

The result is shown below. The allocated memory increases monotonically, by about 500 KB every 10 batches.


The arc_scores branch is used.

(Screenshot, 2021-01-17: allocated memory in KB, reported every 10 batches.)

csukuangfj avatar Jan 17 '21 10:01 csukuangfj

The figure matches the "Allocated memory" entry in the output of torch.cuda.memory_summary(0).

csukuangfj avatar Jan 17 '21 10:01 csukuangfj

OK. Try adding print statements to the Fsa object's init/destructor (there is a way, I forget) to see if they are properly released.

danpovey avatar Jan 17 '21 10:01 danpovey

def __del__(self):
  print('inside Fsa destructor')

I think the above destructor will work.
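
One caveat when reading the printouts: if an Fsa is caught in a reference cycle, its __del__ only runs once Python's cyclic garbage collector gets to it, so it can help to force a collection at the point where constructions/destructions are compared. A small sketch:

import gc
import torch

gc.collect()               # free any Fsa objects kept alive only by reference cycles
torch.cuda.empty_cache()   # optional: release cached blocks back to the driver
                           # (affects nvidia-smi, not torch.cuda.memory_allocated)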

csukuangfj avatar Jan 17 '21 10:01 csukuangfj

I confirm that den and num are freed as their destructors are called. I print id(self) in the destructor and the output matches id(den).

csukuangfj avatar Jan 17 '21 11:01 csukuangfj

Look at torch.cuda.memory_summary to see how many memory regions are allocated; that might give us a hint.

danpovey avatar Jan 17 '21 11:01 danpovey

... I mean the delta per minibatch.
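
A sketch of logging that per-minibatch delta (the stat key names come from torch.cuda.memory_stats and may differ slightly across PyTorch versions):

import torch

def region_counts(device=0):
    stats = torch.cuda.memory_stats(device)
    # Number of live allocations and of cached memory segments.
    return {'allocations': stats['allocation.all.current'],
            'segments': stats['segment.all.current']}

prev = region_counts()
# Hypothetical use, once per minibatch:
# cur = region_counts()
# print({k: cur[k] - prev[k] for k in cur})
# prev = cur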

danpovey avatar Jan 17 '21 11:01 danpovey

The arc_scores branch is able to train and decode without OOM after 10 epochs:

2021-01-18 14:56:43,771 INFO [mmi_bigram_decode.py:296] %WER 10.45% [5493 / 52576, 801 ins, 487 del, 4205 sub ]

csukuangfj avatar Jan 18 '21 07:01 csukuangfj

Great!!

danpovey avatar Jan 18 '21 08:01 danpovey

When I train aishell1 with mmi_att_transformer_train.py, the allocated memory increases gradually, and I get an OOM after 2 epochs. (Screenshot: https://user-images.githubusercontent.com/8347017/109487121-c0aefb00-7abe-11eb-9512-42128021f3bb.png)

eeewhe avatar Mar 01 '21 10:03 eeewhe

Which script are you running?

danpovey avatar Mar 01 '21 11:03 danpovey

Also let us know how recent your k2, lhotse and snowfall versions are; e.g. the date of the last commit, or possibly the k2 release number, would help. At some point we had some memory leaks in k2.
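
For reference, a quick way to dump installed versions (assuming the packages were installed with pip; importlib.metadata reads the package metadata, so it works even if a package doesn't expose __version__):

from importlib.metadata import version, PackageNotFoundError

for pkg in ('k2', 'lhotse', 'snowfall', 'torch'):
    try:
        print(pkg, version(pkg))
    except PackageNotFoundError:
        print(pkg, 'not installed via pip (e.g. run from a source checkout)')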

danpovey avatar Mar 01 '21 11:03 danpovey

I updated the three projects last week.

eeewhe avatar Mar 01 '21 11:03 eeewhe

And I modified mmi_att_transformer_train.py to fit aishell1.

eeewhe avatar Mar 01 '21 11:03 eeewhe

Please show your changes via a PR.

danpovey avatar Mar 01 '21 13:03 danpovey

The PR #114

eeewhe avatar Mar 02 '21 02:03 eeewhe