
Debug possible memory leaks

Open · danpovey opened this issue on Jan 17, 2021 · 19 comments

We need to debug possible memory leaks in k2, in master and also the arc_scores branch from my repo (see a PR on k2 master). (I'm pretty sure the arc_scores branch leaks; not 100% sure about master.)

(1) Monitor memory usage via nvidia-smi and/or by adding statements to the training script, and verify whether memory seems to increase.

(2) Look for available diagnostics from torch; I believe it can print out info about blocks allocated.

or

(3) [somewhat an alternative to (2)] Add print statements to the constructor and destructor (or their equivalents) of the Fsa object to check that the same number are constructed and destroyed on each iteration (see the sketch below). If those numbers differ, add similar print statements to the Function objects in autograd.py and see which of them are not destroyed. I suspect it's a version of a previous issue where we had reference cycles between Fsa objects and Function objects used in backprop.
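
For (3), a minimal sketch of this kind of check that can be done from the training script without touching the Fsa class (the loop wiring below is hypothetical): count how many Fsa objects are alive every N batches via the gc module; if the count grows without bound, something is holding references.

import gc

def count_live(cls_name):
    # Force a collection first so objects kept alive only by reference
    # cycles are actually freed before we count.
    gc.collect()
    return sum(1 for obj in gc.get_objects()
               if type(obj).__name__ == cls_name)

# Hypothetical use inside the training loop, e.g. every 10 batches:
# if batch_idx % 10 == 0:
#     print(batch_idx, 'live Fsa objects:', count_live('Fsa'))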

danpovey avatar Jan 17 '21 05:01 danpovey

BTW, these leaks may actually be in the decoding script. We should probably check both train and decode (i.e., that memory usage doesn't systematically increase with iteration).

danpovey avatar Jan 17 '21 08:01 danpovey

I am running mmi_bigram_train.py. There is no OOM after processing 1850 batches; nvidia-smi shows about 19785 MB of GPU memory in use.

csukuangfj avatar Jan 17 '21 08:01 csukuangfj

The key thing is whether it systematically increases over time.

torch.cuda.get_device_properties(0).total_memory may help

danpovey avatar Jan 17 '21 09:01 danpovey

torch.cuda.get_device_properties(0).total_memory returns a property of the device, namely its total capacity, which is a constant.


From https://pytorch.org/docs/stable/cuda.html#torch.cuda.memory_allocated, I am using

torch.cuda.memory_allocated(0) / 1024.  # KB
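
For reference, a sketch of how such a measurement can be wired into a training loop to also show the per-batch delta (the loop structure and names below are placeholders, not the actual script):

import torch

prev_kb = 0.0
num_batches = 100  # placeholder
for batch_idx in range(num_batches):
    # ... forward / backward / optimizer step for this batch goes here ...
    if batch_idx % 10 == 0:
        kb = torch.cuda.memory_allocated(0) / 1024.  # KB
        print(f'batch {batch_idx}: allocated {kb:.1f} KB '
              f'(delta {kb - prev_kb:+.1f} KB)')
        prev_kb = kb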

The result is shown below. The allocated memory increases monotonically, by about 500 KB every 10 batches.


The arc_scores branch is used.

(Screenshot, 2021-01-17: allocated memory in KB, reported every 10 batches.)

csukuangfj avatar Jan 17 '21 10:01 csukuangfj

The figure matches the "Allocated memory" entry in the output of torch.cuda.memory_summary(0).

csukuangfj avatar Jan 17 '21 10:01 csukuangfj

OK. Try adding print statements to the Fsa object's init/destructor (there is a way, I forget) to see if they are properly released.

danpovey avatar Jan 17 '21 10:01 danpovey

def __del__(self):
  print('inside Fsa destructor')

I think the above destructor will work.
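
One caveat when reading the printouts: if an Fsa is caught in a reference cycle, its __del__ only runs once Python's cyclic garbage collector gets to it, so it can help to force a collection at the point where constructions/destructions are compared. A small sketch:

import gc
import torch

gc.collect()               # free any Fsa objects kept alive only by reference cycles
torch.cuda.empty_cache()   # optional: release cached blocks back to the driver
                           # (affects nvidia-smi, not torch.cuda.memory_allocated)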

csukuangfj avatar Jan 17 '21 10:01 csukuangfj

I confirm that den and num are freed as their destructors are called. I print id(self) in the destructor and the output matches id(den).

csukuangfj avatar Jan 17 '21 11:01 csukuangfj

Look at torch.cuda.memory_summary to see how many memory regions are allocated; that might give us a hint.

danpovey avatar Jan 17 '21 11:01 danpovey

... I mean the delta per minibatch.
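
A sketch of logging that per-minibatch delta (the stat key names come from torch.cuda.memory_stats and may differ slightly across PyTorch versions):

import torch

def region_counts(device=0):
    stats = torch.cuda.memory_stats(device)
    # Number of live allocations and of cached memory segments.
    return {'allocations': stats['allocation.all.current'],
            'segments': stats['segment.all.current']}

prev = region_counts()
# Hypothetical use, once per minibatch:
# cur = region_counts()
# print({k: cur[k] - prev[k] for k in cur})
# prev = cur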

danpovey avatar Jan 17 '21 11:01 danpovey

The arc_scores branch is able to train and decode without OOM after 10 epochs:

2021-01-18 14:56:43,771 INFO [mmi_bigram_decode.py:296] %WER 10.45% [5493 / 52576, 801 ins, 487 del, 4205 sub ]

csukuangfj avatar Jan 18 '21 07:01 csukuangfj

Great!!

danpovey avatar Jan 18 '21 08:01 danpovey

When I train aishell1 with mmi_att_transformer_train.py, the allocated memory increases gradually, and I get an OOM after 2 epochs. (Screenshot: https://user-images.githubusercontent.com/8347017/109487121-c0aefb00-7abe-11eb-9512-42128021f3bb.png)

eeewhe avatar Mar 01 '21 10:03 eeewhe

Which script are you running?

danpovey avatar Mar 01 '21 11:03 danpovey

Also let us know how recent your k2, lhotse and snowfall versions are; e.g. the date of the last commit, or possibly the k2 release number, would help. At some point we had some memory leaks in k2.
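
For reference, a quick way to dump installed versions (assuming the packages were installed with pip; importlib.metadata reads the package metadata, so it works even if a package doesn't expose __version__):

from importlib.metadata import version, PackageNotFoundError

for pkg in ('k2', 'lhotse', 'snowfall', 'torch'):
    try:
        print(pkg, version(pkg))
    except PackageNotFoundError:
        print(pkg, 'not installed via pip (e.g. run from a source checkout)')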

danpovey avatar Mar 01 '21 11:03 danpovey

I updated the three projects last week.

eeewhe avatar Mar 01 '21 11:03 eeewhe

And I modified mmi_att_transformer_train.py to fit aishell1.

eeewhe avatar Mar 01 '21 11:03 eeewhe

Please show your changes via a PR.

danpovey avatar Mar 01 '21 13:03 danpovey

The PR #114

eeewhe avatar Mar 02 '21 02:03 eeewhe