snowfall icon indicating copy to clipboard operation
snowfall copied to clipboard

Memory blowup

Open danpovey opened this issue 5 years ago • 8 comments

Right now we are dealing with an issue in train.py where it uses more and more memory. It seems like stuff isn't getting freed that should be getting freed.

danpovey avatar Nov 13 '20 13:11 danpovey

This seems to be some kind of circular reference between the Python Fsa object and the function object for get_tot_scores (or its ctx), whereby an Fsa and the most recent _GetTotScoresFunction used on it are not deleted. Still debugging. Can be reproduced just in get_tot_scores_test.py.

danpovey avatar Nov 14 '20 07:11 danpovey

It's strange that Fsa._grad_cache is not kept in the leaked memory?

(Pdb) leftover[2].byrcs[6].referrers.byrcs[0].referents
Partition of a set of 4 objects. Total size = 900 bytes.
 Index  Count   %     Size   % Cumulative  % Kind (class / dict of class)
     0      2  50      816  91       816  91 collections.OrderedDict
     1      1  25       56   6       872  97 _k2.RaggedArc
     2      1  25       28   3       900 100 int
(Pdb) leftover[2].byrcs[6].referrers.byrcs[0].referents.byvia
Partition of a set of 4 objects. Total size = 900 bytes.
 Index  Count   %     Size   % Cumulative  % Referred Via:
     0      1  25      424  47       424  47 "['_tensor_attr']"
     1      1  25      392  44       816  91 "['_non_tensor_attr']"
     2      1  25       56   6       872  97 "['arcs']"
     3      1  25       28   3       900 100 "['_properties']"

qindazhu avatar Nov 14 '20 07:11 qindazhu

It maybe just isn't being printed for some reason. It seems the problem is that the Fsa has the attribute e.g. 'tot_scores_tropical' which has a grad_fn _GetTotScoresFunctionBackward, which has a reference to the Fsa in its ctx.

danpovey avatar Nov 14 '20 07:11 danpovey

So what if we invoke del ctx.fsa inside the backward function to break the circular reference chain?

csukuangfj avatar Nov 14 '20 08:11 csukuangfj

That's an interesting idea but I don't like the solution because it will cause a leak if someone doesn't end up calling backward, e.g. because of a problem that required abandoning the minibatch.

danpovey avatar Nov 14 '20 08:11 danpovey

I am leaning towards, in the short term, just not having the FSA cache the total scores. Also, how do you guys think about renaming update_xxx to get_xxx? It seems to me that they are not just updating them, they are also returning them, so get is a better name.

danpovey avatar Nov 14 '20 08:11 danpovey

Renaming is fine with me.

csukuangfj avatar Nov 14 '20 08:11 csukuangfj

I am working on this.

danpovey avatar Nov 14 '20 08:11 danpovey