snowfall
Memory blowup
Right now we are dealing with an issue in train.py where it uses more and more memory. It seems like stuff isn't getting freed that should be getting freed.
This seems to be some kind of circular reference between the Python Fsa object and the function object for get_tot_scores (or its ctx), whereby an Fsa and the most recent _GetTotScoresFunction used on it are not deleted. Still debugging. Can be reproduced just in get_tot_scores_test.py.
It's strange that Fsa._grad_cache does not show up in the leaked memory.
(Pdb) leftover[2].byrcs[6].referrers.byrcs[0].referents
Partition of a set of 4 objects. Total size = 900 bytes.
 Index  Count   %     Size   % Cumulative  %  Kind (class / dict of class)
     0      2  50      816  91       816  91  collections.OrderedDict
     1      1  25       56   6       872  97  _k2.RaggedArc
     2      1  25       28   3       900 100  int
(Pdb) leftover[2].byrcs[6].referrers.byrcs[0].referents.byvia
Partition of a set of 4 objects. Total size = 900 bytes.
 Index  Count   %     Size   % Cumulative  %  Referred Via:
     0      1  25      424  47       424  47  "['_tensor_attr']"
     1      1  25      392  44       816  91  "['_non_tensor_attr']"
     2      1  25       56   6       872  97  "['arcs']"
     3      1  25       28   3       900 100  "['_properties']"
Maybe it just isn't being printed for some reason. The problem seems to be that the Fsa has an attribute, e.g. 'tot_scores_tropical', whose grad_fn is _GetTotScoresFunctionBackward, and that grad_fn holds a reference to the Fsa in its ctx.
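To make the reference structure concrete, here is a minimal pure-Python sketch of the cycle being described. Ctx, Fsa, and get_tot_scores below are stand-ins I invented for this illustration, not the real k2 classes; the tuple stands in for a tensor whose grad_fn owns the ctx. Note that Python's cycle collector can reclaim this pure-Python version, whereas the real objects evidently survive, presumably because the cycle passes through C++ autograd state.

```python
import gc
import weakref

# Stand-ins for the real k2 objects; only the reference structure matters here.
class Ctx:
    """Plays the role of the ctx saved by _GetTotScoresFunction."""
    pass

class Fsa:
    def __init__(self):
        self.cache = {}  # plays the role of the cached 'tot_scores_tropical'

def get_tot_scores(fsa):
    ctx = Ctx()
    ctx.fsa = fsa                            # ctx -> Fsa
    out = ("tot_scores", ctx)                # stands in for a tensor whose grad_fn owns ctx
    fsa.cache["tot_scores_tropical"] = out   # Fsa -> cached output -> ctx: cycle closed
    return out

fsa = Fsa()
get_tot_scores(fsa)
ref = weakref.ref(fsa)
del fsa
print(ref() is None)   # False: refcounting alone cannot free the cycle
gc.collect()
print(ref() is None)   # True for this pure-Python cycle; the real one apparently
                       # survives because it runs through C++ autograd nodes
```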
So what if we invoke del ctx.fsa inside the backward function to break the circular reference chain?
That's an interesting idea, but I don't like that solution because it will still leak if someone never ends up calling backward, e.g. because of a problem that required abandoning the minibatch.
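Continuing the same stand-in objects (Ctx, Fsa, get_tot_scores are illustrative, not the real k2 code), this sketch shows both sides of the argument: deleting ctx.fsa inside backward does break the cycle, but only if backward actually runs.

```python
import weakref

class Ctx:
    """Stand-in for the ctx held by _GetTotScoresFunctionBackward."""
    pass

class Fsa:
    def __init__(self):
        self.cache = {}  # stands in for the cached 'tot_scores_tropical'

def get_tot_scores(fsa):
    ctx = Ctx()
    ctx.fsa = fsa                            # ctx -> Fsa
    out = ("tot_scores", ctx)                # Fsa -> cached output -> ctx
    fsa.cache["tot_scores_tropical"] = out
    return out

def backward(ctx):
    fsa = ctx.fsa
    del ctx.fsa    # the proposed fix: drop the back-reference once it is used
    return fsa     # real code would compute gradients w.r.t. fsa here

fsa = Fsa()
out = get_tot_scores(fsa)
backward(out[1])       # if this line never runs, the cycle is never broken
ref = weakref.ref(fsa)
del fsa, out
print(ref() is None)   # True: with the cycle broken, refcounting frees the Fsa
```

If the minibatch is abandoned before backward is invoked, del ctx.fsa never executes and the Fsa stays pinned, which is exactly the objection raised above.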
I am leaning towards, in the short term, just not having the Fsa cache the total scores. Also, what do you guys think about renaming update_xxx to get_xxx? It seems to me that these functions don't just update the values, they also return them, so get is a better name.
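A toy sketch of that short-term fix, with a hypothetical Fsa and a placeholder score computation (the real method is in k2): compute the total scores fresh on every call instead of storing them on the object.

```python
# Hypothetical illustration only; the real Fsa and semiring math live in k2.
class Fsa:
    def __init__(self, scores):
        self.scores = scores

    def get_tot_scores(self):
        # No self.tot_scores_tropical attribute is set: the Fsa never ends
        # up holding a result whose grad_fn's ctx points back at it.
        return sum(self.scores)  # toy placeholder for the real total score

fsa = Fsa([1.0, 2.0, 3.5])
print(fsa.get_tot_scores())                 # 6.5
print(hasattr(fsa, "tot_scores_tropical"))  # False: nothing cached
```

The trade-off is recomputation cost on repeated calls, which is why it is framed as a short-term fix.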
Renaming is fine with me.
I am working on this.