
Memory leak with pretraining

Open · albertz opened this issue

We have a case (reported by @mmz33) where a config with pretraining increases memory usage to well over 16GB, while the same config without pretraining (fixed network, no net reconstruction) stays around 6GB.

We look at the RSS memory usage. The job is killed by SGE when it goes over the specified limit (e.g. 16GB).

This might be related to the net construction, the creation of a new TF session, and/or a new TF graph.

@mmz33 or @JackTemaki might be able to provide some further logs or details.

albertz · Mar 14 '22 22:03

Small update: There was a small bug in the handling of loss_opts and loss_scale (fixed here: https://github.com/rwth-i6/returnn/commit/e4be10e7a322420e8296f0104eeb6a5e9ba97330) which caused a difference in the net dict, so the per-sub-epoch check of whether the net dict changed was always True, and thus it triggered a net reconstruction every sub epoch. However, within pretrain repetitions, the net dict actually did not change, so many of those reconstructions were unnecessary. With the fix, the memory consumption already looks much better. It looks almost as if there is no leak anymore, or at least no real problem anymore. (Reported by @mmz33)

albertz · Mar 16 '22 09:03

To update: with the fix mentioned above, we have not seen any problems anymore. The underlying leak is probably not really fixed, but it is also not really on our side anyway, so I don't think we can do anything about it directly.

So the recommendation is to write the pretrain/get_network logic in such a way that it does not cause a re-init too often, and certainly not every sub epoch.
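
As a hypothetical sketch of what that can look like: a get_network in the config that maps the epoch to a small number of pretrain stages, so the returned net dict is identical within each stage and RETURNN skips the reconstruction there (the exact signature and layer options depend on the setup):

```python
# Hypothetical sketch of a config get_network that keeps the net dict
# constant within each pretrain stage, so RETURNN only re-inits the
# network (TF graph/session) at stage boundaries, not every sub epoch.

def get_network(epoch: int, **kwargs) -> dict:
    # Map the (1-based) epoch to a small number of pretrain stages,
    # e.g. grow the encoder by 2 layers every 5 sub epochs.
    num_stages = 3
    stage = min((epoch - 1) // 5, num_stages - 1)
    num_layers = 2 * (stage + 1)

    net = {}
    src = "data"
    for i in range(num_layers):
        net["enc_%i" % i] = {"class": "linear", "activation": "relu",
                             "n_out": 512, "from": src}
        src = "enc_%i" % i
    net["output"] = {"class": "softmax", "loss": "ce", "from": src}
    # As long as this dict is identical to the one from the previous
    # sub epoch, RETURNN skips the net reconstruction.
    return net
```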

albertz · Sep 01 '22 12:09

I'm reopening this because I have a potential idea for a workaround:

After a number of re-inits, we could just restart RETURNN, e.g. as explained here. This can be fully automatic. It is a quite simple but pragmatic approach. There could be a flag to enable this, and maybe also some warnings based on automatic memory measurement if the flag is not enabled.
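
A minimal sketch of what such a self-restart could look like (hypothetical helper, not existing RETURNN code; it assumes re-exec'ing the process with the same command line is safe at that point, e.g. right after a checkpoint was saved, so training resumes from there):

```python
import os
import sys

_num_net_reinits = 0

def maybe_restart_returnn(max_reinits: int = 20):
    """Hypothetical helper: count net re-inits and, once the threshold is
    reached, replace the current process by a fresh one with the same
    command line, so all leaked memory (TF graphs, sessions, etc.) is
    released by the OS and training resumes from the latest checkpoint."""
    global _num_net_reinits
    _num_net_reinits += 1
    if _num_net_reinits >= max_reinits:
        sys.stdout.flush()
        sys.stderr.flush()
        os.execv(sys.executable, [sys.executable] + sys.argv)
```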

albertz · Sep 05 '22 11:09

We could introduce a new option like restart_after_num_net_reinit = 20 or so. What would be a good default? Or should it be disabled by default? (@mmz33 @JackTemaki)

albertz · Oct 14 '22 19:10

Of course there should be some test case for this. The test case is maybe the most difficult part of the whole feature, which is otherwise quite trivial. Obviously the test should be as simple as possible. It should probably start RETURNN in a subprocess with a very minimal net, force enough re-inits to trigger a restart, and then check that the restart actually happened, e.g. by having the config append to a file each time it is loaded.
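
A rough sketch of that idea (hypothetical; paths, config contents, and the way the net dict is forced to change are placeholders):

```python
import os
import subprocess
import sys
import tempfile

def test_restart_after_num_net_reinit():
    """Hypothetical test sketch.

    Idea: run RETURNN in a subprocess on a tiny config that forces a net
    re-init every sub epoch and sets restart_after_num_net_reinit = 1.
    The config appends a line to a marker file each time it is loaded,
    so afterwards we can count how often the process (re)started."""
    returnn_root = os.path.dirname(os.path.abspath(__file__))  # assumption
    with tempfile.TemporaryDirectory() as tmp_dir:
        marker = os.path.join(tmp_dir, "config_loads.txt")
        config = os.path.join(tmp_dir, "returnn.config")
        with open(config, "w") as f:
            f.write("task = 'train'\n")
            f.write("restart_after_num_net_reinit = 1\n")
            # ... plus a minimal dataset and a get_network that changes
            # the net dict every sub epoch, to force the re-inits ...
            f.write("open(%r, 'a').write('loaded\\n')\n" % marker)
        subprocess.check_call(
            [sys.executable, os.path.join(returnn_root, "rnn.py"), config],
            timeout=300)
        with open(marker) as f:
            num_loads = len(f.read().splitlines())
        # With a restart after every re-init, the config must have been
        # loaded more than once.
        assert num_loads > 1
```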

albertz · Oct 14 '22 20:10

Now with returnn-common, with a bigger network (not in terms of parameters, but in terms of net dict size), specifically with the Conformer, this seems to become more of a problem again. Now even after only 4 re-inits or so, I see memory issues.

I'm also still not 100% confident that this is purely a problem of TF and that RETURNN itself is fine. I think we never actually did any memory profiling on this, or did we? We should study it a bit more, at least with some Python memory profiling, to see whether the memory is held by Python objects (those, and their source, should be easy to detect) or not (in which case it is probably TF).
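
For the Python side, a minimal tracemalloc-based check could look like this (just a sketch of the idea; it would be wrapped around one net re-init):

```python
import tracemalloc

tracemalloc.start(25)  # keep 25 frames of traceback per allocation

snapshot_before = tracemalloc.take_snapshot()
# ... trigger one net re-init here (new net dict, new TF graph/session) ...
snapshot_after = tracemalloc.take_snapshot()

# Show where Python-side memory grew. If the growth reported here is small
# compared to the RSS increase, the leak is likely not in Python objects
# (e.g. it sits inside TF / native allocations instead).
for stat in snapshot_after.compare_to(snapshot_before, "traceback")[:10]:
    print(stat)
    for line in stat.traceback.format():
        print("   ", line)
```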

albertz · Oct 24 '22 09:10

Note that with returnn-common, it goes through all of the returnn-common model construction code in each epoch. It then creates a net dict, compares it to the previous one, and skips the re-init if there is no change. However, maybe the returnn-common model construction code itself also causes memory leaks? I looked a bit through the log, watching how RSS: ... changes, but it does not really seem to change in those epochs where there is no re-init: it fluctuates between 3.3GB and 4.7GB or so. At the first re-init, in sub epoch 11, it increases and then fluctuates between 4.5GB and 6.1GB. I assume it increases similarly at further re-inits but otherwise keeps fluctuating in such a range of ±0.8GB.
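
To make such comparisons easier, one could extract the values directly from the log (sketch; it assumes the RSS is reported in lines of the form "... RSS: 4.7GB ...", the exact format may differ):

```python
import re
import sys

# Hypothetical helper: scan a RETURNN training log and print every reported
# RSS value, so that jumps at the net re-inits become easy to spot.
rss_pattern = re.compile(r"RSS:\s*([\d.]+)\s*GB")

with open(sys.argv[1]) as log_file:
    for line_num, line in enumerate(log_file, 1):
        match = rss_pattern.search(line)
        if match:
            print("line %6i: RSS %.2f GB" % (line_num, float(match.group(1))))
```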

albertz · Oct 24 '22 12:10

Via #1182 (and a few follow-up commits), I have now implemented the option restart_after_num_net_reinit. You can even set restart_after_num_net_reinit = 1.
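
Example usage in the config (the values are the ones discussed above):

```python
# Restart the RETURNN process after every single net re-init:
restart_after_num_net_reinit = 1

# Or, less aggressively, only after a larger number of re-inits:
# restart_after_num_net_reinit = 20
```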

albertz · Oct 25 '22 20:10