Memory leak with pretraining
We have the case (reported by @mmz33) that a config with pretraining increases memory usage to well over 16GB, while the same config without pretraining (fixed network, no net reconstruction) stays around 6GB.
We look at the RSS memory usage. The job is killed by SGE when it exceeds the specified limit (e.g. 16GB).
This might be related to the net construction, to the recreation of a new TF session, and/or to the new TF graph.
@mmz33 or @JackTemaki might be able to provide some further logs or details.
Small update: There was a small bug in the handling of loss_opts and loss_scale (fixed here: https://github.com/rwth-i6/returnn/commit/e4be10e7a322420e8296f0104eeb6a5e9ba97330) which caused a spurious difference in the net dict, so the per-sub-epoch check whether the net dict changed was always true, and thus the net was reconstructed every sub epoch. Within pretrain repetitions, however, the net dict actually did not change, so many of those reconstructions were unnecessary. With the fix, the memory consumption already looks much better. It looks almost as if there is no leak anymore, or at least no real problem anymore. (Reported by @mmz33)
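For illustration, that check works roughly like this (a simplified sketch, not the actual RETURNN code; maybe_reinit_network and reinit_fn are hypothetical names):

```python
# Simplified sketch of the per-sub-epoch decision; not the actual RETURNN code.
def maybe_reinit_network(new_net_dict, prev_net_dict, reinit_fn):
    """Rebuild the TF graph/session only if the net dict actually changed."""
    if new_net_dict == prev_net_dict:
        return prev_net_dict  # no change: keep the existing graph and session
    reinit_fn(new_net_dict)  # expensive: new TF graph, new session, new variables
    return new_net_dict
```

The bug above made the two dicts always compare unequal, so the expensive branch was taken every sub epoch.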
Update: With the fix mentioned above, we have not seen any problems anymore. The underlying leak is probably not really fixed, but it is also not really on our side anyway, so I don't think we can do much about it.
So the recommendation is to write pretrain/get_network in such a way that it does not cause a re-init too often, and certainly not every epoch.
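As an illustration (a rough sketch, assuming the config-level get_network(epoch, **kwargs) hook; num_layers_for_epoch and the layer options are purely exemplary), the idea is to change the returned net dict only at a few coarse epoch boundaries, so the comparison usually sees no change and no re-init is triggered:

```python
# Sketch: grow the network in coarse steps instead of every epoch, so that the
# net dict stays identical for long stretches and re-inits are rare.
def num_layers_for_epoch(epoch):
    if epoch <= 10:
        return 2
    if epoch <= 20:
        return 4
    return 6

def get_network(epoch, **kwargs):
    num_layers = num_layers_for_epoch(epoch)
    net = {}
    src = "data"
    for i in range(num_layers):
        net["lstm%i" % i] = {"class": "rec", "unit": "nativelstm2", "n_out": 512, "from": src}
        src = "lstm%i" % i
    net["output"] = {"class": "softmax", "loss": "ce", "from": src}
    return net
```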
I'm reopening this because I have a potential idea for a workaround:
After a number of re-inits, we can just restart RETURNN, e.g. as explained here. This can be fully automatic. It is quite a simple but pragmatic approach. There could be a flag to enable this, and maybe also a warning based on automatic memory measurement if the flag is not enabled.
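For reference, one common way for a Python process to restart itself looks like this (a generic sketch only; whether RETURNN uses exactly this mechanism is not implied here):

```python
# Generic self-restart sketch: replace the current process with a fresh one,
# using the same interpreter and command-line arguments.
import os
import sys

def restart_self():
    sys.stdout.flush()
    sys.stderr.flush()
    os.execv(sys.executable, [sys.executable] + sys.argv)
```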
We could introduce a new option like restart_after_num_net_reinit = 20 or so. What would be a good default? Or disabled by default? (@mmz33 @JackTemaki)
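In the config, this would just be a single entry (the value is the one discussed here, not a vetted default):

```python
# Restart the whole RETURNN process after this many network re-inits,
# to give back memory that is otherwise not released.
restart_after_num_net_reinit = 20
```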
Of course there should be some test case for this. The test case is maybe the most difficult part of the whole feature, which is otherwise quite trivial. Obviously the test should be as simple as possible. It should probably start RETURNN in a subprocess with a very minimal net, trigger the restart, and then check that the restart actually happened, e.g. by having the config create or append to a marker file whenever it is loaded; see the sketch below.
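A rough sketch of such a test (hypothetical, not existing test code; the call to "rnn.py" in the current directory and the omitted config contents are assumptions):

```python
# Hypothetical test sketch: run RETURNN in a subprocess with a tiny config that
# appends to a marker file on every config load, force a restart after every net
# re-init, and check that the marker was written more than once.
import os
import subprocess
import tempfile

def test_restart_after_num_net_reinit():
    with tempfile.TemporaryDirectory() as tmp:
        marker = os.path.join(tmp, "config_loads.txt")
        config = os.path.join(tmp, "test_restart.config")
        with open(config, "w") as f:
            f.write("#!rnn.py\n")
            f.write("with open(%r, 'a') as _f:\n    _f.write('loaded\\n')\n" % marker)
            # ... plus a minimal net, a dummy dataset, a get_network that changes
            # the net dict every epoch, num_epochs = 3,
            # and restart_after_num_net_reinit = 1 ...
        subprocess.check_call(["python3", "rnn.py", config])
        # With a restart after every re-init, the config must have been loaded
        # (and thus the marker written) more than once.
        with open(marker) as f:
            assert len(f.readlines()) > 1
```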
Now with returnn-common, with a bigger network (not in terms of parameters, but in terms of net dict size), specifically with the Conformer, this seems to become more of a problem again. Even after only 4 reinits or so, I see memory issues.
I'm also still not 100% confident that this is purely a problem of TF and that RETURNN is all fine. I don't think we ever actually did any memory profiling on this, or did we? We should study it a bit more, at least with some Python memory profiling, to see whether the memory is held by Python objects (those, and their allocation sites, should be easy to detect) or not (then it is probably TF).
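A minimal sketch of the kind of Python-side profiling meant here, using the standard tracemalloc module (where exactly to hook this into the training loop is left open):

```python
# Snapshot before and after one net re-init and print the allocation sites that
# grew. If the growth does not show up here, the leak is likely below Python
# (i.e. inside TF), not in Python objects.
import tracemalloc

tracemalloc.start(25)  # keep up to 25 stack frames per allocation

snapshot_before = tracemalloc.take_snapshot()
# ... trigger one network re-init here ...
snapshot_after = tracemalloc.take_snapshot()

for stat in snapshot_after.compare_to(snapshot_before, "traceback")[:10]:
    print(stat)
    for line in stat.traceback.format():
        print(" ", line)
```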
Note that with returnn-common, all the returnn-common model construction code runs in each epoch. It then creates a net dict, compares it to the previous one, and if there is no change, it skips the reinit. But maybe the returnn-common model construction code itself also leaks memory? I looked a bit through the log, watching how RSS: ... changes, and it does not seem to really change in those epochs without a reinit; it fluctuates between 3.3GB and 4.7GB or so. At the first reinit, in subepoch 11, it increases and then fluctuates between 4.5GB and 6.1GB. I assume it increases similarly at further reinits but otherwise stays within such a range of ±0.8GB.
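For cross-checking the RSS: ... numbers in the log, something like this can be used (Linux-only sketch; RETURNN's own logging may use a different mechanism):

```python
# Read the current resident set size (RSS) of this process from /proc.
def get_rss_bytes():
    with open("/proc/self/status") as f:
        for line in f:
            if line.startswith("VmRSS:"):
                # line looks like "VmRSS:   1234567 kB"
                return int(line.split()[1]) * 1024
    return -1

print("RSS: %.2f GB" % (get_rss_bytes() / 1024.0 ** 3))
```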
Via #1182 (and a few follow-up commits), I implemented the option restart_after_num_net_reinit now. You can even set restart_after_num_net_reinit = 1.