Mava
Memory usage keeps on increasing
When trying to run
python examples/meltingpot/train_on_substrates.py
RAM usage keeps increasing until memory is full and the process gets killed.
Is this expected or is there a memory leak somewhere?
Thank you for reporting this, @kinalmehta. I think this results from repeated retracing by tf.function. It appears that this memory issue was not noticed initially because the script was run on a VM. I'll look into the retracing issue to ensure that the script runs seamlessly, even for local debugging.
The issue seems to be caused by the parameters passed to the _policy method: they are not TF objects, and @tf.function only seems to support TF objects, based on the following warning from TF:
[evaluator/0] W0114 15:54:21.299422 140487848744128 def_function.py:150] 6 out of the last 6 calls to <function MADQNFeedForwardExecutor._policy at 0x7fc52c2078b0> triggered tf.function retracing. Tracing is expensive and the excessive number of tracings could be due to (1) creating @tf.function repeatedly in a loop, (2) passing tensors with different shapes, (3) passing Python objects instead of tensors. For (1), please define your @tf.function outside of the loop. For (2), @tf.function has experimental_relax_shapes=True option that relaxes argument shapes that can avoid unnecessary retracing. For (3), please refer to https://www.tensorflow.org/guide/function#controlling_retracing and https://www.tensorflow.org/api_docs/python/tf/function for more details.
Is there a workaround, or are there any plans to handle this? I'm ready to help with the implementation.
You are right, @kinalmehta. You can use the _policy method without tf.function to avoid retracing.
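Cause (3) from the warning can be reproduced in isolation. The sketch below is not Mava code: `policy` is a hypothetical stand-in for `_policy`, and `trace_count` is an illustrative counter. It relies on the fact that Python side effects inside a `tf.function` run only while the function is being traced, so the counter shows how many graphs were built.

```python
import tensorflow as tf

trace_count = 0  # incremented only while tf.function is tracing


@tf.function
def policy(observation, epsilon):
    # Hypothetical stand-in for an executor's _policy method.
    global trace_count
    trace_count += 1  # Python side effect: runs once per trace, not per call
    return observation * epsilon


obs = tf.ones([4], dtype=tf.float32)

# Python floats are baked into the graph as constants,
# so every new value forces a new trace.
for eps in (0.9, 0.8, 0.7):
    policy(obs, eps)
python_arg_traces = trace_count

# Tensors with a fixed dtype and shape reuse a single trace.
trace_count = 0
for eps in (0.9, 0.8, 0.7):
    policy(obs, tf.constant(eps, dtype=tf.float32))
tensor_arg_traces = trace_count

print("traces with Python floats:", python_arg_traces)
print("traces with tensors:", tensor_arg_traces)
```

Each trace builds and caches a new graph, which is one plausible way for memory to grow steadily when an argument takes many distinct Python values.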
@ldfrancis could we perhaps look into this in more detail? Since tf.function can provide significant speedups in some cases, it might be worth seeing if there is a simple fix (i.e., ensuring we feed the policy a TF object) to avoid the retracing. @EdanToledo this could also be the source of the memory issues in PPO.
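A minimal sketch of what such a fix could look like, assuming the offending arguments can be converted to tensors. The names `policy`, `policy_graph`, and the observation shape `[None, 4]` are made up for illustration and are not taken from Mava. The wrapper converts its inputs with `tf.convert_to_tensor`, and `input_signature` pins the compiled function to a single trace.

```python
import numpy as np
import tensorflow as tf

trace_count = 0  # illustrative counter, incremented only during tracing


@tf.function(input_signature=[
    tf.TensorSpec(shape=[None, 4], dtype=tf.float32),  # assumed obs shape
    tf.TensorSpec(shape=[], dtype=tf.float32),         # scalar epsilon
])
def policy_graph(observation, epsilon):
    # Hypothetical compiled policy body.
    global trace_count
    trace_count += 1
    return observation * epsilon


def policy(observation, epsilon):
    # Convert NumPy arrays and Python scalars to tensors before the call,
    # so the compiled function only ever sees TF objects.
    obs = tf.convert_to_tensor(observation, dtype=tf.float32)
    eps = tf.convert_to_tensor(epsilon, dtype=tf.float32)
    return policy_graph(obs, eps)


# Different batch sizes and epsilon values all hit the same trace.
for batch, eps in ((1, 0.9), (2, 0.8), (3, 0.7)):
    out = policy(np.ones((batch, 4), dtype=np.float32), eps)

print("traces:", trace_count)
```

For local debugging, `tf.config.run_functions_eagerly(True)` is another option: it runs every `tf.function` eagerly without removing the decorators, at the cost of the graph-mode speedup.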
Closing all TF issues as we are deprecating our TF systems.