Memory usage keeps on increasing

Open kinalmehta opened this issue 2 years ago • 4 comments

When trying to run

python examples/meltingpot/train_on_substrates.py

RAM usage keeps increasing until memory is full and the process gets killed.

Is this expected or is there a memory leak somewhere?

kinalmehta avatar Jan 10 '22 09:01 kinalmehta

Thank you for reporting this, @kinalmehta. I think this results from repeated retracing by tf.function. It appears that the memory issue was not noticed initially because the script was run on a VM. I'll look into the retracing issue to ensure that the script runs seamlessly, even for local debugging.

ldfrancis avatar Jan 11 '22 08:01 ldfrancis

The issue seems to be that the parameters passed to the _policy method are not TF objects, and @tf.function only seems to support TF objects, based on the following warning from TF:

[evaluator/0] W0114 15:54:21.299422 140487848744128 def_function.py:150] 6 out of the last 6 calls to <function MADQNFeedForwardExecutor._policy at 0x7fc52c2078b0> triggered tf.function retracing. Tracing is expensive and the excessive number of tracings could be due to (1) creating @tf.function repeatedly in a loop, (2) passing tensors with different shapes, (3) passing Python objects instead of tensors. For (1), please define your @tf.function outside of the loop. For (2), @tf.function has experimental_relax_shapes=True option that relaxes argument shapes that can avoid unnecessary retracing. For (3), please refer to https://www.tensorflow.org/guide/function#controlling_retracing and https://www.tensorflow.org/api_docs/python/tf/function for  more details.
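Cause (3) in the warning can be reproduced in isolation. The snippet below is a minimal sketch, assuming TensorFlow 2.x; `policy`, `observation` and `epsilon` are hypothetical stand-ins and not the actual signature of Mava's `_policy`. It counts traces with a Python side effect, which only runs while tf.function is tracing.

```python
import tensorflow as tf

trace_count = 0  # incremented only while tf.function traces the Python body


@tf.function
def policy(observation, epsilon):
    # Hypothetical stand-in for a policy step, not Mava's _policy.
    global trace_count
    trace_count += 1  # Python side effect: executes at trace time only
    return observation * epsilon


obs = tf.ones([4])

# Python floats are treated as constants, so each new value builds a new graph.
for eps in (0.9, 0.8, 0.7):
    policy(obs, eps)
traces_with_python_floats = trace_count

# Scalar tensors share one signature (float32, shape []), so one trace is reused.
for eps in (0.9, 0.8, 0.7):
    policy(obs, tf.constant(eps))
traces_with_tensors = trace_count - traces_with_python_floats

print(traces_with_python_floats, traces_with_tensors)
```

Every retained trace is a separate graph held in memory, so a value that changes on every call (for example a decaying exploration rate passed as a Python float) would make memory grow steadily.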

Is there a workaround, or are there any plans to handle this? I'm ready to help with the implementation side.

kinalmehta avatar Jan 14 '22 12:01 kinalmehta

You are right, @kinalmehta. You can use the _policy method without tf.function to avoid retracing.

ldfrancis avatar Jan 21 '22 11:01 ldfrancis

@ldfrancis could we perhaps look into this in more detail? Since tf.function can provide significant speedups in some cases, it might be worth seeing if there is a simple fix (i.e., ensuring we feed the policy a TF object) to avoid the retracing. @EdanToledo this could also be the source of the memory issues in PPO.
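One possible shape for such a fix, sketched under assumptions: convert Python values to tensors before the call and pin the accepted signature with `input_signature`, so that tf.function traces once. The function name, argument names and shapes here are hypothetical and not taken from Mava's executor.

```python
import tensorflow as tf

trace_count = 0  # incremented only while tf.function traces the Python body


@tf.function(
    input_signature=[
        tf.TensorSpec(shape=[None], dtype=tf.float32),  # observation, any length
        tf.TensorSpec(shape=[], dtype=tf.float32),      # scalar epsilon
    ]
)
def policy(observation, epsilon):
    # Hypothetical stand-in for a policy step, not Mava's _policy.
    global trace_count
    trace_count += 1  # Python side effect: executes at trace time only
    return observation * epsilon


for eps in (0.9, 0.8, 0.7):
    # Convert the Python float explicitly so the call matches the signature.
    eps_tensor = tf.convert_to_tensor(eps, dtype=tf.float32)
    policy(tf.ones([4]), eps_tensor)

# A different observation length still fits shape=[None], so no new trace.
policy(tf.ones([6]), tf.constant(0.5))

print(trace_count)
```

With a fixed signature, a call that does not match the declared specs raises an error instead of silently adding another trace, which makes this kind of problem easier to spot.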

arnupretorius avatar Feb 01 '22 09:02 arnupretorius

Closing all TF issues as we are deprecating our TF systems.

DriesSmit avatar Sep 08 '22 14:09 DriesSmit