EfficientNetV2: memory leak in the TF2 implementation?
I tried running the efficientnet v2 implementation:
python3 main_tf2.py --mode=train --model_name=efficientnetv2-s --dataset_cfg=imagenet --model_dir=/home/eran/rm-debug/efficientnet/models/automl/efficientnetv2 --use_tpu=False --data_dir=/sda/Eran/imagenet/imagenet-home/tf_records/train.
While training I get a memory leak: the memory of the process keeps increasing, even after an epoch ends, and reaches hundreds of GB.
Restarting the training process frees all the memory, and the process can continue from where it stopped, even saving new checkpoints.
Can you solve the bug?
hparams.py.txt
Thanks
--hparam_str="data.cache=False"
It seems to improve the situation a lot!
Are there any other tricks? Currently it's at about 50 GB.
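For context on why `data.cache=False` helps: assuming the `data.cache` hparam maps onto tf.data's `Dataset.cache()`, caching stores every decoded element in host RAM during the first epoch, so memory grows with the dataset size. A toy pure-Python sketch of that mechanism (not the actual tf.data code):

```python
class CachingPipeline:
    """Toy model of tf.data's Dataset.cache(): the first pass stores
    every element in host memory, so RAM grows with dataset size."""

    def __init__(self, source):
        self.source = source      # callable returning a fresh iterable
        self._cache = []
        self._filled = False

    def __iter__(self):
        if self._filled:
            # Later epochs replay from the in-memory cache.
            yield from self._cache
        else:
            for item in self.source():
                self._cache.append(item)  # memory grows here, once per element
                yield item
            self._filled = True
```

With a dataset the size of ImageNet, that first-epoch accumulation alone can account for tens of GB, which is why disabling the cache flattens the growth.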
Also, I get this warning:
WARNING:tensorflow:Callback method on_train_batch_end is slow compared to the batch time (batch time: 0.2466s vs on_train_batch_end time: 0.2629s). Check your callbacks.
Should I do something about that?
Hmmm, the memory still grows from epoch to epoch; within a single epoch it grew by 5 GB. The question is why.
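To narrow down where the growth happens, one option is to log process memory at epoch boundaries and force a garbage-collection pass. This is a generic debugging sketch (the function names are made up, and gc.collect() only helps if the leak is in Python-object references rather than inside the TF runtime):

```python
import gc
import resource


def log_peak_rss(tag):
    # ru_maxrss is reported in kilobytes on Linux (bytes on macOS).
    rss_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    print(f"[{tag}] peak RSS: {rss_kb} kB")
    return rss_kb


def end_of_epoch_cleanup():
    # Force a garbage-collection pass between epochs; a common workaround
    # when Python-side references accumulate during training. Returns the
    # number of unreachable objects found.
    return gc.collect()
```

Calling `log_peak_rss` from a Keras `LambdaCallback`'s `on_epoch_end` would show whether the growth is stepwise per epoch or gradual within one, which helps decide if the leak is in the input pipeline or in the training loop itself.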
Try this: https://github.com/google/automl/issues/923. It works for me.