
EfficientNetV2: memory leak in the TF2 implementation?

Open exx8 opened this issue 3 years ago • 4 comments

I tried running the EfficientNetV2 implementation:

python3 main_tf2.py --mode=train --model_name=efficientnetv2-s --dataset_cfg=imagenet --model_dir=/home/eran/rm-debug/efficientnet/models/automl/efficientnetv2 --use_tpu=False --data_dir=/sda/Eran/imagenet/imagenet-home/tf_records/train

While training I see a memory leak: the memory of the process keeps increasing, even after an epoch ends, and reaches hundreds of GB. Restarting the training process frees all the memory, and the run seems able to continue from where it stopped, even saving new checkpoints. Can you fix this bug? hparams.py.txt

Thanks

exx8 avatar May 08 '22 18:05 exx8

--hparam_str="data.cache=False"

fsx950223 avatar May 09 '22 13:05 fsx950223
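For context, the suggested flag disables the data pipeline's in-memory caching. Below is a minimal pure-Python sketch (not the repo's actual code) of why a cached pipeline makes resident memory grow with dataset size across epochs, which matches the hundreds-of-GB symptom on ImageNet-scale data:

```python
class CachedDataset:
    """Sketch of what an in-memory cache (like tf.data's .cache()) does:
    after the first pass, every element is kept in host RAM so later
    epochs can be served without re-reading from disk."""

    def __init__(self, make_epoch, cache=True):
        self.make_epoch = make_epoch
        self.cache = cache
        self._buf = []  # grows to hold the whole dataset when cache=True

    def __iter__(self):
        if self.cache and self._buf:
            yield from self._buf  # epoch 2+: served entirely from RAM
            return
        for x in self.make_epoch():
            if self.cache:
                self._buf.append(x)  # memory cost grows with the dataset
            yield x


# Stand-in for a 5-element dataset; a real ImageNet epoch would buffer
# hundreds of GB of decoded examples here.
ds = CachedDataset(lambda: range(5), cache=True)
list(ds)  # first epoch fills the cache
list(ds)  # second epoch reads from the cache
print(len(ds._buf))  # 5 elements now held in RAM between epochs
```

With `cache=False` the buffer stays empty and each epoch re-reads the source, trading disk I/O for a flat memory footprint, which is the trade-off `data.cache=False` makes.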

It seems to improve the situation a lot! Are there any other tricks? Currently it's at about 50 GB. I also get this warning: WARNING:tensorflow:Callback method on_train_batch_end is slow compared to the batch time (batch time: 0.2466s vs on_train_batch_end time: 0.2629s). Check your callbacks. Should I do something about that?

exx8 avatar May 10 '22 10:05 exx8
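Regarding the warning above: Keras emits it when a callback's per-batch hook takes longer than the training step itself (here 0.2629 s vs 0.2466 s), so the callback roughly doubles step time. A common mitigation, sketched below in plain Python (the class and `every_n` parameter are hypothetical, not from this repo), is to throttle the expensive work so it runs every N batches instead of every batch:

```python
class ThrottledCallback:
    """Hypothetical sketch: run expensive per-batch work (logging,
    checkpoint saving, metric export) only every `every_n` batches."""

    def __init__(self, every_n=100):
        self.every_n = every_n
        self.calls = 0  # counts how often the expensive work actually ran

    def on_train_batch_end(self, batch):
        if batch % self.every_n == 0:
            self.calls += 1  # expensive work would go here


cb = ThrottledCallback(every_n=100)
for batch in range(1000):
    cb.on_train_batch_end(batch)
print(cb.calls)  # expensive work ran 10 times instead of 1000
```

The same idea applies to built-in callbacks, e.g. saving checkpoints per epoch (or with a batch-count save frequency) rather than per batch.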

Hmmm, the memory still grows from epoch to epoch; within one epoch it grew by 5 GB. The question is why.

exx8 avatar May 10 '22 11:05 exx8

Try this: https://github.com/google/automl/issues/923. It works for me.

veiii avatar Jul 22 '22 13:07 veiii