
EfficientNetV2: memory leak in the TF2 implementation?

Open exx8 opened this issue 3 years ago • 4 comments

I tried running the EfficientNetV2 implementation:

python3 main_tf2.py --mode=train --model_name=efficientnetv2-s --dataset_cfg=imagenet --model_dir=/home/eran/rm-debug/efficientnet/models/automl/efficientnetv2 --use_tpu=False --data_dir=/sda/Eran/imagenet/imagenet-home/tf_records/train

While training I see a memory leak: the memory of the process keeps increasing, even after an epoch ends, and reaches hundreds of GB. Restarting the training process frees all the memory, and the run seems able to continue from where it stopped, even saving new checkpoints. Can you fix this bug? hparams.py.txt

Thanks

exx8 avatar May 08 '22 18:05 exx8

--hparam_str="data.cache=False"

fsx950223 avatar May 09 '22 13:05 fsx950223
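For context, the suggested flag disables the data pipeline's in-memory caching. Below is a minimal pure-Python sketch (not the repo's actual code) of why a cached pipeline makes resident memory grow with dataset size across epochs, which matches the hundreds-of-GB symptom on ImageNet-scale data:

```python
class CachedDataset:
    """Sketch of what an in-memory cache (like tf.data's .cache()) does:
    after the first pass, every element is kept in host RAM so later
    epochs can be served without re-reading from disk."""

    def __init__(self, make_epoch, cache=True):
        self.make_epoch = make_epoch
        self.cache = cache
        self._buf = []  # grows to hold the whole dataset when cache=True

    def __iter__(self):
        if self.cache and self._buf:
            yield from self._buf  # epoch 2+: served entirely from RAM
            return
        for x in self.make_epoch():
            if self.cache:
                self._buf.append(x)  # memory cost grows with the dataset
            yield x


# Stand-in for a 5-element dataset; a real ImageNet epoch would buffer
# hundreds of GB of decoded examples here.
ds = CachedDataset(lambda: range(5), cache=True)
list(ds)  # first epoch fills the cache
list(ds)  # second epoch reads from the cache
print(len(ds._buf))  # 5 elements now held in RAM between epochs
```

With `cache=False` the buffer stays empty and each epoch re-reads the source, trading disk I/O for a flat memory footprint, which is the trade-off `data.cache=False` makes.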

It seems to improve the situation a lot! Are there any other tricks? Currently it's at about 50 GB. I also get this warning: WARNING:tensorflow:Callback method on_train_batch_end is slow compared to the batch time (batch time: 0.2466s vs on_train_batch_end time: 0.2629s). Check your callbacks. Should I do something about that?

exx8 avatar May 10 '22 10:05 exx8
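Regarding the warning above: Keras emits it when a callback's per-batch hook takes longer than the training step itself (here 0.2629 s vs 0.2466 s), so the callback roughly doubles step time. A common mitigation, sketched below in plain Python (the class and `every_n` parameter are hypothetical, not from this repo), is to throttle the expensive work so it runs every N batches instead of every batch:

```python
class ThrottledCallback:
    """Hypothetical sketch: run expensive per-batch work (logging,
    checkpoint saving, metric export) only every `every_n` batches."""

    def __init__(self, every_n=100):
        self.every_n = every_n
        self.calls = 0  # counts how often the expensive work actually ran

    def on_train_batch_end(self, batch):
        if batch % self.every_n == 0:
            self.calls += 1  # expensive work would go here


cb = ThrottledCallback(every_n=100)
for batch in range(1000):
    cb.on_train_batch_end(batch)
print(cb.calls)  # expensive work ran 10 times instead of 1000
```

The same idea applies to built-in callbacks, e.g. saving checkpoints per epoch (or with a batch-count save frequency) rather than per batch.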

Hmmm, the memory still grows from epoch to epoch; within one epoch it grew by 5 GB. The question is why.

exx8 avatar May 10 '22 11:05 exx8

Try this: https://github.com/google/automl/issues/923. It works for me.

veiii avatar Jul 22 '22 13:07 veiii