Since EfficientDet requires TensorFlow > 2.8 we can't train anymore with CUDA

Open fitoule opened this issue 3 years ago • 4 comments

I have only one NVIDIA GPU. I was training with TensorFlow 2.5.2 because of the bug with GPU and multiprocessing.

  • TF2.8 and No Child Process => works but Memory Leak :(

  • TF2.8 and Child Process => CUDA error on the first epoch because GPU has been taken by the main process https://github.com/google/automl/issues/855

  • TF2.5.2 and Child Process => does not work anymore since the determinism fix
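For reference, the "Child Process" setup above can be sketched roughly as follows: each epoch runs in a fresh Python process, so memory held by that process is returned to the OS when it exits. This is only a minimal sketch, not the code from this repository. `CHILD_CODE` is a hypothetical stand-in for a real per-epoch training script, and the sketch assumes the parent never imports TensorFlow, so the main process does not take the GPU before the child starts.

```python
import subprocess
import sys

# Hypothetical stand-in for a real per-epoch training script. A real child
# would import TensorFlow, restore the latest checkpoint, train one epoch,
# and save a new checkpoint before exiting.
CHILD_CODE = "import sys; print('trained epoch', sys.argv[1])"


def run_epochs_in_children(num_epochs, child_code=CHILD_CODE):
    """Run each epoch in a fresh Python process and collect its stdout."""
    outputs = []
    for epoch in range(num_epochs):
        # The parent only launches processes; it never touches the GPU itself.
        result = subprocess.run(
            [sys.executable, "-c", child_code, str(epoch)],
            capture_output=True,
            text=True,
            check=True,  # raise if the child exits with a non-zero status
        )
        outputs.append(result.stdout.strip())
    return outputs


if __name__ == "__main__":
    for line in run_epochs_in_children(3):
        print(line)
```

Launching a separate interpreter avoids inheriting any CUDA state from the parent, at the cost of reloading the model from a checkpoint on every epoch.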

It was working with TensorFlow up to 2.5.2, but now EfficientDet requires TF > 2.8, so I am stuck. I think I have to find the code from before the "determinism" change.

fitoule avatar Apr 15 '22 08:04 fitoule

  1. Migrate to tf2
  2. Set num_epochs=1 and num_examples_per_epoch=num_epochs * num_examples
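The arithmetic in step 2 can be sketched as below. It folds the whole schedule into one long epoch, so the training loop sees the same total number of examples while crossing only one epoch boundary. The helper name `single_epoch_overrides` and the example numbers are hypothetical. Joining the result into a comma-separated `key=value` string is also an assumption about how the overrides would be passed to the training script.

```python
def single_epoch_overrides(num_epochs, num_examples):
    """Fold a multi-epoch schedule into a single long epoch."""
    return {
        "num_epochs": 1,
        # Same total number of examples as the original schedule.
        "num_examples_per_epoch": num_epochs * num_examples,
    }


# Hypothetical numbers: 300 epochs over a dataset of 1000 examples.
overrides = single_epoch_overrides(num_epochs=300, num_examples=1000)

# Assumed format: comma-separated key=value pairs.
hparams = ",".join(f"{key}={value}" for key, value in overrides.items())
print(hparams)
```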

fsx950223 avatar Apr 15 '22 09:04 fsx950223

Do you mean I need to use the code under efficientdet/tf2/train.py, or migrate efficientdet/main.py myself?

thank you

fitoule avatar Apr 19 '22 07:04 fitoule

@fitoule you mentioned a memory leak. I am facing a memory leak too. Can you give more info?

exx8 avatar May 08 '22 19:05 exx8

I faced the same problem. I used traineval mode with TensorFlow 2.10 (then 2.13), and in both cases there was a memory leak after the first epoch. Training was fine, but during evaluation CocoCallback probably causes the memory leak. I commented out this line (https://github.com/google/automl/blob/master/efficientdet/tf2/train_lib.py#L220) and everything is fine.
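An alternative to editing the library file is to filter the callback out of the list before training starts. The sketch below is generic Python, not the actual train_lib.py code. `without_callbacks` is a hypothetical helper, the two classes are dummies so the example runs without TensorFlow, and the name match ignores case because the exact class name in the repository is an assumption here.

```python
def without_callbacks(callbacks, excluded_names):
    """Return the callbacks whose class name is not in excluded_names."""
    excluded = {name.lower() for name in excluded_names}
    return [cb for cb in callbacks if type(cb).__name__.lower() not in excluded]


# Dummy stand-ins for real Keras callbacks, used only for illustration.
class CocoCallback:
    pass


class CheckpointCallback:
    pass


callbacks = [CheckpointCallback(), CocoCallback()]
kept = without_callbacks(callbacks, ["CocoCallback"])
print([type(cb).__name__ for cb in kept])
```

Dropping the callback also drops the COCO evaluation metrics it would have reported, so evaluation would then need to be run separately.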

mateusz-wozny avatar Nov 21 '23 13:11 mateusz-wozny