Code freezes during validation step while training

Open dimaxano opened this issue 3 years ago • 4 comments

I run the following command:

python3 m4depth_pipeline.py --train_datadir=/home/dmitry/datasets/MidAir/pb/train/ --val_datadir='/home/dmitry/datasets/MidAir/pb/test/'  --log_dir=/home/dmitry/Documents/repos/M4Depth/logdir/ --dataset=midair --arch_depth=6 --db_seq_len=8 --seq_len=6 --num_batches=200000 -b=1 -g=1 --summary_interval_secs=120 --save_interval_secs=900 --validation_interval_secs=180 --eval_only_last_pic

After a bit of debugging, I found that the code gets stuck at that line.

Some info about setup:

  • tf 1.15
  • 2080Ti
  • MidAir dataset (RGB + Stereo Disparities)

@michael-fonder Do you have any idea where I should look for the source of the freeze?

dimaxano avatar Aug 14 '21 09:08 dimaxano

Hi @dimaxano

Sorry for the long delay.

The test dataset is quite large, and it takes several minutes to process it completely. So it may look like the code is frozen while it is in fact still processing the validation set. I need two more pieces of information before digging into the issue further:

  • How long did you wait before concluding that it was a freeze?
  • Is the GPU still active while you experience the freeze?

michael-fonder avatar Aug 23 '21 13:08 michael-fonder

Hi, @michael-fonder

I didn't measure the time for the whole test set, but I tried removing all test proto samples from the test folder except one and running on that. I still experience freezes lasting several minutes (and I cannot kill the process with a simple Ctrl-C; it just doesn't respond). And yes, GPU utilization is around zero during validation (checked with nvtop), but memory is allocated.

Using a bunch of tf.Print calls, I found that the problem is triggered by tf.reduce_mean in eval_func (in all the get_* functions inside it). As soon as I commented out the reduce_mean calls and replaced their results with just a tf.constant(1.0), validation ran smoothly (but not very helpfully, hah).
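For anyone hitting a similar hang where Ctrl-C does nothing, here is a minimal sketch of another way to see where the Python side is blocked, using the standard-library faulthandler module. It is not part of M4Depth: run_with_watchdog and slow_step are hypothetical names, and slow_step merely stands in for the validation call. If the stall is inside native TensorFlow code, the dump will probably only show the Python frame that made the call, which can still narrow things down.

```python
import faulthandler
import tempfile
import time


def run_with_watchdog(step_fn, timeout_s, log_file):
    """Run step_fn; if it is still running after timeout_s seconds,
    write the Python stack of every thread to log_file (the call keeps running)."""
    faulthandler.dump_traceback_later(timeout_s, exit=False, file=log_file)
    try:
        return step_fn()
    finally:
        # Disarm the watchdog once the call has returned (or raised).
        faulthandler.cancel_dump_traceback_later()


def slow_step():
    """Hypothetical stand-in for the validation call that appears to hang."""
    time.sleep(0.5)
    return "done"


if __name__ == "__main__":
    with tempfile.TemporaryFile() as log:
        run_with_watchdog(slow_step, 0.1, log)
        log.seek(0)
        print(log.read().decode())
```

In a real run one would wrap the call that triggers validation and pass a timeout longer than a normal validation pass, so the dump only fires when the step is actually stuck.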

dimaxano avatar Aug 23 '21 14:08 dimaxano

Hi @dimaxano ,

Ok, I think the problem comes from how the variables self.prev_f_pyr and self.prev_d_pyr in estimate_depth are initialized when the validation graph is built. The solution would probably be to use different variables for the train and test graphs.
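As a rough, framework-agnostic illustration of that idea (plain Python, not TensorFlow and not the actual M4Depth code), the sketch below keeps one independent copy of the previous-frame state per mode. PyramidState, DepthEstimator, the mode argument, and the return value are all assumptions made for the example; only the names prev_f_pyr, prev_d_pyr, and estimate_depth come from the code discussed here.

```python
class PyramidState:
    """Previous-frame feature and depth pyramids for a single graph."""

    def __init__(self):
        self.prev_f_pyr = None  # feature pyramid of the previous frame
        self.prev_d_pyr = None  # depth pyramid of the previous frame

    def update(self, f_pyr, d_pyr):
        self.prev_f_pyr = f_pyr
        self.prev_d_pyr = d_pyr


class DepthEstimator:
    """Hypothetical wrapper; stands in for the object owning estimate_depth."""

    def __init__(self):
        # One state object per mode, so building or running the validation
        # graph never reads or overwrites the training state.
        self._states = {"train": PyramidState(), "val": PyramidState()}

    def estimate_depth(self, f_pyr, d_pyr, mode):
        """Store the current pyramids for `mode` and report whether that
        mode already had a previous frame to work from."""
        state = self._states[mode]
        had_previous = state.prev_f_pyr is not None
        state.update(f_pyr, d_pyr)
        return had_previous
```

In TensorFlow 1.x the same separation would presumably be expressed with distinct variables (or variable scopes) for the two graphs, but that part is left out here since it depends on how the model is built.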

I'll try to correct and test this carefully as soon as I have some slack time in the next few days.

michael-fonder avatar Aug 25 '21 15:08 michael-fonder

Agreed, you may be right: when I replace est_resized here and here with gt, validation also runs smoothly.

I'll try to implement your fix and let you know if that works

dimaxano avatar Aug 26 '21 08:08 dimaxano