M4Depth
Code freezes during validation step while training
I run the following command:
python3 m4depth_pipeline.py --train_datadir=/home/dmitry/datasets/MidAir/pb/train/ --val_datadir='/home/dmitry/datasets/MidAir/pb/test/' --log_dir=/home/dmitry/Documents/repos/M4Depth/logdir/ --dataset=midair --arch_depth=6 --db_seq_len=8 --seq_len=6 --num_batches=200000 -b=1 -g=1 --summary_interval_secs=120 --save_interval_secs=900 --validation_interval_secs=180 --eval_only_last_pic
With a little debugging I found that the code gets stuck at that line.
Some info about setup:
- tf 1.15
- 2080Ti
- MidAir dataset (RGB + Stereo Disparities)
@michael-fonder Do you have any idea where I should look for the source of the freeze?
Hi @dimaxano
Sorry for the long delay.
The test dataset is quite large, and several minutes are needed to process it completely. So it may appear that the code is frozen while it is in fact still processing the validation set. I need two more pieces of information before digging into the issue further:
- How long did you wait before concluding that it was a freeze?
- Is the GPU still active while you experience the freeze?
Hi, @michael-fonder
I didn't measure the time for the whole test set, but I tried removing all test proto samples from the test folder except one and running on that. I still experience freezes for several minutes (I also cannot kill the process with a simple Ctrl-C; it just doesn't respond).
And yes, GPU utilization is around zero during validation (checked with nvtop), but memory is allocated.
Using a bunch of tf.Print calls I found that the problem is caused by tf.reduce_mean in eval_func (in all the get_* functions inside it). As soon as I commented out the reduce_mean calls and replaced their results with just a tf.constant(1.0), validation went smoothly (but not very helpfully, hah).
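To see where the process is stuck without scattering tf.Print calls, one option is a stack dump from Python's standard faulthandler module. A minimal sketch, assuming the validation step can be wrapped in a function; run_with_hang_watchdog and slow_validation_step are hypothetical names used for illustration and are not part of M4Depth:

```python
import faulthandler
import sys
import tempfile
import time


def run_with_hang_watchdog(step_fn, timeout_s, log_file=sys.stderr):
    """Run step_fn(); while it runs longer than timeout_s seconds, dump the
    Python stack of every thread to log_file every timeout_s seconds.

    log_file must be a real file (it needs a file descriptor).
    The process is not killed; the dump is diagnostic only.
    """
    faulthandler.dump_traceback_later(timeout_s, repeat=True, file=log_file)
    try:
        return step_fn()
    finally:
        # Always disarm the watchdog, even if step_fn raises.
        faulthandler.cancel_dump_traceback_later()


def slow_validation_step():
    # Hypothetical stand-in for the validation call that appears to freeze.
    time.sleep(0.5)
    return "done"


# Demo: the step takes 0.5 s, the watchdog fires after 0.1 s.
with tempfile.TemporaryFile(mode="w+") as log:
    result = run_with_hang_watchdog(slow_validation_step, timeout_s=0.1, log_file=log)
    log.seek(0)
    dump = log.read()

print(dump)
```

The dump only shows Python frames, so it can point at the Python call that never returns but not at what happens inside the TensorFlow runtime. Because the dump is written by a watchdog thread, it should still appear when the main thread is blocked and Ctrl-C has no effect.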
Hi @dimaxano,
OK, I think the problem is an initialization issue with the variables self.prev_f_pyr and self.prev_d_pyr in estimate_depth when the validation graph is built. I think the solution would be to use different variables for the train and test graphs.
I'll try to correct and test this carefully as soon as I have some slack time in the next few days.