stereo-magnification
Training code seems to get stuck even for a very small set of input images
Thanks very much for sharing the code.
To do a quick test of the training code, I downloaded a few of the YouTube clips from the RealEstate10K dataset and placed the extracted frames in the `stereo-magnification\images` directory. The corresponding camera files are in the `stereo-magnification\train` directory.
However, when I try to execute `train.py`, the program doesn't proceed any further than the `session.run()` call (I think). I'm copy-pasting the log below (please note that I've removed some of the warning messages related to deprecated functions). I don't see any progress after the line `INFO:tensorflow:parameter_count = 16892227`, even after waiting for over 10 hours. Since I placed just a few (around 25) low-resolution images in the `images` directory, I was expecting the training to finish within a few hours.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Starting standard services.
INFO:tensorflow:Starting queue runners.
INFO:tensorflow:Trainable variables:
INFO:tensorflow:net/conv1_1/weights:0
INFO:tensorflow:net/conv1_1/LayerNorm/beta:0
INFO:tensorflow:net/conv1_1/LayerNorm/gamma:0
INFO:tensorflow:net/conv1_2/weights:0
INFO:tensorflow:net/conv1_2/LayerNorm/beta:0
INFO:tensorflow:net/conv1_2/LayerNorm/gamma:0
INFO:tensorflow:net/conv2_1/weights:0
INFO:tensorflow:net/conv2_1/LayerNorm/beta:0
INFO:tensorflow:net/conv2_1/LayerNorm/gamma:0
INFO:tensorflow:net/conv2_2/weights:0
INFO:tensorflow:net/conv2_2/LayerNorm/beta:0
INFO:tensorflow:net/conv2_2/LayerNorm/gamma:0
INFO:tensorflow:net/conv3_1/weights:0
INFO:tensorflow:net/conv3_1/LayerNorm/beta:0
INFO:tensorflow:net/conv3_1/LayerNorm/gamma:0
INFO:tensorflow:net/conv3_2/weights:0
INFO:tensorflow:net/conv3_2/LayerNorm/beta:0
INFO:tensorflow:net/conv3_2/LayerNorm/gamma:0
INFO:tensorflow:net/conv3_3/weights:0
INFO:tensorflow:net/conv3_3/LayerNorm/beta:0
INFO:tensorflow:net/conv3_3/LayerNorm/gamma:0
INFO:tensorflow:net/conv4_1/weights:0
INFO:tensorflow:net/conv4_1/LayerNorm/beta:0
INFO:tensorflow:net/conv4_1/LayerNorm/gamma:0
INFO:tensorflow:net/conv4_2/weights:0
INFO:tensorflow:net/conv4_2/LayerNorm/beta:0
INFO:tensorflow:net/conv4_2/LayerNorm/gamma:0
INFO:tensorflow:net/conv4_3/weights:0
INFO:tensorflow:net/conv4_3/LayerNorm/beta:0
INFO:tensorflow:net/conv4_3/LayerNorm/gamma:0
INFO:tensorflow:net/conv6_1/weights:0
INFO:tensorflow:net/conv6_1/LayerNorm/beta:0
INFO:tensorflow:net/conv6_1/LayerNorm/gamma:0
INFO:tensorflow:net/conv6_2/weights:0
INFO:tensorflow:net/conv6_2/LayerNorm/beta:0
INFO:tensorflow:net/conv6_2/LayerNorm/gamma:0
INFO:tensorflow:net/conv6_3/weights:0
INFO:tensorflow:net/conv6_3/LayerNorm/beta:0
INFO:tensorflow:net/conv6_3/LayerNorm/gamma:0
INFO:tensorflow:net/conv7_1/weights:0
INFO:tensorflow:net/conv7_1/LayerNorm/beta:0
INFO:tensorflow:net/conv7_1/LayerNorm/gamma:0
INFO:tensorflow:net/conv7_2/weights:0
INFO:tensorflow:net/conv7_2/LayerNorm/beta:0
INFO:tensorflow:net/conv7_2/LayerNorm/gamma:0
INFO:tensorflow:net/conv8_1/weights:0
INFO:tensorflow:net/conv8_1/LayerNorm/beta:0
INFO:tensorflow:net/conv8_1/LayerNorm/gamma:0
INFO:tensorflow:net/conv8_2/weights:0
INFO:tensorflow:net/conv8_2/LayerNorm/beta:0
INFO:tensorflow:net/conv8_2/LayerNorm/gamma:0
INFO:tensorflow:net/color_pred/weights:0
INFO:tensorflow:net/color_pred/biases:0
INFO:tensorflow:parameter_count = 16892227
My system configuration is as follows: OS: Ubuntu 19.04; Python: 2.7; TensorFlow version: 1.13.1. GPU information:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.50 Driver Version: 430.50 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Quadro RTX 4000 Off | 00000000:02:00.0 On | N/A |
| 30% 39C P8 12W / 125W | 7678MiB / 7977MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 Quadro P4000 Off | 00000000:03:00.0 Off | N/A |
| 46% 33C P8 5W / 105W | 91MiB / 8119MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 17066 G /usr/lib/xorg/Xorg 313MiB |
| 0 31117 C python 7353MiB |
| 1 31117 C python 79MiB |
+-----------------------------------------------------------------------------+
It would be great if you could provide some insight for solving this issue.
Thank you very much.
It sounds like the input pipeline might be running forever but not producing any data. Since it's not much data, can you show us exactly what your directory structure and files look like with "ls -R"?
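If a full listing is too long to post, a per-directory frame count can be a compact substitute. The following is a minimal sketch, not part of the repository: the helper name `count_frames` is hypothetical, and it assumes the frames are stored as `.jpg` files in one subdirectory per clip under an `images` directory.

```python
import os


def count_frames(images_root):
    """Return {subdirectory name: number of .jpg files directly inside it}."""
    counts = {}
    for name in sorted(os.listdir(images_root)):
        seq_dir = os.path.join(images_root, name)
        if os.path.isdir(seq_dir):
            counts[name] = sum(
                1 for f in os.listdir(seq_dir) if f.lower().endswith(".jpg")
            )
    return counts


if __name__ == "__main__":
    root = "images"  # assumed location of the extracted frames
    if os.path.isdir(root):
        for name, n in count_frames(root).items():
            print("%s: %d frames" % (name, n))
```

A sequence directory that reports very few frames is a candidate for being skipped by the input pipeline.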
@reyet Thank you so much for your reply. Please see the directory tree structure below. I have removed some unnecessary parts of the tree to keep it concise. Please note that in this example I used just 5 camera specification files (in the train directory).
indranil@root3563:~/stereo_magnification$ tree
.
├── checkpoints
├── CONTRIBUTING.md
├── evaluate.py
├── examples
├── geometry
├── images
│ ├── Eh6a2OB-xAg
│ │ ├── Eh6a2OB-xAg_158992000.jpg
│ │ ├── Eh6a2OB-xAg_159025000.jpg
│ │ ├── Eh6a2OB-xAg_159059000.jpg
│ │ ├── Eh6a2OB-xAg_159092000.jpg
│ │ ├── Eh6a2OB-xAg_159125000.jpg
│ │ ├── Eh6a2OB-xAg_159159000.jpg
│ │ ├── Eh6a2OB-xAg_159192000.jpg
│ │ ├── Eh6a2OB-xAg_159225000.jpg
│ │ ├── Eh6a2OB-xAg_159259000.jpg
│ │ ├── Eh6a2OB-xAg_159292000.jpg
│ │ ├── Eh6a2OB-xAg_159326000.jpg
│ │ ├── Eh6a2OB-xAg_159359000.jpg
│ │ ├── Eh6a2OB-xAg_159392000.jpg
│ │ ├── Eh6a2OB-xAg_159426000.jpg
│ │ ├── Eh6a2OB-xAg_159459000.jpg
│ │ ├── Eh6a2OB-xAg_159492000.jpg
│ │ ├── Eh6a2OB-xAg_159526000.jpg
│ │ ├── Eh6a2OB-xAg_159559000.jpg
│ │ ├── Eh6a2OB-xAg_159592000.jpg
│ │ ├── Eh6a2OB-xAg_159626000.jpg
│ │ ├── Eh6a2OB-xAg_159659000.jpg
│ │ ├── Eh6a2OB-xAg_159693000.jpg
│ │ ├── Eh6a2OB-xAg_159726000.jpg
│ │ ├── Eh6a2OB-xAg_159759000.jpg
│ │ └── Eh6a2OB-xAg_159793000.jpg
│ ├── f7o82npo-Ww
│ │ ├── f7o82npo-Ww_42076000.jpg
│ │ ├── f7o82npo-Ww_42109000.jpg
│ │ ├── f7o82npo-Ww_42142000.jpg
│ │ ├── f7o82npo-Ww_42176000.jpg
│ │ ├── f7o82npo-Ww_42209000.jpg
│ │ ├── f7o82npo-Ww_42243000.jpg
│ │ ├── f7o82npo-Ww_42276000.jpg
│ │ ├── f7o82npo-Ww_42343000.jpg
│ │ ├── f7o82npo-Ww_42376000.jpg
│ │ ├── f7o82npo-Ww_42409000.jpg
│ │ ├── f7o82npo-Ww_42443000.jpg
│ │ ├── f7o82npo-Ww_42476000.jpg
│ │ ├── f7o82npo-Ww_42509000.jpg
│ │ ├── f7o82npo-Ww_42543000.jpg
│ │ ├── f7o82npo-Ww_42576000.jpg
│ │ ├── f7o82npo-Ww_42610000.jpg
│ │ ├── f7o82npo-Ww_42643000.jpg
│ │ ├── f7o82npo-Ww_42676000.jpg
│ │ ├── f7o82npo-Ww_42742000.jpg
│ │ ├── f7o82npo-Ww_42776000.jpg
│ │ ├── f7o82npo-Ww_42809000.jpg
│ │ ├── f7o82npo-Ww_42842000.jpg
│ │ ├── f7o82npo-Ww_42876000.jpg
│ │ ├── f7o82npo-Ww_42909000.jpg
│ │ ├── f7o82npo-Ww_42943000.jpg
│ │ ├── f7o82npo-Ww_42976000.jpg
│ │ ├── f7o82npo-Ww_43009000.jpg
│ │ ├── f7o82npo-Ww_43043000.jpg
│ │ ├── f7o82npo-Ww_43076000.jpg
│ │ ├── f7o82npo-Ww_43109000.jpg
│ │ ├── f7o82npo-Ww_43143000.jpg
│ │ ├── f7o82npo-Ww_43176000.jpg
│ │ ├── f7o82npo-Ww_43210000.jpg
│ │ ├── f7o82npo-Ww_43243000.jpg
│ │ ├── f7o82npo-Ww_43276000.jpg
│ │ ├── f7o82npo-Ww_43310000.jpg
│ │ ├── f7o82npo-Ww_43343000.jpg
│ │ └── f7o82npo-Ww_43376000.jpg
│ ├── GclE7CWkz1s
│ │ ├── GclE7CWkz1s_150300000.jpg
│ │ ├── GclE7CWkz1s_150333333.jpg
│ │ ├── GclE7CWkz1s_150366667.jpg
│ │ ├── GclE7CWkz1s_150400000.jpg
│ │ ├── GclE7CWkz1s_150433333.jpg
│ │ ├── GclE7CWkz1s_150466667.jpg
│ │ ├── GclE7CWkz1s_150500000.jpg
│ │ ├── GclE7CWkz1s_150533333.jpg
│ │ ├── GclE7CWkz1s_150566667.jpg
│ │ ├── GclE7CWkz1s_150600000.jpg
│ │ ├── GclE7CWkz1s_150633333.jpg
│ │ ├── GclE7CWkz1s_150666667.jpg
│ │ ├── GclE7CWkz1s_150700000.jpg
│ │ ├── GclE7CWkz1s_150733333.jpg
│ │ ├── GclE7CWkz1s_150766667.jpg
│ │ ├── GclE7CWkz1s_150800000.jpg
│ │ ├── GclE7CWkz1s_150833333.jpg
│ │ ├── GclE7CWkz1s_150866667.jpg
│ │ ├── GclE7CWkz1s_150900000.jpg
│ │ ├── GclE7CWkz1s_150933333.jpg
│ │ ├── GclE7CWkz1s_150966667.jpg
│ │ ├── GclE7CWkz1s_151000000.jpg
│ │ ├── GclE7CWkz1s_151033333.jpg
│ │ ├── GclE7CWkz1s_151066667.jpg
│ │ ├── GclE7CWkz1s_151100000.jpg
│ │ ├── GclE7CWkz1s_151133333.jpg
│ │ ├── GclE7CWkz1s_151166667.jpg
│ │ ├── GclE7CWkz1s_151200000.jpg
│ │ ├── GclE7CWkz1s_151233333.jpg
│ │ ├── GclE7CWkz1s_151266667.jpg
│ │ ├── GclE7CWkz1s_151300000.jpg
│ │ ├── GclE7CWkz1s_151333333.jpg
│ │ ├── GclE7CWkz1s_151366667.jpg
│ │ ├── GclE7CWkz1s_151400000.jpg
│ │ ├── GclE7CWkz1s_151433333.jpg
│ │ ├── GclE7CWkz1s_151466667.jpg
│ │ ├── GclE7CWkz1s_151500000.jpg
│ │ ├── GclE7CWkz1s_151533333.jpg
│ │ ├── GclE7CWkz1s_151566667.jpg
│ │ ├── GclE7CWkz1s_151600000.jpg
│ │ └── GclE7CWkz1s_151633333.jpg
│ ├── OT04jHhqYyw
│ │ ├── OT04jHhqYyw_110133333.jpg
│ │ ├── OT04jHhqYyw_110166667.jpg
│ │ ├── OT04jHhqYyw_110200000.jpg
│ │ ├── OT04jHhqYyw_110233333.jpg
│ │ ├── OT04jHhqYyw_110266667.jpg
│ │ ├── OT04jHhqYyw_110300000.jpg
│ │ ├── OT04jHhqYyw_110333333.jpg
│ │ ├── OT04jHhqYyw_110366667.jpg
│ │ ├── OT04jHhqYyw_110400000.jpg
│ │ ├── OT04jHhqYyw_110433333.jpg
│ │ ├── OT04jHhqYyw_110466667.jpg
│ │ ├── OT04jHhqYyw_110500000.jpg
│ │ ├── OT04jHhqYyw_110533333.jpg
│ │ ├── OT04jHhqYyw_110566667.jpg
│ │ ├── OT04jHhqYyw_110600000.jpg
│ │ ├── OT04jHhqYyw_110633333.jpg
│ │ └── OT04jHhqYyw_110666667.jpg
│ └── xTOs9uW6_bo
│ ├── xTOs9uW6_bo_86333333.jpg
│ ├── xTOs9uW6_bo_86366667.jpg
│ ├── xTOs9uW6_bo_86400000.jpg
│ ├── xTOs9uW6_bo_86433333.jpg
│ ├── xTOs9uW6_bo_86466667.jpg
│ ├── xTOs9uW6_bo_86500000.jpg
│ ├── xTOs9uW6_bo_86533333.jpg
│ ├── xTOs9uW6_bo_86566667.jpg
│ ├── xTOs9uW6_bo_86600000.jpg
│ ├── xTOs9uW6_bo_86633333.jpg
│ ├── xTOs9uW6_bo_86666667.jpg
│ ├── xTOs9uW6_bo_86700000.jpg
│ ├── xTOs9uW6_bo_86733333.jpg
│ ├── xTOs9uW6_bo_86766667.jpg
│ ├── xTOs9uW6_bo_86800000.jpg
│ ├── xTOs9uW6_bo_86833333.jpg
│ ├── xTOs9uW6_bo_86866667.jpg
│ ├── xTOs9uW6_bo_86900000.jpg
│ ├── xTOs9uW6_bo_86933333.jpg
│ ├── xTOs9uW6_bo_86966667.jpg
│ ├── xTOs9uW6_bo_87000000.jpg
│ ├── xTOs9uW6_bo_87033333.jpg
│ ├── xTOs9uW6_bo_87066667.jpg
│ ├── xTOs9uW6_bo_87100000.jpg
│ ├── xTOs9uW6_bo_87133333.jpg
│ └── xTOs9uW6_bo_87166667.jpg
├── __init__.py
├── LICENSE
├── models
├── mpi_from_images.py
├── README.md
├── scripts
├── stereomag
├── test.py
├── third_party
├── train
│ ├── 0a0a998c176713fd.txt
│ ├── 0a16a992457df4a8.txt
│ ├── 0a1a4430d0061081.txt
│ ├── 0a6d080826d442d9.txt
│ └── 0a7d51ed7990aefd.txt
├── train.py
Best regards, Indranil.
Hi @reyet, did you get a chance to take a look at the directory structure? Do you think there is a problem with it? Thanks very much in advance.
Hi @PuneetKohli , @olivertai, @lolz0r, If you have been able to run the training part, could you kindly suggest anything based on your experience as to how I may be able to resolve this issue? Thanks very much for your time and help.
@indranilsinharoy did you find a solution for this? I am also having the same issue while training the network.
@bruce-wayne99 Unfortunately, I've not been able to solve it, and I have temporarily moved on to other things. If I had to try again with a different environment, I would try Ubuntu 18.x instead of Ubuntu 19.04 (not sure if you are using the same OS or not) and also try a different CUDA version ... just some thoughts.
Sorry for not replying sooner. I did take a look at your directory structure, and it looked correct to me so I'm afraid I don't know what is going wrong.
@reyet No problem at all. Thank you very much. I had guessed much the same :-) Once I get back to it I'll try some more things (mostly with the environment, I guess). If I do find the problem, I'll surely post it here.
@indranilsinharoy Thanks for the info. The issue was fixed after I changed the CUDA version to 9.0; I was using CUDA 10.0 before. My setup: OS: Ubuntu 16.04; CUDA: cudnn/7.6.4-cuda-9.0, cuda9.0; TensorFlow package: tensorflow-gpu==1.14.0
@bruce-wayne99 Thanks very much for posting your solution here. I hope it will help several others if they face similar problems. At least I know that is the first thing I must do!
@reyet, @bruce-wayne99 Please feel free to close the issue if you see fit.
Thanks @bruce-wayne99 for figuring that out!
@reyet @indranilsinharoy, I just went through the code, and I think the version is not the issue. If you look at the `loader.py` file, sequences are generated using a stride; by default `min_stride=3` and `max_stride=10`. Each time, a random number between `min_stride` and `max_stride` is generated to choose the stride, and then a subsequence of length 10 (`sequence_length=10`) is selected from a given sequence. Your sequence must therefore have a minimum length of `(sequence_length - 1) * stride + 1`; all sequences shorter than this are removed. Since the stride is a random number, let us take it to be `(min_stride + max_stride)//2`, which is `(3+10)//2 = 6`. Each sequence in your data should then contain at least `(10-1)*6 + 1 = 55` frames. If your sequences are shorter than this, they are not used as data, so maybe the data ends up empty and the code gets stuck. I was able to make it work by setting `max_stride=min_stride=1`. Other options are to reduce `sequence_length` or to use a bigger dataset (more frames in each sequence).
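The arithmetic in the comment above can be checked with a short sketch. This is an illustration only: the helper names `min_frames` and `usable_sequences` are hypothetical and do not come from `loader.py`, the length rule is the one described in the comment, and the frame counts are taken from the trimmed tree listing posted earlier in this thread, so they may not match what the loader actually sees.

```python
def min_frames(sequence_length, stride):
    """Shortest sequence that can hold `sequence_length` frames at `stride`."""
    return (sequence_length - 1) * stride + 1


def usable_sequences(frame_counts, sequence_length, stride):
    """Names of sequences long enough for the given stride, sorted by name."""
    need = min_frames(sequence_length, stride)
    return [name for name, n in sorted(frame_counts.items()) if n >= need]


# Frame counts as they appear in the posted tree listing (assumed complete).
frame_counts = {
    "Eh6a2OB-xAg": 25,
    "f7o82npo-Ww": 38,
    "GclE7CWkz1s": 41,
    "OT04jHhqYyw": 17,
    "xTOs9uW6_bo": 26,
}

if __name__ == "__main__":
    for stride in (1, 3, 6, 10):
        need = min_frames(10, stride)
        kept = usable_sequences(frame_counts, 10, stride)
        print("stride=%d needs %d frames, keeps %d sequence(s): %s"
              % (stride, need, len(kept), kept))
```

Under these assumptions, only two of the five sequences are long enough at the default minimum stride of 3, and none are at a stride of 6, which would be consistent with the pipeline starving for data.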
@bruce-wayne99 Thanks very much. I'll check it out.
Hi @indranilsinharoy, I just want to ask how you managed to prepare the RealEstate10K dataset for training. Are the txt files in the train folder the same as the txt files downloaded from the dataset? Also, the images in each folder are the extracted files for each scene ID, right? Did you perform any pre-processing on them?