RelationNetworks-CLEVR: the logfile does not show any runs for the test set. The plots also show nothing for the test set or for accuracy.

When I run the code, I get the following output:

(rn_env) exx@ubuntu:/data/Rudra/RelationNetworks-CLEVR$ python                          
Python 3.6.6 (default, Jun 28 2018, 00:00:00)                                         
[GCC 4.8.4] on linux                                             
Type "help", "copyright", "credits" or "license" for more information.                 
>>> import torch                                                   
>>> exit()                                                                     
(rn_env) exx@ubuntu:/data/Rudra/RelationNetworks-CLEVR$ pyton -m train --clevr-dir /data/DATASETS/CLEVR_v1.0/ --model 'original-fp' | tee logfile.log
No command 'pyton' found, did you mean:                           
 Command 'python' from package 'python-minimal' (main)                                                                                                                                                             
 Command 'pytone' from package 'pytone' (universe)                    
pyton: command not found                                           
(rn_env) exx@ubuntu:/data/Rudra/RelationNetworks-CLEVR$ python -m train --clevr-dir /data/DATASETS/CLEVR_v1.0/ --model 'original-fp' | tee logfile.log                                                             
TRAIN:   0%|          | 0/350 [00:00<?, ?it/s]
Loaded hyperparameters from configuration config.json, model: original-fp: {'state_description': False, 'g_layers': [256, 256, 256, 256], 'question_injection_position': 0, 'f_fc1': 256, 'f_fc2': 256, 'dropout': 0.5, 'lstm_hidden': 128, 'lstm_word_emb': 32, 'rl_in_size': 52}
Building word dictionaries from all the words in the dataset...                                   
==> using cached dictionaries: /data/DATASETS/CLEVR_v1.0/questions/CLEVR_built_dictionaries.pkl
Word dictionary completed!                                                                                                                                                                                         
Initializing CLEVR dataset...
==> using cached questions: /data/DATASETS/CLEVR_v1.0/questions/CLEVR_train_questions.pkl
==> using cached questions: /data/DATASETS/CLEVR_v1.0/questions/CLEVR_val_questions.pkl
CLEVR dataset initialized!
Supposing original DeepMind model
Training (350 epochs) is starting...
Dataset reinitialized with batch size 640
Current learning rate: 1e-05
██████████▊| 1093/1094 [11:21:28<00:37, 37.41s/it, loss=1.92]
Traceback (most recent call last):
  File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/data/Rudra/RelationNetworks-CLEVR/train.py", line 418, in <module>
    main(args)
  File "/data/Rudra/RelationNetworks-CLEVR/train.py", line 356, in main
    train(clevr_train_loader, model, optimizer, epoch, args)
  File "/data/Rudra/RelationNetworks-CLEVR/train.py", line 40, in train
    output = model(img, qst)
  File "/data/Rudra/virtualenvs/rn_env/lib/python3.6/site-packages/torch/nn/modules/module.py", line 357, in __call__
    result = self.forward(*input, **kwargs)
  File "/data/Rudra/RelationNetworks-CLEVR/model.py", line 200, in forward
    x = torch.cat([x, self.coord_tensor], 1)    # (B x 24+2 x 8*8)
RuntimeError: invalid argument 0: Sizes of tensors must match except in dimension 1. Got 469 and 640 in dimension 0 at /pytorch/torch/lib/TH/generic/THTensorMath.c:2897
Train Epoch: 1 [0/700160 (0%)] Train loss: 39.945804595947266
Train Epoch: 1 [6400/700160 (1%)] Train loss: 36.57775611877442
Train Epoch: 1 [12800/700160 (2%)] Train loss: 29.848896408081053
Train Epoch: 1 [19200/700160 (3%)] Train loss: 24.984291648864748
Train Epoch: 1 [25600/700160 (4%)] Train loss: 20.945134353637695
.
.
.
Train Epoch: 1 [684800/700160 (98%)] Train loss: 1.8508247494697572
Train Epoch: 1 [691200/700160 (99%)] Train loss: 1.8768051743507386
Train Epoch: 1 [697600/700160 (100%)] Train loss: 1.8581566572189332

(rn_env) exx@ubuntu:/data/Rudra/RelationNetworks-CLEVR$

I have also attached my logfile. When I run the plot function, every plot is empty except the one for training loss. Please let me know where the issue might be. Thanks.
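For reference, the numbers in the output above can be cross-checked with a little arithmetic. The sketch below assumes that only the final batch is short and that the 469 in the RuntimeError is the size of that batch; the implied sample count is an inference from the log, not something the script prints. If the test pass only runs after a training epoch completes (also an assumption), a crash on batch 1094 of epoch 1 would leave the logfile with training lines only.

```python
import math

batch_size = 640    # from "Dataset reinitialized with batch size 640"
num_batches = 1094  # total shown by the progress bar (1093/1094)
last_batch = 469    # batch dimension reported in the RuntimeError

# The denominator in the "Train Epoch" lines equals batches * batch size.
logged_total = num_batches * batch_size

# Implied number of samples if only the final batch is short (assumption).
implied_samples = (num_batches - 1) * batch_size + last_batch

print(logged_total)                              # 700160
print(implied_samples)                           # 699989
print(math.ceil(implied_samples / batch_size))   # 1094
print(implied_samples % batch_size)              # 469
```

So the 700160 in the log looks like a rounded-up count, and the failing batch is consistent with being the last, partial one.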

logfile.log

Oct 05 '18 21:10 saharudra

Hi @saharudra, this is probably caused by how batches are handled in the multi-GPU setup. You should be able to run the code by simply removing the condition (the entire line): https://github.com/mesnico/RelationNetworks-CLEVR/blob/b8e0e7af12408877c8a18d8f2802d88138605983/model.py#L196 This is not the most efficient solution; however, if that turns out to be the problem, I will fix it permanently as soon as possible with a better approach. Thanks!
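A minimal sketch of the failure mode, assuming the coordinate tensor is cached at the first batch size seen and then reused: this is not the repository's code, and `build_coords` and `CoordAppender` are hypothetical names. The shapes (24 feature channels, an 8x8 grid) come from the comment in the traceback; the coordinate layout is an assumption. Rebuilding the cache whenever the batch size changes is one possible alternative to removing the condition.

```python
import torch

def build_coords(batch_size, d=8):
    """Return a (batch_size, 2, d*d) tensor of x/y grid coordinates in [-1, 1]."""
    lin = torch.linspace(-1.0, 1.0, d)
    xs = lin.view(1, d).expand(d, d).reshape(1, 1, d * d)  # x varies along a row
    ys = lin.view(d, 1).expand(d, d).reshape(1, 1, d * d)  # y varies along a column
    grid = torch.cat([xs, ys], 1)                          # (1, 2, d*d)
    return grid.expand(batch_size, 2, d * d)

class CoordAppender:
    """Hypothetical helper: appends coordinates, rebuilding the cache on size change."""
    def __init__(self):
        self.coord_tensor = None

    def __call__(self, x):
        b = x.size(0)
        # Rebuild when nothing is cached or the cached batch size is stale.
        if self.coord_tensor is None or self.coord_tensor.size(0) != b:
            self.coord_tensor = build_coords(b).to(x.device)
        return torch.cat([x, self.coord_tensor], 1)        # (B, 24+2, 8*8)

# A cache built for 640 samples cannot be concatenated with a batch of 469.
stale = build_coords(640)
try:
    torch.cat([torch.zeros(469, 24, 64), stale], 1)
    stale_failed = False
except RuntimeError:
    stale_failed = True
print(stale_failed)                                        # True

append = CoordAppender()
print(tuple(append(torch.zeros(640, 24, 64)).shape))       # (640, 26, 64)
print(tuple(append(torch.zeros(469, 24, 64)).shape))       # (469, 26, 64)
```

Checking the cached size costs one comparison per forward pass and rebuilds the tensor only when a differently sized batch arrives.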

Oct 09 '18 13:10 mesnico

Hi @mesnico, I will give this a try and let you know the outcome here. Thanks!

Oct 09 '18 18:10 saharudra
