
Issues with Data Parallel

Open mkhoshle opened this issue 3 years ago • 3 comments

Hi,

I am trying to train the network starting from hrnet_pretrain.py. When I run the code on a node with 4 GPUs it runs totally fine, but when I run it on a node with 8 GPUs I get the following error at the end of the first epoch, during evaluation. I tested this with a val_batch_size of 16 and a batch_size of 16:

INFO:root:------------------------------------------------------------------
INFO:root:Loading pytorch3d renderer as visualizer
INFO:root:start building model.
INFO:root:using fine_tune model:
WARNING:root:model  not exist!
INFO:root:Train all layers, except: ['_result_parser.params_map_parser.smpl_model.betas']
INFO:root:finished build model.
INFO:root:gathering datasets
INFO:root:Start loading 3DPW data.
INFO:root:Loading 3DPW in VIBE mode, split val
INFO:root:3DPW dataset val split total 1661 samples, loading mode vibe
INFO:root:gathering datasets
INFO:root:Start loading 3DPW data.
INFO:root:Loading 3DPW in VIBE mode, split test
INFO:root:3DPW dataset test split total 24423 samples, loading mode vibe
INFO:root:dataset_val_list:['pw3d']
INFO:root:evaluation_results_dict:['pw3d']
INFO:root:gathering datasets
INFO:root:CrowdHuman 2D detection data has been loaded, total 15000 samples
INFO:root:Crowdpose 2D keypoint data has been loaded, total 9963 samples
INFO:root:COCO 2D keypoint data has been loaded, total 26455 samples
INFO:root:Loaded MPII data total 9831 samples
INFO:root:LSP dataset total 6829 samples
INFO:root:All dataset length: 132275
INFO:root:Initialization of Trainer finished!
yaml_timestamp  /z/home/mkhoshle/ROMP-old/active_configs/active_context_2022-08-12_22_13_01.yaml
/z/home/mkhoshle/ROMP-old/romp/lib/config.py
visualize in gpu mode
Using ROMP v1
Confidence: 0.2
+-----------------+----------------+---------------+---------------+------------+-------+
|                 |   crowdhuman   |   crowdpose   |      coco     |    mpii    |  lsp  |
+-----------------+----------------+---------------+---------------+------------+-------+
|      Length     |     15000      |      9963     |     26455     |    9831    |  6829 |
|   Sample Prob.  |      0.2       |      0.24     |      0.2      |    0.2     |  0.16 |
| Expected length |     75000      |     41512     |     132275    |   49155    | 42681 |
|   Accum. Prob.  | 0.199951171875 | 0.43994140625 | 0.64013671875 | 0.83984375 |  1.0  |
|    Accum. ID.   |       0        |       0       |       0       |     0      |   0   |
+-----------------+----------------+---------------+---------------+------------+-------+
INFO:root:start training
INFO:root:Training all layers.
epoch hrnet_cm64_V1_resnet 150
0 epoch

INFO:root:evaluation result on 0 iters:
INFO:root:Evaluation on pw3d
Epoch: [0][50/8267] Time 0.13 RUN 8.36 Lr 5e-05 Loss 2247.28 | Losses {'reg': 1247.28, 'det': 1716572.03, 'CenterMap': 1716572.03, 'P_KP2D': 388.0, 'MPJPE': 59.55, 'PAMPJPE': 145.88, 'Pose': 308.97, 'Shape': 1.17, 'Prior': 343.71}
Epoch: [0][100/8267] Time 0.00 RUN 7.80 Lr 5e-05 Loss 2078.91 | Losses {'reg': 1078.91, 'det': 3071.06, 'CenterMap': 3071.06, 'P_KP2D': 296.1, 'MPJPE': 55.95, 'PAMPJPE': 140.86, 'Pose': 289.31, 'Shape': 1.04, 'Prior': 295.64}
Epoch: [0][150/8267] Time 0.00 RUN 7.68 Lr 5e-05 Loss 1905.13 | Losses {'reg': 912.44, 'det': 1666.48, 'CenterMap': 1666.48, 'P_KP2D': 249.71, 'MPJPE': 48.74, 'PAMPJPE': 135.47, 'Pose': 253.73, 'Shape': 0.82, 'Prior': 223.96}
Epoch: [0][200/8267] Time 0.00 RUN 7.62 Lr 5e-05 Loss 1740.07 | Losses {'reg': 779.42, 'det': 1301.84, 'CenterMap': 1301.84, 'P_KP2D': 221.88, 'MPJPE': 43.34, 'PAMPJPE': 126.45, 'Pose': 230.39, 'Shape': 1.08, 'Prior': 156.28}
Epoch: [0][250/8267] Time 0.00 RUN 7.54 Lr 5e-05 Loss 1570.88 | Losses {'reg': 700.65, 'det': 925.07, 'CenterMap': 925.07, 'P_KP2D': 209.66, 'MPJPE': 42.81, 'PAMPJPE': 119.39, 'Pose': 216.76, 'Shape': 1.11, 'Prior': 110.9}
Epoch: [0][300/8267] Time 0.00 RUN 7.49 Lr 5e-05 Loss 1316.82 | Losses {'reg': 607.69, 'det': 719.95, 'CenterMap': 719.95, 'P_KP2D': 192.6, 'MPJPE': 37.45, 'PAMPJPE': 107.25, 'Pose': 191.35, 'Shape': 1.09, 'Prior': 77.95}
Epoch: [0][350/8267] Time 0.00 RUN 7.36 Lr 5e-05 Loss 1238.27 | Losses {'reg': 569.09, 'det': 676.29, 'CenterMap': 676.29, 'P_KP2D': 186.87, 'MPJPE': 31.75, 'PAMPJPE': 101.83, 'Pose': 185.32, 'Shape': 1.01, 'Prior': 62.31}
Epoch: [0][400/8267] Time 0.01 RUN 7.18 Lr 5e-05 Loss 1166.59 | Losses {'reg': 540.77, 'det': 629.41, 'CenterMap': 629.41, 'P_KP2D': 188.37, 'MPJPE': 30.73, 'PAMPJPE': 94.72, 'Pose': 177.78, 'Shape': 0.73, 'Prior': 48.44}
Epoch: [0][450/8267] Time 0.00 RUN 7.14 Lr 5e-05 Loss 1103.00 | Losses {'reg': 512.08, 'det': 595.68, 'CenterMap': 595.68, 'P_KP2D': 180.94, 'MPJPE': 27.43, 'PAMPJPE': 91.68, 'Pose': 168.15, 'Shape': 0.81, 'Prior': 43.08}
Epoch: [0][500/8267] Time 0.00 RUN 6.97 Lr 5e-05 Loss 1055.72 | Losses {'reg': 493.48, 'det': 562.24, 'CenterMap': 562.24, 'P_KP2D': 180.8, 'MPJPE': 24.87, 'PAMPJPE': 89.57, 'Pose': 157.76, 'Shape': 0.63, 'Prior': 39.85}
Epoch: [0][550/8267] Time 0.00 RUN 6.84 Lr 5e-05 Loss 1150.54 | Losses {'reg': 506.72, 'det': 704.26, 'CenterMap': 704.26, 'P_KP2D': 193.32, 'MPJPE': 25.85, 'PAMPJPE': 89.1, 'Pose': 160.67, 'Shape': 0.59, 'Prior': 37.18}
Epoch: [0][600/8267] Time 0.00 RUN 6.77 Lr 5e-05 Loss 1199.92 | Losses {'reg': 478.42, 'det': 769.55, 'CenterMap': 769.55, 'P_KP2D': 190.02, 'MPJPE': 26.57, 'PAMPJPE': 83.23, 'Pose': 145.25, 'Shape': 0.42, 'Prior': 32.94}
Epoch: [0][650/8267] Time 0.11 RUN 6.71 Lr 5e-05 Loss 1135.04 | Losses {'reg': 475.29, 'det': 719.67, 'CenterMap': 719.67, 'P_KP2D': 167.19, 'MPJPE': 26.2, 'PAMPJPE': 87.72, 'Pose': 159.28, 'Shape': 0.39, 'Prior': 34.5}
Epoch: [0][700/8267] Time 0.18 RUN 6.61 Lr 5e-05 Loss 1014.06 | Losses {'reg': 467.43, 'det': 554.8, 'CenterMap': 554.8, 'P_KP2D': 173.44, 'MPJPE': 19.27, 'PAMPJPE': 81.84, 'Pose': 149.95, 'Shape': 0.38, 'Prior': 42.55}
Epoch: [0][750/8267] Time 0.00 RUN 6.46 Lr 5e-05 Loss 989.16 | Losses {'reg': 463.46, 'det': 525.7, 'CenterMap': 525.7, 'P_KP2D': 177.85, 'MPJPE': 12.19, 'PAMPJPE': 81.6, 'Pose': 144.24, 'Shape': 0.24, 'Prior': 47.33}
Epoch: [0][800/8267] Time 0.00 RUN 6.37 Lr 5e-05 Loss 898.42 | Losses {'reg': 433.02, 'det': 465.4, 'CenterMap': 465.4, 'P_KP2D': 165.4, 'MPJPE': 10.45, 'PAMPJPE': 82.66, 'Pose': 139.56, 'Shape': 0.25, 'Prior': 34.7}
Epoch: [0][850/8267] Time 0.00 RUN 6.25 Lr 5e-05 Loss 900.65 | Losses {'reg': 464.36, 'det': 436.3, 'CenterMap': 436.3, 'P_KP2D': 183.43, 'MPJPE': 12.27, 'PAMPJPE': 84.78, 'Pose': 152.17, 'Shape': 0.21, 'Prior': 31.49}
Epoch: [0][900/8267] Time 0.00 RUN 6.15 Lr 5e-05 Loss 898.77 | Losses {'reg': 445.93, 'det': 452.84, 'CenterMap': 452.84, 'P_KP2D': 176.81, 'MPJPE': 14.85, 'PAMPJPE': 81.31, 'Pose': 147.72, 'Shape': 0.29, 'Prior': 24.96}
Epoch: [0][950/8267] Time 0.00 RUN 6.04 Lr 5e-05 Loss 868.26 | Losses {'reg': 438.97, 'det': 429.28, 'CenterMap': 429.28, 'P_KP2D': 172.96, 'MPJPE': 10.21, 'PAMPJPE': 82.15, 'Pose': 148.38, 'Shape': 0.29, 'Prior': 24.99}
Epoch: [0][1000/8267] Time 0.00 RUN 5.99 Lr 5e-05 Loss 837.78 | Losses {'reg': 405.88, 'det': 431.9, 'CenterMap': 431.9, 'P_KP2D': 171.49, 'MPJPE': 6.9, 'PAMPJPE': 80.47, 'Pose': 126.38, 'Shape': 0.18, 'Prior': 20.47}
Epoch: [0][1050/8267] Time 0.00 RUN 5.88 Lr 5e-05 Loss 849.21 | Losses {'reg': 421.76, 'det': 427.44, 'CenterMap': 427.44, 'P_KP2D': 181.92, 'MPJPE': 6.99, 'PAMPJPE': 80.41, 'Pose': 129.74, 'Shape': 0.16, 'Prior': 22.55}
Epoch: [0][1100/8267] Time 0.00 RUN 5.80 Lr 5e-05 Loss 833.58 | Losses {'reg': 401.17, 'det': 432.41, 'CenterMap': 432.41, 'P_KP2D': 171.4, 'MPJPE': 7.64, 'PAMPJPE': 76.53, 'Pose': 125.72, 'Shape': 0.18, 'Prior': 19.7}
Epoch: [0][1150/8267] Time 0.00 RUN 5.72 Lr 5e-05 Loss 818.64 | Losses {'reg': 396.52, 'det': 422.12, 'CenterMap': 422.12, 'P_KP2D': 173.61, 'MPJPE': 7.98, 'PAMPJPE': 77.98, 'Pose': 117.12, 'Shape': 0.34, 'Prior': 19.49}
Epoch: [0][1200/8267] Time 0.00 RUN 5.65 Lr 5e-05 Loss 833.43 | Losses {'reg': 407.27, 'det': 426.16, 'CenterMap': 426.16, 'P_KP2D': 175.89, 'MPJPE': 7.49, 'PAMPJPE': 81.3, 'Pose': 124.43, 'Shape': 0.23, 'Prior': 17.93}
Epoch: [0][1250/8267] Time 0.01 RUN 5.57 Lr 5e-05 Loss 807.46 | Losses {'reg': 391.34, 'det': 416.11, 'CenterMap': 416.11, 'P_KP2D': 173.32, 'MPJPE': 6.31, 'PAMPJPE': 80.11, 'Pose': 115.83, 'Shape': 0.07, 'Prior': 15.71}
Epoch: [0][1300/8267] Time 0.00 RUN 5.50 Lr 5e-05 Loss 798.61 | Losses {'reg': 376.55, 'det': 433.71, 'CenterMap': 433.71, 'P_KP2D': 166.42, 'MPJPE': 5.81, 'PAMPJPE': 80.0, 'Pose': 109.49, 'Shape': 0.1, 'Prior': 14.72}
Epoch: [0][1350/8267] Time 0.01 RUN 5.42 Lr 5e-05 Loss 782.21 | Losses {'reg': 367.99, 'det': 414.22, 'CenterMap': 414.22, 'P_KP2D': 164.64, 'MPJPE': 2.48, 'PAMPJPE': 74.81, 'Pose': 111.22, 'Shape': 0.04, 'Prior': 14.8}
Epoch: [0][1400/8267] Time 0.00 RUN 5.36 Lr 5e-05 Loss 780.90 | Losses {'reg': 375.85, 'det': 405.05, 'CenterMap': 405.05, 'P_KP2D': 168.36, 'MPJPE': 5.06, 'PAMPJPE': 78.04, 'Pose': 111.69, 'Shape': 0.12, 'Prior': 12.58}
Epoch: [0][1450/8267] Time 0.00 RUN 5.30 Lr 5e-05 Loss 760.84 | Losses {'reg': 370.86, 'det': 389.97, 'CenterMap': 389.97, 'P_KP2D': 169.66, 'MPJPE': 6.39, 'PAMPJPE': 73.19, 'Pose': 108.51, 'Shape': 0.2, 'Prior': 12.93}
Epoch: [0][1500/8267] Time 0.00 RUN 5.23 Lr 5e-05 Loss 784.04 | Losses {'reg': 384.01, 'det': 400.03, 'CenterMap': 400.03, 'P_KP2D': 169.96, 'MPJPE': 4.41, 'PAMPJPE': 75.87, 'Pose': 122.02, 'Shape': 0.12, 'Prior': 11.63}
Epoch: [0][1550/8267] Time 0.00 RUN 5.20 Lr 5e-05 Loss 763.32 | Losses {'reg': 371.05, 'det': 392.27, 'CenterMap': 392.27, 'P_KP2D': 172.4, 'MPJPE': 7.49, 'PAMPJPE': 72.31, 'Pose': 108.1, 'Shape': 0.15, 'Prior': 10.61}
Epoch: [0][1600/8267] Time 0.00 RUN 5.15 Lr 5e-05 Loss 735.87 | Losses {'reg': 342.51, 'det': 393.35, 'CenterMap': 393.35, 'P_KP2D': 158.11, 'MPJPE': 5.19, 'PAMPJPE': 67.91, 'Pose': 102.07, 'Shape': 0.13, 'Prior': 9.1}
Epoch: [0][1650/8267] Time 0.07 RUN 5.11 Lr 5e-05 Loss 733.45 | Losses {'reg': 344.86, 'det': 388.59, 'CenterMap': 388.59, 'P_KP2D': 161.34, 'MPJPE': 6.69, 'PAMPJPE': 71.88, 'Pose': 95.92, 'Shape': 0.14, 'Prior': 8.88}
Epoch: [0][1700/8267] Time 0.00 RUN 5.08 Lr 5e-05 Loss 753.90 | Losses {'reg': 365.54, 'det': 388.37, 'CenterMap': 388.37, 'P_KP2D': 168.25, 'MPJPE': 7.6, 'PAMPJPE': 74.03, 'Pose': 106.12, 'Shape': 0.17, 'Prior': 9.38}
Epoch: [0][1750/8267] Time 0.02 RUN 5.05 Lr 5e-05 Loss 738.18 | Losses {'reg': 354.19, 'det': 383.99, 'CenterMap': 383.99, 'P_KP2D': 161.62, 'MPJPE': 9.53, 'PAMPJPE': 70.1, 'Pose': 104.61, 'Shape': 0.21, 'Prior': 8.11}
Epoch: [0][1800/8267] Time 0.00 RUN 5.02 Lr 5e-05 Loss 752.35 | Losses {'reg': 357.05, 'det': 395.31, 'CenterMap': 395.31, 'P_KP2D': 169.51, 'MPJPE': 9.58, 'PAMPJPE': 69.42, 'Pose': 101.2, 'Shape': 0.25, 'Prior': 7.09}
Epoch: [0][1850/8267] Time 0.00 RUN 4.99 Lr 5e-05 Loss 732.98 | Losses {'reg': 360.17, 'det': 372.81, 'CenterMap': 372.81, 'P_KP2D': 172.06, 'MPJPE': 9.42, 'PAMPJPE': 71.84, 'Pose': 100.04, 'Shape': 0.26, 'Prior': 6.56}
Epoch: [0][1900/8267] Time 0.00 RUN 4.96 Lr 5e-05 Loss 733.62 | Losses {'reg': 366.57, 'det': 367.05, 'CenterMap': 367.05, 'P_KP2D': 178.43, 'MPJPE': 11.32, 'PAMPJPE': 70.5, 'Pose': 99.86, 'Shape': 0.28, 'Prior': 6.18}
Epoch: [0][1950/8267] Time 0.00 RUN 4.95 Lr 5e-05 Loss 726.75 | Losses {'reg': 354.36, 'det': 372.39, 'CenterMap': 372.39, 'P_KP2D': 161.53, 'MPJPE': 11.26, 'PAMPJPE': 72.12, 'Pose': 103.65, 'Shape': 0.25, 'Prior': 5.54}
Epoch: [0][2000/8267] Time 0.00 RUN 4.96 Lr 5e-05 Loss 715.37 | Losses {'reg': 343.48, 'det': 371.9, 'CenterMap': 371.9, 'P_KP2D': 157.19, 'MPJPE': 8.88, 'PAMPJPE': 68.88, 'Pose': 102.92, 'Shape': 0.19, 'Prior': 5.41}
0/104
+-----------+--------+----------+
|   DS/EM   | MPJPE  | PA_MPJPE |
+-----------+--------+----------+
| pw3d_vibe | 350.25 |  238.61  |
+-----------+--------+----------+
--------------------
16/104
+-----------+--------+----------+
|   DS/EM   | MPJPE  | PA_MPJPE |
+-----------+--------+----------+
| pw3d_vibe | 315.36 |  194.26  |
+-----------+--------+----------+
--------------------
32/104
+-----------+--------+----------+
|   DS/EM   | MPJPE  | PA_MPJPE |
+-----------+--------+----------+
| pw3d_vibe | 339.53 |  185.75  |
+-----------+--------+----------+
--------------------
48/104
+-----------+--------+----------+
|   DS/EM   | MPJPE  | PA_MPJPE |
+-----------+--------+----------+
| pw3d_vibe | 356.81 |  181.42  |
+-----------+--------+----------+
--------------------
64/104
+-----------+--------+----------+
|   DS/EM   | MPJPE  | PA_MPJPE |
+-----------+--------+----------+
| pw3d_vibe | 348.76 |  177.84  |
+-----------+--------+----------+
--------------------
80/104
+-----------+--------+----------+
|   DS/EM   | MPJPE  | PA_MPJPE |
+-----------+--------+----------+
| pw3d_vibe | 338.10 |  172.33  |
+-----------+--------+----------+
--------------------
96/104
+-----------+--------+----------+
|   DS/EM   | MPJPE  | PA_MPJPE |
+-----------+--------+----------+
| pw3d_vibe | 326.88 |  164.06  |
+-----------+--------+----------+
--------------------
/z/home/mkhoshle/ROMP-old/romp/lib/models/../utils/../maps_utils/result_parser.py:71: UserWarning:

__floordiv__ is deprecated, and its behavior will change in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values. To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor').

Traceback (most recent call last):
  File "/z/home/mkhoshle/env/romp2/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/z/home/mkhoshle/env/romp2/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/z/home/mkhoshle/ROMP-old/romp/train.py", line 182, in <module>
    main()
  File "/z/home/mkhoshle/ROMP-old/romp/train.py", line 179, in main
    trainer.train()
  File "/z/home/mkhoshle/ROMP-old/romp/train.py", line 57, in train
    self.train_epoch(epoch)
  File "/z/home/mkhoshle/ROMP-old/romp/train.py", line 132, in train_epoch
    self.validation(epoch)
  File "/z/home/mkhoshle/ROMP-old/romp/train.py", line 147, in validation
    MPJPE, PA_MPJPE, eval_results = val_result(self,loader_val=val_loader, evaluation=False)
  File "/z/home/mkhoshle/env/romp2/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
    return func(*args, **kwargs)
  File "/z/home/mkhoshle/ROMP-old/romp/eval.py", line 25, in val_result
    outputs = self.network_forward(eval_model, meta_data, self.eval_cfg)
  File "/z/home/mkhoshle/ROMP-old/romp/base.py", line 119, in network_forward
    outputs = model(meta_data, **cfg_dict)
  File "/z/home/mkhoshle/env/romp2/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1120, in _call_impl
    result = forward_call(*input, **kwargs)
  File "/z/home/mkhoshle/env/romp2/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 168, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/z/home/mkhoshle/env/romp2/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 178, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/z/home/mkhoshle/env/romp2/lib/python3.8/site-packages/torch/nn/parallel/parallel_apply.py", line 86, in parallel_apply
    output.reraise()
  File "/z/home/mkhoshle/env/romp2/lib/python3.8/site-packages/torch/_utils.py", line 434, in reraise
    raise exception
TypeError: Caught TypeError in replica 7 on device 7.
Original Traceback (most recent call last):
  File "/z/home/mkhoshle/env/romp2/lib/python3.8/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
    output = module(*input, **kwargs)
  File "/z/home/mkhoshle/env/romp2/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
TypeError: forward() missing 1 required positional argument: 'meta_data'

I think this error happens when the data is divided across processes. However, I am not really sure what the solution would be. Have you encountered this error before?

Thanks in advance,

mkhoshle avatar Aug 13 '22 22:08 mkhoshle

Sorry for the late reply. @mkhoshle Yes, I have met this issue before. The problem is that some GPUs don't receive the meta_data input. It is caused by the DataParallel module when it splits a batch across the GPUs: the bug is most likely triggered when the last batch cannot be divided evenly among them. For instance, when the last batch has 3 samples but 4 GPUs are used during evaluation, the last GPU doesn't get its 'meta_data' input, which triggers this bug.
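A minimal sketch of this failure mode (the Toy module, tensor shapes, and the calc_loss kwarg are made up for illustration, not ROMP's actual code): DataParallel scatters the positional meta_data along the batch dimension, while non-tensor keyword arguments are replicated to every device, so when the batch yields fewer chunks than there are GPUs, the spare replica is called with an empty positional tuple.

```python
import torch
import torch.nn as nn

class Toy(nn.Module):
    # Mirrors a forward signature with one required positional argument
    # plus config-style keyword arguments, similar in shape to
    # model(meta_data, **cfg_dict) above.
    def forward(self, meta_data, **cfg):
        return meta_data['image'].float().mean()

if torch.cuda.device_count() >= 2:
    n_gpus = torch.cuda.device_count()
    model = nn.DataParallel(Toy()).cuda()
    # A batch with fewer samples than GPUs: scatter() produces fewer
    # meta_data chunks than replicas, but the non-tensor kwarg is copied
    # to every replica, so the last replica receives no positional
    # argument at all.
    meta_data = {'image': torch.randn(n_gpus - 1, 3, 8, 8).cuda()}
    model(meta_data, calc_loss=False)
    # -> TypeError: forward() missing 1 required positional argument: 'meta_data'
```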

Arthur151 avatar Sep 05 '22 02:09 Arthur151

@Arthur151 Yes, I had the same understanding. Do you have a fix? How can I avoid this error? How did you avoid it when using DataParallel?

mkhoshle avatar Sep 05 '22 14:09 mkhoshle

For the evaluation that runs alongside training, it would be fine to set drop_last on the validation DataLoader to avoid this. For the formal evaluation phase, we can use only one GPU for evaluation to avoid it.
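A minimal sketch of the drop_last option, using placeholder names (val_dataset, val_batch_size) rather than ROMP's actual loader-building code:

```python
from torch.utils.data import DataLoader

# Placeholder names for illustration; in ROMP the validation loader is
# built inside the Trainer / dataset utilities.
val_loader = DataLoader(
    val_dataset,
    batch_size=val_batch_size,   # e.g. 16
    shuffle=False,
    num_workers=4,
    # Drop the trailing partial batch so DataParallel can always give
    # every replica at least one sample.
    drop_last=True,
)
```

For final benchmark numbers you would not want to drop samples, so restricting that run to a single GPU (e.g. by launching with CUDA_VISIBLE_DEVICES=0) keeps every sample in the evaluation.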

Arthur151 avatar Sep 07 '22 01:09 Arthur151

@Arthur151 Thanks! This solution works!

mkhoshle avatar Sep 30 '22 19:09 mkhoshle