Scene-Graph-Benchmark.pytorch Model can not be evaluated with USE_GT

Model can not be evaluated with USE_GT_BOX set to False

Open nullkatar opened this issue 4 years ago • 1 comments

🐛 Bug

The SGDet model (I tried both with and without attributes) cannot be evaluated. It can be trained (with SOLVER.PRE_VAL set to False). I also tried toggling these two arguments: MODEL.ROI_RELATION_HEAD.USE_GT_BOX and MODEL.ROI_RELATION_HEAD.USE_GT_OBJECT_LABEL, and found that I cannot evaluate the model unless USE_GT_BOX is set to True.

To Reproduce

Steps to reproduce the behavior:

1. Train FasterRCNN using the following command:

   CUDA_VISIBLE_DEVICES=0 python -m torch.distributed.launch --master_port 10001 --nproc_per_node=1 tools/detector_pretrain_net.py --config-file "configs/e2e_relation_detector_X_101_32_8_FPN_1x.yaml" SOLVER.IMS_PER_BATCH 1 TEST.IMS_PER_BATCH 1 DTYPE "float16" SOLVER.MAX_ITER 100000 SOLVER.STEPS "(30000, 45000)" SOLVER.VAL_PERIOD 20000 SOLVER.CHECKPOINT_PERIOD 20000 MODEL.RELATION_ON False OUTPUT_DIR ./pretrained_faster_rcnn_with_att SOLVER.PRE_VAL False MODEL.PRETRAINED_DETECTOR_CKPT ./pretrained_faster_rcnn/model_final.pth

2. Train SGDet:

   CUDA_VISIBLE_DEVICES=0 python -m torch.distributed.launch --master_port 10025 --nproc_per_node=1 tools/relation_test_net.py --config-file "configs/e2e_relation_X_101_32_8_FPN_1x.yaml" MODEL.ROI_RELATION_HEAD.USE_GT_BOX False MODEL.ROI_RELATION_HEAD.USE_GT_OBJECT_LABEL False MODEL.ROI_RELATION_HEAD.PREDICTOR MotifPredictor SOLVER.IMS_PER_BATCH 1 TEST.IMS_PER_BATCH 1 DTYPE "float16" SOLVER.MAX_ITER 100000 SOLVER.VAL_PERIOD 20000 SOLVER.CHECKPOINT_PERIOD 20000 GLOVE_DIR ./ MODEL.PRETRAINED_DETECTOR_CKPT ./pretrained_faster_rcnn_with_att/model_final.pth OUTPUT_DIR ./SG_trained

Warning:  multi_tensor_applier fused unscale kernel is unavailable, possibly because apex was installed without --cuda_ext --cpp_ext. Using Python fallback.  Original ImportError was: ModuleNotFoundError("No module named 'amp_C'",)
2021-04-04 05:47:08,110 maskrcnn_benchmark.utils.checkpoint INFO: Loading checkpoint from ./GQA_SG_two_falses/model_0045200.pth
2021-04-04 05:47:09,157 maskrcnn_benchmark.inference INFO: Start evaluation on VG_stanford_filtered_with_attribute_test dataset(10252 images).
  0%|          | 0/10252 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "tools/relation_test_net.py", line 112, in <module>
    main()
  File "tools/relation_test_net.py", line 106, in main
    output_folder=output_folder,
  File "/local-scratch/localhome/lkochiev/Documents/SFU/SGB/Scene-Graph-Benchmark.pytorch/maskrcnn_benchmark/engine/inference.py", line 110, in inference
    predictions = compute_on_dataset(model, data_loader, device, synchronize_gather=cfg.TEST.RELATION.SYNC_GATHER, timer=inference_timer)
  File "/local-scratch/localhome/lkochiev/Documents/SFU/SGB/Scene-Graph-Benchmark.pytorch/maskrcnn_benchmark/engine/inference.py", line 34, in compute_on_dataset
    output = model(images.to(device), targets)
  File "/localhome/lkochiev/anaconda3/envs/sgb/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/local-scratch/localhome/lkochiev/Documents/SFU/SGB/Scene-Graph-Benchmark.pytorch/maskrcnn_benchmark/modeling/detector/generalized_rcnn.py", line 52, in forward
    x, result, detector_losses = self.roi_heads(features, proposals, targets, logger)
  File "/localhome/lkochiev/anaconda3/envs/sgb/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/local-scratch/localhome/lkochiev/Documents/SFU/SGB/Scene-Graph-Benchmark.pytorch/maskrcnn_benchmark/modeling/roi_heads/roi_heads.py", line 69, in forward
    x, detections, loss_relation = self.relation(features, detections, targets, logger)
  File "/localhome/lkochiev/anaconda3/envs/sgb/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/local-scratch/localhome/lkochiev/Documents/SFU/SGB/Scene-Graph-Benchmark.pytorch/maskrcnn_benchmark/modeling/roi_heads/relation_head/relation_head.py", line 74, in forward
    union_features = self.union_feature_extractor(features, proposals, rel_pair_idxs)
  File "/localhome/lkochiev/anaconda3/envs/sgb/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/local-scratch/localhome/lkochiev/Documents/SFU/SGB/Scene-Graph-Benchmark.pytorch/maskrcnn_benchmark/modeling/roi_heads/relation_head/roi_relation_feature_extractors.py", line 90, in forward
    rect_features = self.rect_conv(rect_inputs)
  File "/localhome/lkochiev/anaconda3/envs/sgb/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/localhome/lkochiev/anaconda3/envs/sgb/lib/python3.6/site-packages/torch/nn/modules/container.py", line 100, in forward
    input = module(input)
  File "/localhome/lkochiev/anaconda3/envs/sgb/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/localhome/lkochiev/anaconda3/envs/sgb/lib/python3.6/site-packages/torch/nn/modules/conv.py", line 345, in forward
    return self.conv2d_forward(input, self.weight)
  File "/localhome/lkochiev/anaconda3/envs/sgb/lib/python3.6/site-packages/torch/nn/modules/conv.py", line 342, in conv2d_forward
    self.padding, self.dilation, self.groups)
  File "/localhome/lkochiev/anaconda3/envs/sgb/lib/python3.6/site-packages/apex-0.1-py3.6.egg/apex/amp/wrap.py", line 28, in wrapper
    return orig_fn(*new_args, **kwargs)
RuntimeError: CUDA out of memory. Tried to allocate 6.48 GiB (GPU 0; 10.76 GiB total capacity; 8.40 GiB already allocated; 1.49 GiB free; 8.50 GiB reserved in total by PyTorch)
Traceback (most recent call last):
  File "/localhome/lkochiev/anaconda3/envs/sgb/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/localhome/lkochiev/anaconda3/envs/sgb/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/localhome/lkochiev/anaconda3/envs/sgb/lib/python3.6/site-packages/torch/distributed/launch.py", line 263, in <module>
    main()
  File "/localhome/lkochiev/anaconda3/envs/sgb/lib/python3.6/site-packages/torch/distributed/launch.py", line 259, in main
    cmd=cmd)

Environment

PyTorch version: 1.4.0
Is debug build: False
CUDA used to build PyTorch: 10.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.2 LTS (x86_64)
GCC version: (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
Clang version: 10.0.0-4ubuntu1
CMake version: version 3.16.3

Python version: 3.6 (64-bit runtime)
Is CUDA available: True
CUDA runtime version: 11.2.67
GPU models and configuration: GPU 0: GeForce RTX 2080 Ti
Nvidia driver version: 460.39
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] numpy==1.19.2
[pip3] torch==1.4.0
[pip3] torchvision==0.5.0
[conda] blas 1.0 mkl
[conda] cudatoolkit 10.1.243 h6bb024c_0
[conda] mkl 2020.2 256
[conda] mkl-service 2.3.0 py36he8ac12f_0
[conda] mkl_fft 1.2.0 py36h23d657b_0
[conda] mkl_random 1.1.1 py36h0573a6f_0
[conda] numpy 1.19.2 py36h54aff64_0
[conda] numpy-base 1.19.2 py36hfa32c7d_0
[conda] pytorch 1.4.0 py3.6_cuda10.1.243_cudnn7.6.3_0 pytorch
[conda] torchvision 0.5.0 py36_cu101 pytorch

Additional context

I tried running this model with different CUDA and Python versions, but the result is the same in every case.

Apr 04 '21 13:04 nullkatar

I believe this might be caused by too many proposals being sampled (a higher number than what is specified in the config file).

As a result, during testing, the number of union regions passed to rect_features = self.rect_conv(rect_inputs) could be much higher than 80*79 (if the maximum number of detections is set to 80), resulting in the OOM error.
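To make the 80*79 figure concrete, here is a minimal sketch of how the number of ordered subject/object pairs grows with the number of detections. The function names are hypothetical and only illustrate the arithmetic; they are not part of the repository.

```python
def num_ordered_pairs(num_detections: int) -> int:
    """Ordered (subject, object) pairs with subject != object."""
    return num_detections * (num_detections - 1)


def growth_vs_limit(num_detections: int, max_detections: int = 80) -> float:
    """How many times more pairs there are than at the configured maximum."""
    return num_ordered_pairs(num_detections) / num_ordered_pairs(max_detections)


# The pair count grows roughly quadratically, so a few extra detections
# beyond the limit quickly multiply the number of union regions.
for n in (80, 160, 320):
    print(n, num_ordered_pairs(n), round(growth_vs_limit(n), 1))
```

With 80 detections this gives 6320 pairs; doubling the detections roughly quadruples the pairs, which is why an unexpectedly large detection count could exhaust GPU memory in a single convolution.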

This might happen if all or too many detections have the same confidence score. In that case, keep could retain more than the intended maximum of 80 proposals during postprocessing: https://github.com/KaihuaTang/Scene-Graph-Benchmark.pytorch/blob/d0ffa40d92133d7d865e531146de82c8c8a344c0/maskrcnn_benchmark/modeling/roi_heads/box_head/inference.py#L224
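The effect of tied scores can be illustrated with a simplified, pure-Python stand-in for a threshold-based filter. This is a sketch under the assumption that the postprocessing keeps every detection whose score is at least the k-th largest score; it is not the repository's code, and both function names are hypothetical.

```python
def keep_by_score_threshold(scores, max_detections):
    """Keep indices whose score is >= the k-th largest score (ties all pass)."""
    if max_detections <= 0 or len(scores) <= max_detections:
        return list(range(len(scores)))
    threshold = sorted(scores, reverse=True)[max_detections - 1]
    return [i for i, s in enumerate(scores) if s >= threshold]


def keep_top_k_strict(scores, max_detections):
    """Keep exactly the first k indices after sorting by score (ties cut off)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:max_detections]


distinct = [0.9, 0.8, 0.7, 0.6, 0.5]
tied = [0.9, 0.5, 0.5, 0.5, 0.5]

print(len(keep_by_score_threshold(distinct, 2)))  # 2: limit respected
print(len(keep_by_score_threshold(tied, 2)))      # 5: every tied score passes
print(len(keep_top_k_strict(tied, 2)))            # 2: hard cap
```

If the threshold-style filter is what runs in practice, a hard cap such as the strict variant would bound the number of pairs regardless of ties, at the cost of dropping some equally scored detections arbitrarily.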

You could check out the dimension of rect_inputs and compare it with the specified maximum number of detections to verify if this is the case.

Jul 16 '21 12:07 j-rausch
