Scene-Graph-Benchmark.pytorch
Model can not be evaluated with USE_GT_BOX set to False
🐛 Bug
The SGDet model cannot be evaluated; I tried it both with and without attributes. It can be trained (with SOLVER.PRE_VAL set to False). I also tried toggling these two arguments:
MODEL.ROI_RELATION_HEAD.USE_GT_BOX and MODEL.ROI_RELATION_HEAD.USE_GT_OBJECT_LABEL, and found that I cannot evaluate the model unless USE_GT_BOX is set to True.
To Reproduce
Steps to reproduce the behavior:
1. Train Faster R-CNN using the following command:
CUDA_VISIBLE_DEVICES=0 python -m torch.distributed.launch --master_port 10001 --nproc_per_node=1 tools/detector_pretrain_net.py --config-file "configs/e2e_relation_detector_X_101_32_8_FPN_1x.yaml" SOLVER.IMS_PER_BATCH 1 TEST.IMS_PER_BATCH 1 DTYPE "float16" SOLVER.MAX_ITER 100000 SOLVER.STEPS "(30000, 45000)" SOLVER.VAL_PERIOD 20000 SOLVER.CHECKPOINT_PERIOD 20000 MODEL.RELATION_ON False OUTPUT_DIR ./pretrained_faster_rcnn_with_att SOLVER.PRE_VAL False MODEL.PRETRAINED_DETECTOR_CKPT ./pretrained_faster_rcnn/model_final.pth
2. Train SGDet:
CUDA_VISIBLE_DEVICES=0 python -m torch.distributed.launch --master_port 10025 --nproc_per_node=1 tools/relation_test_net.py --config-file "configs/e2e_relation_X_101_32_8_FPN_1x.yaml" MODEL.ROI_RELATION_HEAD.USE_GT_BOX False MODEL.ROI_RELATION_HEAD.USE_GT_OBJECT_LABEL False MODEL.ROI_RELATION_HEAD.PREDICTOR MotifPredictor SOLVER.IMS_PER_BATCH 1 TEST.IMS_PER_BATCH 1 DTYPE "float16" SOLVER.MAX_ITER 100000 SOLVER.VAL_PERIOD 20000 SOLVER.CHECKPOINT_PERIOD 20000 GLOVE_DIR ./ MODEL.PRETRAINED_DETECTOR_CKPT ./pretrained_faster_rcnn_with_att/model_final.pth OUTPUT_DIR ./SG_trained
Warning: multi_tensor_applier fused unscale kernel is unavailable, possibly because apex was installed without --cuda_ext --cpp_ext. Using Python fallback. Original ImportError was: ModuleNotFoundError("No module named 'amp_C'",)
2021-04-04 05:47:08,110 maskrcnn_benchmark.utils.checkpoint INFO: Loading checkpoint from ./GQA_SG_two_falses/model_0045200.pth
2021-04-04 05:47:09,157 maskrcnn_benchmark.inference INFO: Start evaluation on VG_stanford_filtered_with_attribute_test dataset(10252 images).
0%| | 0/10252 [00:00<?, ?it/s]
Traceback (most recent call last):
File "tools/relation_test_net.py", line 112, in <module>
main()
File "tools/relation_test_net.py", line 106, in main
output_folder=output_folder,
File "/local-scratch/localhome/lkochiev/Documents/SFU/SGB/Scene-Graph-Benchmark.pytorch/maskrcnn_benchmark/engine/inference.py", line 110, in inference
predictions = compute_on_dataset(model, data_loader, device, synchronize_gather=cfg.TEST.RELATION.SYNC_GATHER, timer=inference_timer)
File "/local-scratch/localhome/lkochiev/Documents/SFU/SGB/Scene-Graph-Benchmark.pytorch/maskrcnn_benchmark/engine/inference.py", line 34, in compute_on_dataset
output = model(images.to(device), targets)
File "/localhome/lkochiev/anaconda3/envs/sgb/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
result = self.forward(*input, **kwargs)
File "/local-scratch/localhome/lkochiev/Documents/SFU/SGB/Scene-Graph-Benchmark.pytorch/maskrcnn_benchmark/modeling/detector/generalized_rcnn.py", line 52, in forward
x, result, detector_losses = self.roi_heads(features, proposals, targets, logger)
File "/localhome/lkochiev/anaconda3/envs/sgb/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
result = self.forward(*input, **kwargs)
File "/local-scratch/localhome/lkochiev/Documents/SFU/SGB/Scene-Graph-Benchmark.pytorch/maskrcnn_benchmark/modeling/roi_heads/roi_heads.py", line 69, in forward
x, detections, loss_relation = self.relation(features, detections, targets, logger)
File "/localhome/lkochiev/anaconda3/envs/sgb/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
result = self.forward(*input, **kwargs)
File "/local-scratch/localhome/lkochiev/Documents/SFU/SGB/Scene-Graph-Benchmark.pytorch/maskrcnn_benchmark/modeling/roi_heads/relation_head/relation_head.py", line 74, in forward
union_features = self.union_feature_extractor(features, proposals, rel_pair_idxs)
File "/localhome/lkochiev/anaconda3/envs/sgb/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
result = self.forward(*input, **kwargs)
File "/local-scratch/localhome/lkochiev/Documents/SFU/SGB/Scene-Graph-Benchmark.pytorch/maskrcnn_benchmark/modeling/roi_heads/relation_head/roi_relation_feature_extractors.py", line 90, in forward
rect_features = self.rect_conv(rect_inputs)
File "/localhome/lkochiev/anaconda3/envs/sgb/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
result = self.forward(*input, **kwargs)
File "/localhome/lkochiev/anaconda3/envs/sgb/lib/python3.6/site-packages/torch/nn/modules/container.py", line 100, in forward
input = module(input)
File "/localhome/lkochiev/anaconda3/envs/sgb/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
result = self.forward(*input, **kwargs)
File "/localhome/lkochiev/anaconda3/envs/sgb/lib/python3.6/site-packages/torch/nn/modules/conv.py", line 345, in forward
return self.conv2d_forward(input, self.weight)
File "/localhome/lkochiev/anaconda3/envs/sgb/lib/python3.6/site-packages/torch/nn/modules/conv.py", line 342, in conv2d_forward
self.padding, self.dilation, self.groups)
File "/localhome/lkochiev/anaconda3/envs/sgb/lib/python3.6/site-packages/apex-0.1-py3.6.egg/apex/amp/wrap.py", line 28, in wrapper
return orig_fn(*new_args, **kwargs)
RuntimeError: CUDA out of memory. Tried to allocate 6.48 GiB (GPU 0; 10.76 GiB total capacity; 8.40 GiB already allocated; 1.49 GiB free; 8.50 GiB reserved in total by PyTorch)
Traceback (most recent call last):
File "/localhome/lkochiev/anaconda3/envs/sgb/lib/python3.6/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/localhome/lkochiev/anaconda3/envs/sgb/lib/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/localhome/lkochiev/anaconda3/envs/sgb/lib/python3.6/site-packages/torch/distributed/launch.py", line 263, in <module>
main()
File "/localhome/lkochiev/anaconda3/envs/sgb/lib/python3.6/site-packages/torch/distributed/launch.py", line 259, in main
cmd=cmd)
Environment
PyTorch version: 1.4.0
Is debug build: False
CUDA used to build PyTorch: 10.1
ROCM used to build PyTorch: N/A
OS: Ubuntu 20.04.2 LTS (x86_64)
GCC version: (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
Clang version: 10.0.0-4ubuntu1
CMake version: version 3.16.3
Python version: 3.6 (64-bit runtime)
Is CUDA available: True
CUDA runtime version: 11.2.67
GPU models and configuration: GPU 0: GeForce RTX 2080 Ti
Nvidia driver version: 460.39
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Versions of relevant libraries:
[pip3] numpy==1.19.2
[pip3] torch==1.4.0
[pip3] torchvision==0.5.0
[conda] blas 1.0 mkl
[conda] cudatoolkit 10.1.243 h6bb024c_0
[conda] mkl 2020.2 256
[conda] mkl-service 2.3.0 py36he8ac12f_0
[conda] mkl_fft 1.2.0 py36h23d657b_0
[conda] mkl_random 1.1.1 py36h0573a6f_0
[conda] numpy 1.19.2 py36h54aff64_0
[conda] numpy-base 1.19.2 py36hfa32c7d_0
[conda] pytorch 1.4.0 py3.6_cuda10.1.243_cudnn7.6.3_0 pytorch
[conda] torchvision 0.5.0 py36_cu101 pytorch
Additional context
I tried running this model with different CUDA and Python versions, but the result is the same in every case.
I believe this might be caused by too many proposals being sampled (more than the number specified in the config file).
As a result, during testing, the number of union regions passed to rect_features = self.rect_conv(rect_inputs) could be much higher than 80*79 = 6320 (if the maximum number of detections is set to 80), resulting in the OOM error.
This can happen if all or too many detections have the same confidence score.
In that case, keep could retain more than the intended maximum of 80 proposals during postprocessing:
https://github.com/KaihuaTang/Scene-Graph-Benchmark.pytorch/blob/d0ffa40d92133d7d865e531146de82c8c8a344c0/maskrcnn_benchmark/modeling/roi_heads/box_head/inference.py#L224
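To illustrate how tied scores could let more boxes through than intended, here is a minimal pure-Python sketch. It assumes the postprocessing step takes the score of the k-th best detection as a threshold and keeps everything at or above it; the linked code may differ in detail. The helper names keep_by_threshold, keep_top_k and num_ordered_pairs are hypothetical and not part of the repository.

```python
def keep_by_threshold(scores, max_det):
    """Keep every index whose score is >= the max_det-th highest score.

    Assumed behaviour, for illustration only: all ties at the threshold survive.
    """
    if len(scores) <= max_det:
        return list(range(len(scores)))
    thresh = sorted(scores, reverse=True)[max_det - 1]
    return [i for i, s in enumerate(scores) if s >= thresh]


def keep_top_k(scores, max_det):
    """Keep at most max_det indices, breaking ties by original order."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(order[:max_det])


def num_ordered_pairs(n):
    """Number of ordered subject-object pairs among n detections."""
    return n * (n - 1)


# 300 detections that all share the same confidence score.
tied_scores = [0.5] * 300
kept_threshold = keep_by_threshold(tied_scores, max_det=80)
kept_top_k = keep_top_k(tied_scores, max_det=80)

# Threshold-based filtering keeps all 300 boxes; a hard top-k keeps 80.
print(len(kept_threshold), num_ordered_pairs(len(kept_threshold)))
print(len(kept_top_k), num_ordered_pairs(len(kept_top_k)))
```

With fully tied scores the threshold variant keeps every detection, so the number of candidate pairs grows quadratically with the raw detection count instead of being capped at 80*79.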
You can check the dimensions of rect_inputs and compare them with the specified maximum number of detections to verify whether this is the case.
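A small helper along these lines could make that comparison explicit. This is only a sketch: check_union_count is a hypothetical name, and it assumes that the first dimension of rect_inputs indexes subject-object pairs and that one image is processed at a time (TEST.IMS_PER_BATCH 1, as in the commands above).

```python
def max_union_regions(max_detections):
    """Upper bound on ordered pairs if at most max_detections boxes survive."""
    return max_detections * (max_detections - 1)


def check_union_count(num_union_regions, max_detections):
    """Compare an observed number of union regions with the expected bound."""
    limit = max_union_regions(max_detections)
    return {
        "observed": num_union_regions,
        "limit": limit,
        "exceeds_limit": num_union_regions > limit,
        "ratio": num_union_regions / limit if limit else float("inf"),
    }


# Hypothetical usage just before self.rect_conv(rect_inputs):
#   print(check_union_count(rect_inputs.shape[0], max_detections=80))
report = check_union_count(num_union_regions=89700, max_detections=80)
print(report["exceeds_limit"], report["limit"])
```

If exceeds_limit is True for the failing image, more detections are reaching the relation head than the configured maximum allows.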