
Mask2Former training with V100 instead of A100 GPUs

jackkwok opened this issue 2 years ago

Hi,

The Mask2Former metafile lists the Training Resources as 8x A100 GPUs:

Collections:
  - Name: Mask2Former
    Metadata:
      Training Data: COCO
      Training Techniques:
        - AdamW
        - Weight Decay
      Training Resources: 8x A100 GPUs

What changes (e.g. smaller batch size per GPU) are needed if I use V100 instead of A100 GPUs?

Note: the V100 has only 16 GB of GPU RAM, whereas the A100 has 40 GB or 80 GB.

jackkwok · Jul 12 '22 21:07

Hi,

I tried the following (the only config change I made was pointing to my own custom COCO JSON files):

./tools/dist_train.sh configs/mask2former/my_custom_mask2former_swin-s-p4-w7-224_lsj_8x2_50e_coco.py 8
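
The config change amounts to roughly the following (the paths below are placeholders for my own dataset; everything else is inherited from the stock config):

# my_custom_mask2former_swin-s-p4-w7-224_lsj_8x2_50e_coco.py
# Only the dataset paths differ from the base config; paths are placeholders.
_base_ = './mask2former_swin-s-p4-w7-224_lsj_8x2_50e_coco.py'

data_root = 'data/my_dataset/'  # placeholder
data = dict(
    train=dict(
        ann_file=data_root + 'annotations/train.json',
        img_prefix=data_root + 'train/'),
    val=dict(
        ann_file=data_root + 'annotations/val.json',
        img_prefix=data_root + 'val/'),
    test=dict(
        ann_file=data_root + 'annotations/val.json',
        img_prefix=data_root + 'val/'))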

For the last 30 minutes it has been stuck as shown in the log below. CPU usage is at 100% for all 8 processes, while GPU (V100) usage is at 0%. What could be wrong?

loading annotations into memory...
loading annotations into memory...
loading annotations into memory...
loading annotations into memory...
loading annotations into memory...
loading annotations into memory...
loading annotations into memory...
Done (t=0.45s)
creating index...
Done (t=0.45s)
creating index...
Done (t=0.45s)
creating index...
Done (t=0.45s)
creating index...
Done (t=0.45s)
creating index...
Done (t=0.45s)
creating index...
Done (t=0.45s)
creating index...
index created!
index created!
index created!
index created!
index created!
index created!
index created!
2022-07-12 23:27:41.991601: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0

My environment info:

2022-07-12 23:27:12,987 - mmdet - INFO - Environment info:
------------------------------------------------------------
sys.platform: linux
Python: 3.7.8 | packaged by conda-forge | (default, Jul 31 2020, 02:25:08) [GCC 7.5.0]
CUDA available: True
GPU 0,1,2,3,4,5,6,7: Tesla V100-SXM2-16GB
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 11.0, V11.0.194
GCC: gcc (Debian 8.3.0-6) 8.3.0
PyTorch: 1.6.0a0+9907a3e
PyTorch compiling details: PyTorch built with:
  - GCC 7.5
  - C++ Version: 201402
  - Intel(R) Math Kernel Library Version 2020.0.2 Product Build 20200624 for Intel(R) 64 architecture applications
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - NNPACK is enabled
  - CPU capability usage: AVX2
  - CUDA Runtime 11.0
  - NVCC architecture flags: -gencode;arch=compute_52,code=sm_52;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_80,code=compute_80
  - CuDNN 8.0
    - Built with CuDNN 8.0.1
  - Magma 2.5.2
  - Build settings: BLAS=MKL, BUILD_TYPE=Release, CXX_FLAGS= -fvisibility-inlines-hidden -fopenmp -DNDEBUG -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DUSE_INTERNAL_THREADPOOL_IMPL -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, FORCE_FALLBACK_CUDA_MPI=1, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, USE_CUDA=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=OFF, USE_MPI=ON, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_STATIC_DISPATCH=OFF,

TorchVision: 0.7.0a0
OpenCV: 4.4.0
MMCV: 1.6.0
MMCV Compiler: GCC 8.3
MMCV CUDA Compiler: 11.0
MMDetection: 2.25.0+d4fa2b4
------------------------------------------------------------

jackkwok · Jul 13 '22 00:07

@jackkwok Did you find the issue?

roboserg · Jul 30 '22 01:07

@roboserg : No actually, do you have the same issue?

jackkwok · Jul 30 '22 01:07

@jackkwok You can use batch_size = 8 * 1 and scale the lr accordingly, since the V100 only has 16 GB. For batch_size = 8 * 2, 19 GB is required; see https://github.com/open-mmlab/mmdetection/blob/3b72b12fe9b14de906d1363982b9fba05e7d47c1/configs/mask2former/metafile.yml#L212
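
For example, an override along these lines (the file name is just an example; this assumes the stock 8x2 config, i.e. a total batch size of 16 with AdamW lr=0.0001, and scales the lr linearly):

# mask2former_swin-s-p4-w7-224_lsj_8x1_50e_coco.py  (example override config)
_base_ = './mask2former_swin-s-p4-w7-224_lsj_8x2_50e_coco.py'

data = dict(samples_per_gpu=1)        # 8 GPUs x 1 img/GPU = total batch size 8
optimizer = dict(lr=0.0001 * 8 / 16)  # scale the default lr linearly with the batch size

If your mmdet version supports it, you can also pass --auto-scale-lr to tools/train.py instead of hard-coding the lr.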

chhluo · Aug 04 '22 06:08

@roboserg : No actually, do you have the same issue?

Yes, I have a similar issue: Mask2Former gets stuck at the beginning of training, just like in your case, and never starts training. Mask R-CNN works; Mask2Former gets stuck. I have no idea how to debug the issue.

roboserg · Aug 09 '22 00:08

@roboserg : Thanks for letting me know. It's good to know I am not the only one facing the issue. Do you mind posting your environment info?

jackkwok · Aug 09 '22 05:08

The environment info is at the beginning of the log file; see https://download.openmmlab.com/mmdetection/v2.0/mask2former/mask2former_r50_lsj_8x2_50e_coco-panoptic/mask2former_r50_lsj_8x2_50e_coco-panoptic_20220326_224516.log.json

chhluo · Aug 09 '22 09:08

Did anyone find a solution for this?

Amr-Mustafa · Feb 10 '23 13:02

Did anyone find a solution for this?

I switched to Detectron2; it works great.

Robotatron · Apr 12 '23 14:04