
Mask2Former training with V100 instead of A100 GPUs

jackkwok opened this issue 2 years ago

Hi,

The Mask2Former metafile lists the Training Resources as 8x A100 GPUs:

Collections:
  - Name: Mask2Former
    Metadata:
      Training Data: COCO
      Training Techniques:
        - AdamW
        - Weight Decay
      Training Resources: 8x A100 GPUs

What changes (e.g. smaller batch size per GPU) are needed if I use V100 instead of A100 GPUs?

Note: the V100 has only 16 GB of GPU RAM, whereas the A100 has 40 GB or 80 GB.

jackkwok · Jul 12 '22 21:07

Hi,

I tried the following (the only config change I made was pointing to my own custom COCO JSON files):

./tools/dist_train.sh configs/mask2former/my_custom_mask2former_swin-s-p4-w7-224_lsj_8x2_50e_coco.py 8
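
The config change amounts to roughly the following (the paths below are placeholders for my own dataset; everything else is inherited from the stock config):

# my_custom_mask2former_swin-s-p4-w7-224_lsj_8x2_50e_coco.py
# Only the dataset paths differ from the base config; paths are placeholders.
_base_ = './mask2former_swin-s-p4-w7-224_lsj_8x2_50e_coco.py'

data_root = 'data/my_dataset/'  # placeholder
data = dict(
    train=dict(
        ann_file=data_root + 'annotations/train.json',
        img_prefix=data_root + 'train/'),
    val=dict(
        ann_file=data_root + 'annotations/val.json',
        img_prefix=data_root + 'val/'),
    test=dict(
        ann_file=data_root + 'annotations/val.json',
        img_prefix=data_root + 'val/'))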

For the last 30 minutes it has been stuck as shown in the log below. CPU usage is at 100% for all 8 processes, while GPU (V100) usage is at 0%. What could be wrong?

loading annotations into memory...
loading annotations into memory...
loading annotations into memory...
loading annotations into memory...
loading annotations into memory...
loading annotations into memory...
loading annotations into memory...
Done (t=0.45s)
creating index...
Done (t=0.45s)
creating index...
Done (t=0.45s)
creating index...
Done (t=0.45s)
creating index...
Done (t=0.45s)
creating index...
Done (t=0.45s)
creating index...
Done (t=0.45s)
creating index...
index created!
index created!
index created!
index created!
index created!
index created!
index created!
2022-07-12 23:27:41.991601: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0

My environment info:

2022-07-12 23:27:12,987 - mmdet - INFO - Environment info:
------------------------------------------------------------
sys.platform: linux
Python: 3.7.8 | packaged by conda-forge | (default, Jul 31 2020, 02:25:08) [GCC 7.5.0]
CUDA available: True
GPU 0,1,2,3,4,5,6,7: Tesla V100-SXM2-16GB
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 11.0, V11.0.194
GCC: gcc (Debian 8.3.0-6) 8.3.0
PyTorch: 1.6.0a0+9907a3e
PyTorch compiling details: PyTorch built with:
  - GCC 7.5
  - C++ Version: 201402
  - Intel(R) Math Kernel Library Version 2020.0.2 Product Build 20200624 for Intel(R) 64 architecture applications
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - NNPACK is enabled
  - CPU capability usage: AVX2
  - CUDA Runtime 11.0
  - NVCC architecture flags: -gencode;arch=compute_52,code=sm_52;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_80,code=compute_80
  - CuDNN 8.0
    - Built with CuDNN 8.0.1
  - Magma 2.5.2
  - Build settings: BLAS=MKL, BUILD_TYPE=Release, CXX_FLAGS= -fvisibility-inlines-hidden -fopenmp -DNDEBUG -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DUSE_INTERNAL_THREADPOOL_IMPL -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, FORCE_FALLBACK_CUDA_MPI=1, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, USE_CUDA=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=OFF, USE_MPI=ON, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_STATIC_DISPATCH=OFF,

TorchVision: 0.7.0a0
OpenCV: 4.4.0
MMCV: 1.6.0
MMCV Compiler: GCC 8.3
MMCV CUDA Compiler: 11.0
MMDetection: 2.25.0+d4fa2b4
------------------------------------------------------------

jackkwok · Jul 13 '22 00:07

@jackkwok Did you find the issue?

roboserg · Jul 30 '22 01:07

@roboserg : No actually, do you have the same issue?

jackkwok · Jul 30 '22 01:07

@jackkwok You can use batch_size = 8 * 1 and scale the lr accordingly, since the V100 only has 16 GB. For batch_size = 8 * 2, 19 GB is required; see https://github.com/open-mmlab/mmdetection/blob/3b72b12fe9b14de906d1363982b9fba05e7d47c1/configs/mask2former/metafile.yml#L212
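
For example, an override along these lines (the file name is just an example; this assumes the stock 8x2 config, i.e. a total batch size of 16 with AdamW lr=0.0001, and scales the lr linearly):

# mask2former_swin-s-p4-w7-224_lsj_8x1_50e_coco.py  (example override config)
_base_ = './mask2former_swin-s-p4-w7-224_lsj_8x2_50e_coco.py'

data = dict(samples_per_gpu=1)        # 8 GPUs x 1 img/GPU = total batch size 8
optimizer = dict(lr=0.0001 * 8 / 16)  # scale the default lr linearly with the batch size

If your mmdet version supports it, you can also pass --auto-scale-lr to tools/train.py instead of hard-coding the lr.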

chhluo · Aug 04 '22 06:08

@roboserg : No actually, do you have the same issue?

Yes, I have a similar issue: Mask2Former gets stuck at the beginning of training, just like in your case, and never starts training. Mask R-CNN works; Mask2Former gets stuck. I have no idea how to debug the issue.

roboserg · Aug 09 '22 00:08

@roboserg : Thanks for letting me know. It's good to know I am not the only one facing the issue. Do you mind posting your environment info?

jackkwok · Aug 09 '22 05:08

The environment info is at the beginning of the log file; see https://download.openmmlab.com/mmdetection/v2.0/mask2former/mask2former_r50_lsj_8x2_50e_coco-panoptic/mask2former_r50_lsj_8x2_50e_coco-panoptic_20220326_224516.log.json

chhluo · Aug 09 '22 09:08

Did anyone find a solution for this?

Amr-Mustafa · Feb 10 '23 13:02

Did anyone find a solution for this?

I switched to Detectron2; it works great.

Robotatron · Apr 12 '23 14:04