Mask2Former training with V100 instead of A100 GPUs
Hi,
The Mask2Former metafile lists the training resources as 8x A100 GPUs:
Collections:
  - Name: Mask2Former
    Metadata:
      Training Data: COCO
      Training Techniques:
        - AdamW
        - Weight Decay
      Training Resources: 8x A100 GPUs
What changes (e.g. smaller batch size per GPU) are needed if I use V100 instead of A100 GPUs?
Note: the V100 has only 16 GB of GPU RAM, whereas the A100 has 40 GB or 80 GB.
Hi,
I tried the following (the only config change I made was pointing to my own custom COCO JSON files; a sketch of that kind of override follows the command below):
./tools/dist_train.sh configs/mask2former/my_custom_mask2former_swin-s-p4-w7-224_lsj_8x2_50e_coco.py 8
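For reference, the override is essentially just the stock config with the dataset paths swapped out. A rough sketch is below; all paths and file names are placeholders, not the actual files I used:

```python
# my_custom_mask2former_swin-s-p4-w7-224_lsj_8x2_50e_coco.py
# Sketch of a config whose only change is pointing at custom COCO-style JSON files.
_base_ = './mask2former_swin-s-p4-w7-224_lsj_8x2_50e_coco.py'

# Hypothetical dataset location and file names.
data_root = '/data/my_dataset/'
data = dict(
    train=dict(
        ann_file=data_root + 'annotations/train.json',
        img_prefix=data_root + 'images/train/'),
    val=dict(
        ann_file=data_root + 'annotations/val.json',
        img_prefix=data_root + 'images/val/'),
    test=dict(
        ann_file=data_root + 'annotations/val.json',
        img_prefix=data_root + 'images/val/'))
```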
For the last 30 minutes, it has been stuck as shown below. CPU usage is at 100% for 8 processes, and GPU (V100) usage is at 0%. What could be wrong?
loading annotations into memory...
loading annotations into memory...
loading annotations into memory...
loading annotations into memory...
loading annotations into memory...
loading annotations into memory...
loading annotations into memory...
Done (t=0.45s)
creating index...
Done (t=0.45s)
creating index...
Done (t=0.45s)
creating index...
Done (t=0.45s)
creating index...
Done (t=0.45s)
creating index...
Done (t=0.45s)
creating index...
Done (t=0.45s)
creating index...
index created!
index created!
index created!
index created!
index created!
index created!
index created!
2022-07-12 23:27:41.991601: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
My environment info:
2022-07-12 23:27:12,987 - mmdet - INFO - Environment info:
------------------------------------------------------------
sys.platform: linux
Python: 3.7.8 | packaged by conda-forge | (default, Jul 31 2020, 02:25:08) [GCC 7.5.0]
CUDA available: True
GPU 0,1,2,3,4,5,6,7: Tesla V100-SXM2-16GB
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 11.0, V11.0.194
GCC: gcc (Debian 8.3.0-6) 8.3.0
PyTorch: 1.6.0a0+9907a3e
PyTorch compiling details: PyTorch built with:
- GCC 7.5
- C++ Version: 201402
- Intel(R) Math Kernel Library Version 2020.0.2 Product Build 20200624 for Intel(R) 64 architecture applications
- OpenMP 201511 (a.k.a. OpenMP 4.5)
- NNPACK is enabled
- CPU capability usage: AVX2
- CUDA Runtime 11.0
- NVCC architecture flags: -gencode;arch=compute_52,code=sm_52;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_80,code=compute_80
- CuDNN 8.0
- Built with CuDNN 8.0.1
- Magma 2.5.2
- Build settings: BLAS=MKL, BUILD_TYPE=Release, CXX_FLAGS= -fvisibility-inlines-hidden -fopenmp -DNDEBUG -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DUSE_INTERNAL_THREADPOOL_IMPL -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, FORCE_FALLBACK_CUDA_MPI=1, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, USE_CUDA=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=OFF, USE_MPI=ON, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_STATIC_DISPATCH=OFF,
TorchVision: 0.7.0a0
OpenCV: 4.4.0
MMCV: 1.6.0
MMCV Compiler: GCC 8.3
MMCV CUDA Compiler: 11.0
MMDetection: 2.25.0+d4fa2b4
------------------------------------------------------------
@jackkwok Did you find the issue?
@roboserg: No, actually I haven't. Do you have the same issue?
@jackkwok You can use batch_size = 8 * 1 (one image per GPU) and scale the learning rate accordingly, since the V100 has only 16 GB. For batch_size = 8 * 2, 19 GB of GPU memory is required; see https://github.com/open-mmlab/mmdetection/blob/3b72b12fe9b14de906d1363982b9fba05e7d47c1/configs/mask2former/metafile.yml#L212
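A rough sketch of that kind of override is below. It assumes the stock config's total batch size of 16 (8 GPUs x 2 images) and a base AdamW lr of 0.0001, and it halves the lr following the usual linear scaling rule; treat these values as assumptions to verify against your actual base config:

```python
# Sketch of overrides for 16 GB V100s on top of the stock 8x2 Mask2Former config.
# Assumption: the base config uses 8 GPUs x 2 images (batch size 16) with lr=0.0001.
_base_ = './mask2former_swin-s-p4-w7-224_lsj_8x2_50e_coco.py'

# One image per GPU -> total batch size becomes 8 GPUs x 1 = 8.
data = dict(samples_per_gpu=1)

# Linear scaling rule: halve the learning rate since the batch size was halved.
optimizer = dict(lr=0.0001 / 2)
```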
@roboserg: No, actually I haven't. Do you have the same issue?
Yes, I have a similar issue: the Mask2Former model gets stuck at the beginning of training, just like in your case, and does not train. Mask R-CNN works; Mask2Former gets stuck. I have no idea how to debug the issue.
@roboserg : Thanks for letting me know. It's good to know I am not the only one facing the issue. Do you mind posting your environment info?
The environment info is at the beginning of the log file; see https://download.openmmlab.com/mmdetection/v2.0/mask2former/mask2former_r50_lsj_8x2_50e_coco-panoptic/mask2former_r50_lsj_8x2_50e_coco-panoptic_20220326_224516.log.json
Did anyone find a solution for this?
Did anyone find a solution for this?
I switched to Detectron2; it works great.