mmsegmentation Problem encountered when using fast-scnn

Thanks for your error report and we appreciate it a lot.

Checklist

I have searched related issues but cannot get the expected help.
The bug has not been fixed in the latest version.

Describe the bug

I tried to build a config file for fast-scnn with the RescueNet dataset, but I encountered a CUDA error while accuracy was being calculated. What confuses me is that, using the same dataset and the same method of building config files, deeplabv3p, pspnet and mobilenetv3 can all be executed successfully.

Reproduction

What command or script did you run?

python tools/train.py my_configs/rescuenet_fast-scnn.py

Did you make any modifications on the code or config? Did you understand what you have modified?

The following are the files I wrote myself. I have not modified the relevant config files of fast-scnn.

./configs/base/datasets/rescuenet.py

rescuenet.txt

./mmseg/datasets/rescuenet.py

rescuenet.txt

./my_configs/config.py

config.txt

What dataset did you use?

The RescueNet dataset, a post-disaster UAV dataset: https://www.kaggle.com/datasets/yaroslavchyrko/rescuenet

I converted the dataset into the following file structure and resized all images to 3000 * 4000 (H * W).

📁 RescueNet/
├─📁 ann_dir/
│ ├─📁 test/
│ ├─📁 train/
│ └─📁 val/
├─📁 img_dir/
│ ├─📁 test/
│ ├─📁 train/
│ └─📁 val/
└─📄 RescueNet-DATASET-VERSION-NOTE-v1.0.txt
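The layout described above can be sanity-checked by counting the files in each split. The sketch below is only an illustration: the helper name `summarize_layout` is hypothetical, and it makes no assumption about file suffixes or naming. It only checks that each `img_dir/<split>` and `ann_dir/<split>` folder exists and reports how many files it contains.

```python
from pathlib import Path

SPLITS = ("train", "val", "test")


def summarize_layout(root):
    """Return {split: (num_images, num_annotations)} for a dataset root.

    Hypothetical helper: a count of -1 means the directory is missing.
    """
    root = Path(root)
    summary = {}
    for split in SPLITS:
        counts = []
        for sub in ("img_dir", "ann_dir"):
            folder = root / sub / split
            if folder.is_dir():
                # Count regular files only; nested folders are ignored.
                counts.append(sum(1 for p in folder.iterdir() if p.is_file()))
            else:
                counts.append(-1)
        summary[split] = tuple(counts)
    return summary
```

If the two numbers for a split differ, or either is -1, some images probably lack a matching annotation (or the other way round), which is worth ruling out before looking at the model config.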

Environment

Please run python mmseg/utils/collect_env.py to collect necessary environment information and paste it here.

sys.platform: linux
Python: 3.7.13 (default, Mar 29 2022, 02:18:16) [GCC 7.5.0]
CUDA available: True
numpy_random_seed: 2147483648
GPU 0: Tesla P100-PCIE-16GB
CUDA_HOME: /data/CUDA/cuda-11.7
NVCC: Cuda compilation tools, release 11.7, V11.7.64
GCC: gcc (GCC) 5.4.0
PyTorch: 1.10.1+cu113
PyTorch compiling details: PyTorch built with:

GCC 7.3
C++ Version: 201402
Intel(R) oneAPI Math Kernel Library Version 2021.4-Product Build 20210904 for Intel(R) 64 architecture applications
Intel(R) MKL-DNN v2.2.3 (Git Hash 7336ca9f055cf1bfa13efb658fe15dc9b41f0740)
OpenMP 201511 (a.k.a. OpenMP 4.5)
LAPACK is enabled (usually provided by MKL)
NNPACK is enabled
CPU capability usage: AVX2
CUDA Runtime 11.3
NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86
CuDNN 8.2
Magma 2.5.2
Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.3, CUDNN_VERSION=8.2.0, CXX_COMPILER=/opt/rh/devtoolset-7/root/usr/bin/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.10.1, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON,

TorchVision: 0.11.2+cu113
OpenCV: 4.8.0
MMEngine: 0.8.4
MMSegmentation: 1.1.1+30a3f94

You may add additional information that may be helpful for locating the problem, such as
- How you installed PyTorch [e.g., pip, conda, source]
- Other environment variables that may be related (such as $PATH, $LD_LIBRARY_PATH, $PYTHONPATH, etc.)

I installed and configured the environment by following this tutorial:

https://github.com/TommyZihao/MMSegmentation_Tutorials/blob/main/20230816/%E3%80%90A1%E3%80%91%E5%AE%89%E8%A3%85%E9%85%8D%E7%BD%AEMMSegmentation.ipynb

Error traceback

If applicable, paste the error traceback here.

Traceback (most recent call last):
  File "tools/train.py", line 104, in <module>
    main()
  File "tools/train.py", line 100, in main
    runner.train()
  File "/home/gjc23/anaconda3/envs/pytorch/lib/python3.7/site-packages/mmengine/runner/runner.py", line 1745, in train
    model = self.train_loop.run()  # type: ignore
  File "/home/gjc23/anaconda3/envs/pytorch/lib/python3.7/site-packages/mmengine/runner/loops.py", line 278, in run
    self.run_iter(data_batch)
  File "/home/gjc23/anaconda3/envs/pytorch/lib/python3.7/site-packages/mmengine/runner/loops.py", line 302, in run_iter
    data_batch, optim_wrapper=self.runner.optim_wrapper)
  File "/home/gjc23/anaconda3/envs/pytorch/lib/python3.7/site-packages/mmengine/model/base_model/base_model.py", line 114, in train_step
    losses = self._run_forward(data, mode='loss')  # type: ignore
  File "/home/gjc23/anaconda3/envs/pytorch/lib/python3.7/site-packages/mmengine/model/base_model/base_model.py", line 340, in _run_forward
    results = self(**data, mode=mode)
  File "/home/gjc23/anaconda3/envs/pytorch/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/data1/gjc23/mmsegmentation/mmseg/models/segmentors/base.py", line 94, in forward
    return self.loss(inputs, data_samples)
  File "/data1/gjc23/mmsegmentation/mmseg/models/segmentors/encoder_decoder.py", line 178, in loss
    loss_decode = self._decode_head_forward_train(x, data_samples)
  File "/data1/gjc23/mmsegmentation/mmseg/models/segmentors/encoder_decoder.py", line 140, in _decode_head_forward_train
    self.train_cfg)
  File "/data1/gjc23/mmsegmentation/mmseg/models/decode_heads/decode_head.py", line 262, in loss
    losses = self.loss_by_feat(seg_logits, batch_data_samples)
  File "/data1/gjc23/mmsegmentation/mmseg/models/decode_heads/decode_head.py", line 337, in loss_by_feat
    seg_logits, seg_label, ignore_index=self.ignore_index)
  File "/data1/gjc23/mmsegmentation/mmseg/models/losses/accuracy.py", line 49, in accuracy
    correct = correct[:, target != ignore_index]
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

Bug fix

The error message says an error occurred while calculating accuracy, but the other networks calculate accuracy without any error, even when the accuracy is 0. Looking forward to your reply, thank you!
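A device-side assert in a loss or accuracy computation is often triggered by label values that lie outside `[0, num_classes)` and are not equal to `ignore_index`. Whether that is the cause here is not confirmed anywhere in this thread. The sketch below only shows how such values could be detected before training. The function name `find_invalid_labels`, the default `ignore_index=255`, and `num_classes=11` in the toy call are assumptions for illustration, and the label map is passed as plain nested lists so the example has no dependencies.

```python
def find_invalid_labels(label_map, num_classes, ignore_index=255):
    """Return the sorted label values that are neither valid classes nor ignored."""
    invalid = set()
    for row in label_map:
        for value in row:
            if value != ignore_index and not (0 <= value < num_classes):
                invalid.add(value)
    return sorted(invalid)


# Toy 2x3 label map; with num_classes=11 the valid class ids are 0..10.
toy = [[0, 3, 255],
       [10, 11, 2]]
print(find_invalid_labels(toy, num_classes=11))  # [11]
```

In practice the label map would come from an annotation image loaded as an integer array. An empty result for every annotation file would rule out this particular cause.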

Sep 17 '23 03:09 SummerTide

Following this thread. I got a similar error after setting class_weight.

Nov 16 '23 12:11 longtimenoseeyou

There seems to be a problem with the code. Have you solved it? @SummerTide @longtimenoseeyou

Dec 27 '23 07:12 JH95-ai

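"There seems to be a problem with the code. Have you solved it? @SummerTide @longtimenoseeyou"

You can set CUDA_VISIBLE_DEVICES=-1 to disable the GPU and run your program on the CPU. That way you can see the specific error message.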
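The suggestion above, together with the `CUDA_LAUNCH_BLOCKING=1` hint printed in the traceback, can be applied from a small launcher script. This is a minimal sketch and not part of MMSegmentation: `debug_env` and `run_on_cpu` are hypothetical helper names, and the command in the final comment is the training command from the original report.

```python
import os
import subprocess
import sys


def debug_env(base_env=None):
    """Copy an environment and add settings that make CUDA errors easier to read."""
    env = dict(os.environ if base_env is None else base_env)
    # Hide all GPUs so that the framework has to fall back to the CPU.
    env["CUDA_VISIBLE_DEVICES"] = "-1"
    # Only has an effect when a GPU is visible; kept for runs where the line above is removed.
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


def run_on_cpu(script_args):
    """Run `python <script_args...>` with the debug environment and return the exit code."""
    return subprocess.run([sys.executable, *script_args], env=debug_env()).returncode


# Example (not executed here):
# run_on_cpu(["tools/train.py", "my_configs/rescuenet_fast-scnn.py"])
```

On the CPU the failing operation typically raises an ordinary Python exception at the line that caused it, instead of an asynchronous device-side assert reported at a later call.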

Dec 27 '23 07:12 longtimenoseeyou

thanks

Dec 29 '23 01:12 JH95-ai