
"CUDA out of memory." at training.

Open sounansu opened this issue 6 years ago • 8 comments

Hi! I am trying to train with this command (on my Windows PC with an RTX 2070):

F:\Users\sounansu\Anaconda3\FCHarDNet>python train.py --config configs\hardnet.yml
.....
RuntimeError: CUDA out of memory. Tried to allocate 40.00 MiB (GPU 0; 8.00 GiB total capacity; 5.98 GiB already allocated; 24.97 MiB free; 30.09 MiB cached)

Please teach me how to modify hardnet.yml!

sounansu avatar Nov 26 '19 14:11 sounansu

Hello, thanks for reaching out. The current config requires about 20 GB of GPU memory for training. For single-GPU training with 11-12 GB of memory, you can try reducing the image resolution from [1024, 1024] to [768, 768], though this may lower the mIoU to about 0.76 (val). Please note that you will need to modify img_rows, img_cols, and rscale_crop.
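For reference, the relevant fields in configs/hardnet.yml would look roughly like this after the change (a sketch; the surrounding keys follow the diff posted later in this thread, and other fields are omitted):

```yaml
data:
        img_rows: 768
        img_cols: 768
training:
        augmentations:
                hflip: 0.5
                rscale_crop: [768, 768]
```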

PingoLH avatar Nov 26 '19 15:11 PingoLH

Thank you @PingoLH ! I will try to train with modified hardnet.yml. And, will report mIoU value with that parameter.

sounansu avatar Nov 27 '19 14:11 sounansu

I changed hardnet.yml to img_rows: 512 and img_cols: 512 and started training. But at training iteration 500,

Iter [500/90000]  Loss: 1.2908  Time/Image: 0.0273  lr=0.019900
INFO:ptsemseg:Iter [500/90000]  Loss: 1.2908  Time/Image: 0.0273  lr=0.019900
1it [00:15, 15.10s/it]Traceback (most recent call last):
  File "train.py", line 267, in 
    train(cfg, writer, logger)
  File "train.py", line 186, in train
    outputs = model(images_val)
  File "F:\Users\sounansu\Anaconda3New\envs\FCHarDNet\lib\site-packages\torch\nn\modules\module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "F:\Users\sounansu\Anaconda3New\envs\FCHarDNet\lib\site-packages\torch\nn\parallel\data_parallel.py", line 150, in forward
    return self.module(*inputs[0], **kwargs[0])
  File "F:\Users\sounansu\Anaconda3New\envs\FCHarDNet\lib\site-packages\torch\nn\modules\module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "F:\Users\sounansu\Anaconda3New\FCHarDNet\ptsemseg\models\hardnet.py", line 186, in forward
    out = self.transUpBlocks[i](out, skip, True)
  File "F:\Users\sounansu\Anaconda3New\envs\FCHarDNet\lib\site-packages\torch\nn\modules\module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "F:\Users\sounansu\Anaconda3New\FCHarDNet\ptsemseg\models\hardnet.py", line 99, in forward
    out = torch.cat([out, skip], 1)
RuntimeError: CUDA out of memory. Tried to allocate 1008.00 MiB (GPU 0; 8.00 GiB total capacity; 4.28 GiB already allocated; 796.97 MiB free; 998.90 MiB cached)
1it [00:19, 19.59s/it]

an out-of-memory error occurred again.
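As a rough sanity check on allocations like the one above: a float32 tensor of shape N x C x H x W needs N * C * H * W * 4 bytes, and torch.cat must allocate a fresh block for the concatenated result. The sizes below are hypothetical, purely to illustrate the arithmetic:

```python
def tensor_mib(n, c, h, w, bytes_per_elem=4):
    """Memory footprint of an n x c x h x w float32 tensor, in MiB."""
    return n * c * h * w * bytes_per_elem / 2**20

# A single 256-channel feature map at 512x1024 already takes 512 MiB,
# so concatenating two such maps needs a fresh block of roughly 1 GiB,
# which is the scale of the failed allocation in the traceback above.
print(tensor_mib(1, 256, 512, 1024))  # 512.0
```

This is why the crash happens in the validation pass: the validation loader runs at the full (1024, 2048) resolution, so its activations are several times larger than those of the 512x512 training crops.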

sounansu avatar Nov 27 '19 22:11 sounansu

I modified hardnet.yml and train.py as below:

diff --git a/configs/hardnet.yml b/configs/hardnet.yml
index e3e14a6..bfac3bf 100644
--- a/configs/hardnet.yml
+++ b/configs/hardnet.yml
@@ -4,10 +4,10 @@ data:
        dataset: cityscapes
        train_split: train
        val_split: val
-    img_rows: 1024
-    img_cols: 1024
-    path: /mnt/ssd2/Cityscapes/
-    sbd_path: /mnt/ssd2/Cityscapes/
+    img_rows: 512
+    img_cols: 512
+    path: F:\image_data\Cityscape\leftImg8bit
+    sbd_path: F:\image_data\Cityscape\leftImg8bit
 training:
        train_iters: 90000
        batch_size: 16
@@ -16,7 +16,7 @@ training:
        print_interval: 10
        augmentations:
                hflip: 0.5
-        rscale_crop: [1024, 1024]
+        rscale_crop: [512, 512]
        optimizer:
                name: 'sgd'
                lr: 0.02
diff --git a/train.py b/train.py
index 172e917..57746e6 100644
--- a/train.py
+++ b/train.py
@@ -57,7 +57,7 @@ def train(cfg, writer, logger):
                data_path,
                is_transform=True,
                split=cfg["data"]["val_split"],
-        img_size=(1024,2048),
+        img_size=(cfg["data"]["img_rows"], cfg["data"]["img_cols"]),
        )

        n_classes = t_loader.n_classes

and trained again.
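The train.py change above just makes the validation loader's img_size follow the config instead of the hard-coded (1024, 2048). In plain Python terms (hypothetical cfg dict standing in for the parsed YAML):

```python
# cfg mirrors the parsed hardnet.yml after the edit above.
cfg = {"data": {"img_rows": 512, "img_cols": 512}}

# The validation loader now inherits whatever resolution the config sets.
img_size = (cfg["data"]["img_rows"], cfg["data"]["img_cols"])
print(img_size)  # (512, 512)
```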

I measured validation performance.

(FCHarDNet) F:\Users\sounansu\Anaconda3New\FCHarDNet>python validate.py --config configs\hardnet.yml --model_path runs\hardnet\cur\hardnet_cityscapes_best_model.pkl
....
Total Frame Rate = 33.85 fps
Overall Acc:     0.9560427663307451
Mean Acc :       0.8086355461508691
FreqW Acc :      0.9193348521709037
Mean IoU :       0.7240548654125439
0 0.9793078776886329
1 0.8371592154910068
2 0.918241383070455
3 0.566010215830134
4 0.579013853663639
5 0.6087141064956788
6 0.6494385999487208
7 0.7526571660539968
8 0.9192169494945195
9 0.6232992244377927
10 0.9399414673242146
11 0.7892767003086372
12 0.5512253643936255
13 0.9420434722036333
14 0.6960866120115173
15 0.7649974688063472
16 0.41760177053657616
17 0.4881134901126577
18 0.7346975049665497

sounansu avatar Nov 30 '19 01:11 sounansu

Hi sounansu, thank you so much for the feedback and report. You can also try a smaller batch size with a higher resolution if you are interested. I would also recommend keeping the full resolution for v_loader with a smaller batch size, so that the images are not distorted during validation. Thanks!
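Activation memory grows roughly linearly with batch size and with pixel count, so the two knobs trade off against each other. A small sketch of that rule of thumb (the proportionality is approximate and ignores the fixed cost of the model weights):

```python
def relative_activation_memory(batch, h, w, ref_batch, ref_h, ref_w):
    """Approximate activation-memory ratio versus a reference setting."""
    return (batch * h * w) / (ref_batch * ref_h * ref_w)

# Dropping the batch size from 16 to 4 frees enough room to double the
# resolution from 512x512 to 1024x1024 at a similar memory budget.
ratio = relative_activation_memory(4, 1024, 1024, 16, 512, 512)
print(ratio)  # 1.0
```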

PingoLH avatar Nov 30 '19 18:11 PingoLH

Thank you for the additional advice.

I modified the batch size as below:

-    batch_size: 16
+    batch_size: 4

The validation values are:

Total Frame Rate = 34.36 fps
Overall Acc:     0.9550504199267023
Mean Acc :       0.8234178483937832
FreqW Acc :      0.9176348176979007
Mean IoU :       0.728196572228794
0 0.9797917075564525
1 0.8329137830933465
2 0.9146849932283128
3 0.5196456033824945
4 0.5477201100625065
5 0.6087271166747225
6 0.6399322002349301
7 0.7507402719029521
8 0.9204021515861227
9 0.6014830162332917
10 0.9410106909056067
11 0.7900416534638349
12 0.5772749706560611
13 0.9361805056874339
14 0.6014725882823072
15 0.7831162906176867
16 0.6354351352905586
17 0.5277508688827295
18 0.7274112146057334
The mIoU is a little better than before!
Thank you!

sounansu avatar Dec 02 '19 22:12 sounansu

Could you please tell me which version of the scipy package you used for training? In a Linux environment? And what about under Windows?

18022443868 avatar Mar 06 '20 10:03 18022443868

1024,2048

Please check your Windows PyTorch version and the versions of the other packages.

18022443868 avatar Mar 06 '20 10:03 18022443868