Loss Explosion?

Open Mogarbobac opened this issue 2 years ago • 4 comments

`Calculating mAP...

       |  all  |  .50  |  .55  |  .60  |  .65  |  .70  |  .75  |  .80  |  .85  |  .90  |  .95  |
-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+
   box | 54.24 | 85.73 | 84.12 | 81.72 | 76.59 | 70.50 | 63.46 | 50.60 | 25.02 |  4.50 |  0.15 |
  mask | 50.89 | 76.12 | 73.29 | 70.37 | 66.27 | 60.96 | 55.37 | 47.63 | 40.84 | 17.69 |  0.42 |
-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+

[549] 25810 || B: 0.957 | C: 1.192 | M: 1.161 | S: 0.029 | T: 3.340 || ETA: 9 days, 6:55:01 || timer: 0.656
[549] 25820 || B: 1.010 | C: 1.203 | M: 1.180 | S: 0.029 | T: 3.423 || ETA: 9 days, 5:27:12 || timer: 0.672
[549] 25830 || B: 0.984 | C: 1.172 | M: 1.171 | S: 0.029 | T: 3.356 || ETA: 9 days, 5:24:58 || timer: 0.687
[549] 25840 || B: 1.022 | C: 1.191 | M: 1.201 | S: 0.029 | T: 3.443 || ETA: 9 days, 5:21:45 || timer: 0.656
[550] 25850 || B: 1.040 | C: 1.206 | M: 1.183 | S: 0.030 | T: 3.458 || ETA: 9 days, 6:41:56 || timer: 7.139
[550] 25860 || B: 1.065 | C: 1.216 | M: 1.198 | S: 0.033 | T: 3.512 || ETA: 9 days, 6:39:25 || timer: 0.656
[550] 25870 || B: 1.104 | C: 1.220 | M: 1.204 | S: 0.034 | T: 3.561 || ETA: 9 days, 1:03:49 || timer: 0.672
[550] 25880 || B: 1.102 | C: 1.210 | M: 1.228 | S: 0.035 | T: 3.575 || ETA: 9 days, 1:02:00 || timer: 0.672
[550] 25890 || B: 1.167 | C: 1.272 | M: 1.234 | S: 0.036 | T: 3.709 || ETA: 9 days, 0:58:45 || timer: 0.672

Computing validation mAP (this may take a while)...

Calculating mAP...

       |  all  |  .50  |  .55  |  .60  |  .65  |  .70  |  .75  |  .80  |  .85  |  .90  |  .95  |
-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+
   box | 45.11 | 76.88 | 74.73 | 71.12 | 66.21 | 59.49 | 50.35 | 34.31 | 15.36 |  2.54 |  0.13 |
  mask | 39.38 | 65.80 | 63.53 | 59.55 | 54.96 | 50.34 | 43.75 | 32.93 | 19.52 |  3.40 |  0.07 |
-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+

[551] 25900 || B: 1.190 | C: 1.298 | M: 1.234 | S: 0.038 | T: 3.761 || ETA: 9 days, 6:23:57 || timer: 0.672
[551] 25910 || B: 1.245 | C: 1.379 | M: 1.283 | S: 0.041 | T: 3.948 || ETA: 9 days, 4:49:30 || timer: 0.672
[551] 25920 || B: 1.262 | C: 1.488 | M: 1.316 | S: 0.045 | T: 4.110 || ETA: 9 days, 4:46:56 || timer: 0.687
[551] 25930 || B: 1.301 | C: 1.527 | M: 1.361 | S: 0.047 | T: 4.236 || ETA: 9 days, 4:46:12 || timer: 0.672
[551] 25940 || B: 1.334 | C: 1.597 | M: 1.390 | S: 0.049 | T: 4.370 || ETA: 9 days, 4:45:20 || timer: 0.656
[552] 25950 || B: 1.392 | C: 1.683 | M: 1.450 | S: 0.050 | T: 4.575 || ETA: 9 days, 6:05:11 || timer: 0.656
[552] 25960 || B: 1.421 | C: 1.746 | M: 1.496 | S: 0.051 | T: 4.713 || ETA: 9 days, 0:39:18 || timer: 0.656
[552] 25970 || B: 1.467 | C: 1.798 | M: 1.512 | S: 0.052 | T: 4.828 || ETA: 9 days, 0:38:28 || timer: 0.672
[552] 25980 || B: 1.564 | C: 1.884 | M: 1.569 | S: 0.057 | T: 5.074 || ETA: 9 days, 0:38:05 || timer: 0.672
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Loss.cu:115: block: [7,0,0], thread: [64,0,0] Assertion input_val >= zero && input_val <= one failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Loss.cu:115: block: [7,0,0], thread: [65,0,0] Assertion input_val >= zero && input_val <= one failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Loss.cu:115: block: [7,0,0], thread: [66,0,0] Assertion input_val >= zero && input_val <= one failed.

...

Traceback (most recent call last):
  File "train.py", line 504, in <module>
    train()
  File "train.py", line 307, in train
    losses = net(datum)
  File "C:\EngTools\Anaconda3\2018.12\envs\yolact-env\lib\site-packages\torch\nn\modules\module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:\EngTools\Anaconda3\2018.12\envs\yolact-env\lib\site-packages\torch\nn\parallel\data_parallel.py", line 166, in forward
    return self.module(*inputs[0], **kwargs[0])
  File "C:\EngTools\Anaconda3\2018.12\envs\yolact-env\lib\site-packages\torch\nn\modules\module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "train.py", line 146, in forward
    losses = self.criterion(self.net, preds, targets, masks, num_crowds)
  File "C:\EngTools\Anaconda3\2018.12\envs\yolact-env\lib\site-packages\torch\nn\modules\module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:\YOLACT\layers\modules\multibox_loss.py", line 159, in forward
    ret = self.lincomb_mask_loss(pos, idx_t, loc_data, mask_data, priors, proto_data, masks, gt_box_t, score_data, inst_data, labels)
  File "C:\YOLACT\layers\modules\multibox_loss.py", line 546, in lincomb_mask_loss
    pos_idx_t = idx_t[idx, cur_pos]
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.`

I'm at a loss. I've looked into the possible solutions, but I just can't figure out what is causing this. I've fixed all annotation file issues. What could this be? Thank you.

Mogarbobac, Apr 16 '22 17:04
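
The assertion in the log above (`input_val >= zero && input_val <= one` in `Loss.cu`) appears to be the input-range check of PyTorch's binary cross-entropy CUDA kernel. That check fails for any prediction that is NaN or falls outside [0, 1], which would be consistent with the losses climbing steadily right before the crash. Below is a minimal plain-Python sketch of the same condition, only to show which values trip it; `find_invalid_probs` is a hypothetical helper name, not part of YOLACT or PyTorch.

```python
import math


def find_invalid_probs(values):
    """Return the indices of values that fail the 0 <= v <= 1 check."""
    bad = []
    for i, v in enumerate(values):
        # NaN fails every comparison, so it is rejected just like
        # an out-of-range value.
        if math.isnan(v) or not (0.0 <= v <= 1.0):
            bad.append(i)
    return bad


# Illustrative values only (not taken from the training run above).
sample = [0.2, 1.0, float("nan"), 1.3, -0.1]
print(find_invalid_probs(sample))
```

As the error message itself suggests, rerunning with the environment variable `CUDA_LAUNCH_BLOCKING=1` should make the traceback point at the call that actually raised the assert, rather than at a later line such as `idx_t[idx, cur_pos]`.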

Did you solve it? I had a similar problem when retraining on a homemade dataset.

74284853, May 10 '22 04:05

Hi, I had the same problem. For me it was because I had too many digits in the values in my annotation file (both segmentation and bbox). Rounding them to 2 decimal places worked for me (e.g. 162.38 instead of 162.3839428458).

SamPujade, Aug 31 '22 11:08
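
The rounding fix described above could be sketched roughly as follows, assuming a COCO-style annotation dictionary with an `annotations` list whose entries carry `bbox` and `segmentation` fields. The function names are made up for illustration, and this is one possible way to do it rather than the exact script used in the thread.

```python
def round_nested(value, ndigits=2):
    """Round floats inside arbitrarily nested lists; leave other types alone."""
    if isinstance(value, float):
        return round(value, ndigits)
    if isinstance(value, list):
        return [round_nested(v, ndigits) for v in value]
    # Integers, strings, and dicts (e.g. RLE masks) are returned unchanged.
    return value


def round_annotations(coco, ndigits=2):
    """Round bbox and segmentation coordinates of a COCO-style dict in place."""
    for ann in coco.get("annotations", []):
        for key in ("bbox", "segmentation"):
            if key in ann:
                ann[key] = round_nested(ann[key], ndigits)
    return coco


# Tiny in-memory example; a real file would be read with json.load and
# written back with json.dump.
example = {
    "annotations": [
        {
            "bbox": [162.3839428458, 10.0, 20.5, 30],
            "segmentation": [[162.3839428458, 10.12345, 20.5, 30.0]],
        }
    ]
}
print(round_annotations(example)["annotations"][0]["bbox"])
```

Writing the result to a new file instead of overwriting the original keeps the untouched annotations available for comparison if the rounding turns out not to help.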

> Hi, I had the same problem. For me it was because I had too many digits in the values in my annotation file (both segmentation and bbox). Rounding them to 2 decimal places worked for me (e.g. 162.38 instead of 162.3839428458).

In my annotation file, the segmentation and bbox values are integers. Could it be a problem with the batch size or something similar?

nesi73, Mar 16 '23 09:03

Strangely, I still get the same error when I use the open-source trained model for conversion.

74284853, Mar 16 '23 09:03