bisenet-tensorflow

How do you train on the Cityscapes dataset?

Open: fanhuanhuan opened this issue 5 years ago • 4 comments

I tried to use your model framework to train on Cityscapes, but the loss becomes very large. Fine-tuning from your checkpoint (CKPT) has the same issue. Can you tell me which code I need to change? Thank you!

The output:

name: Tesla V100-SXM2-32GB, pci bus id: 0000:8a:00.0, compute capability: 7.0)
INFO - root - Train for 371875 steps
INFO - root - 2019-11-23 02:40:07.439513: step 0, total loss = 4.58, predict loss = 1.52, mean_iou = 0.000 (0.8 examples/sec; 19.793 sec/batch; 2044h:37m:55s remains)
INFO - root - 2019-11-23 02:40:44.343236: step 10, total loss = 2.92, predict loss = 1.03, mean_iou = 0.059 (6.1 examples/sec; 2.603 sec/batch; 268h:55m:19s remains)
INFO - root - 2019-11-23 02:41:10.206275: step 20, total loss = 3.12, predict loss = 1.07, mean_iou = 0.083 (6.7 examples/sec; 2.391 sec/batch; 247h:01m:01s remains)
INFO - root - 2019-11-23 02:41:35.279294: step 30, total loss = 3.01, predict loss = 1.01, mean_iou = 0.070 (6.7 examples/sec; 2.404 sec/batch; 248h:21m:35s remains)
INFO - root - 2019-11-23 02:41:59.938955: step 40, total loss = 3.36, predict loss = 1.13, mean_iou = 0.077 (6.6 examples/sec; 2.432 sec/batch; 251h:09m:57s remains)
INFO - root - 2019-11-23 02:42:24.156087: step 50, total loss = 3.76, predict loss = 1.23, mean_iou = 0.078 (7.5 examples/sec; 2.136 sec/batch; 220h:34m:00s remains)
INFO - root - 2019-11-23 02:42:49.229527: step 60, total loss = 3.51, predict loss = 1.16, mean_iou = 0.073 (6.9 examples/sec; 2.333 sec/batch; 240h:59m:09s remains)
INFO - root - 2019-11-23 02:43:14.525177: step 70, total loss = 2.67, predict loss = 0.89, mean_iou = 0.072 (6.9 examples/sec; 2.317 sec/batch; 239h:15m:06s remains)
INFO - root - 2019-11-23 02:43:38.998149: step 80, total loss = 3.72, predict loss = 1.26, mean_iou = 0.068 (7.4 examples/sec; 2.172 sec/batch; 224h:17m:07s remains)
INFO - root - 2019-11-23 02:44:04.097254: step 90, total loss = 2.81, predict loss = 0.94, mean_iou = 0.070 (6.8 examples/sec; 2.339 sec/batch; 241h:30m:59s remains)
INFO - root - 2019-11-23 02:44:28.643472: step 100, total loss = 3.27, predict loss = 1.08, mean_iou = 0.066 (7.2 examples/sec; 2.208 sec/batch; 228h:04m:23s remains)
INFO - root - 2019-11-23 02:44:53.291283: step 110, total loss = 3.66, predict loss = 1.21, mean_iou = 0.066 (7.1 examples/sec; 2.269 sec/batch; 234h:19m:05s remains)
INFO - root - 2019-11-23 02:45:18.099931: step 120, total loss = 3.65, predict loss = 1.20, mean_iou = 0.067 (6.9 examples/sec; 2.320 sec/batch; 239h:36m:17s remains)
INFO - root - 2019-11-23 02:45:43.630463: step 130, total loss = 2.95, predict loss = 0.95, mean_iou = 0.065 (6.4 examples/sec; 2.492 sec/batch; 257h:18m:32s remains)
INFO - root - 2019-11-23 02:46:08.486235: step 140, total loss = 3.13, predict loss = 1.01, mean_iou = 0.065 (6.7 examples/sec; 2.380 sec/batch; 245h:44m:13s remains)
INFO - root - 2019-11-23 02:46:33.434408: step 150, total loss = 3.64, predict loss = 1.19, mean_iou = 0.064 (7.0 examples/sec; 2.283 sec/batch; 235h:42m:14s remains)
INFO - root - 2019-11-23 02:46:57.587377: step 160, total loss = 3.80, predict loss = 1.23, mean_iou = 0.066 (7.4 examples/sec; 2.149 sec/batch; 221h:53m:32s remains)
INFO - root - 2019-11-23 02:47:26.498170: step 170, total loss = 3.82, predict loss = 1.23, mean_iou = 0.066 (7.3 examples/sec; 2.193 sec/batch; 226h:25m:32s remains)
INFO - root - 2019-11-23 02:47:51.260680: step 180, total loss = 3.33, predict loss = 1.03, mean_iou = 0.065 (7.3 examples/sec; 2.197 sec/batch; 226h:51m:14s remains)
INFO - root - 2019-11-23 02:48:17.376060: step 190, total loss = 3.23, predict loss = 1.01, mean_iou = 0.067 (6.9 examples/sec; 2.328 sec/batch; 240h:21m:30s remains)
INFO - root - 2019-11-23 02:48:42.515721: step 200, total loss = 3.87, predict loss = 1.20, mean_iou = 0.067 (7.0 examples/sec; 2.302 sec/batch; 237h:36m:53s remains)
INFO - root - 2019-11-23 02:49:07.094238: step 210, total loss = 3.83, predict loss = 1.13, mean_iou = 0.066 (6.5 examples/sec; 2.446 sec/batch; 252h:32m:18s remains)
INFO - root - 2019-11-23 02:49:32.424956: step 220, total loss = 3.39, predict loss = 1.07, mean_iou = 0.069 (6.9 examples/sec; 2.315 sec/batch; 239h:00m:21s remains)
INFO - root - 2019-11-23 02:49:57.397433: step 230, total loss = 7.11, predict loss = 2.62, mean_iou = 0.068 (7.0 examples/sec; 2.278 sec/batch; 235h:08m:20s remains)
INFO - root - 2019-11-23 02:50:22.523897: step 240, total loss = 8.71, predict loss = 2.34, mean_iou = 0.068 (6.7 examples/sec; 2.391 sec/batch; 246h:51m:57s remains)
INFO - root - 2019-11-23 02:50:47.389149: step 250, total loss = 8.84, predict loss = 1.41, mean_iou = 0.067 (7.2 examples/sec; 2.225 sec/batch; 229h:44m:03s remains)
INFO - root - 2019-11-23 02:51:12.569414: step 260, total loss = 11.53, predict loss = 2.91, mean_iou = 0.065 (6.9 examples/sec; 2.322 sec/batch; 239h:40m:32s remains)
INFO - root - 2019-11-23 02:51:38.063683: step 270, total loss = 28.99, predict loss = 5.99, mean_iou = 0.064 (6.7 examples/sec; 2.383 sec/batch; 246h:01m:19s remains)
INFO - root - 2019-11-23 02:52:02.863234: step 280, total loss = 28.98, predict loss = 8.98, mean_iou = 0.065 (6.7 examples/sec; 2.403 sec/batch; 248h:00m:57s remains)
INFO - root - 2019-11-23 02:52:27.547999: step 290, total loss = 36.74, predict loss = 5.29, mean_iou = 0.065 (6.9 examples/sec; 2.308 sec/batch; 238h:13m:06s remains)
INFO - root - 2019-11-23 02:52:52.677785: step 300, total loss = 66.27, predict loss = 19.59, mean_iou = 0.067 (6.5 examples/sec; 2.466 sec/batch; 254h:31m:41s remains)
INFO - root - 2019-11-23 02:53:17.561659: step 310, total loss = 118.04, predict loss = 29.41, mean_iou = 0.068 (6.9 examples/sec; 2.312 sec/batch; 238h:39m:21s remains)
INFO - root - 2019-11-23 02:53:42.351794: step 320, total loss = 196.11, predict loss = 78.41, mean_iou = 0.067 (6.1 examples/sec; 2.621 sec/batch; 270h:29m:50s remains)
INFO - root - 2019-11-23 02:54:07.523196: step 330, total loss = 122.57, predict loss = 31.54, mean_iou = 0.067 (6.7 examples/sec; 2.379 sec/batch; 245h:28m:51s remains)
INFO - root - 2019-11-23 02:54:35.756016: step 340, total loss = 439.37, predict loss = 149.81, mean_iou = 0.070 (7.0 examples/sec; 2.301 sec/batch; 237h:28m:26s remains)
INFO - root - 2019-11-23 02:55:00.994508: step 350, total loss = 238.67, predict loss = 64.45, mean_iou = 0.071 (7.1 examples/sec; 2.247 sec/batch; 231h:50m:55s remains)
INFO - root - 2019-11-23 02:55:25.943258: step 360, total loss = 518.69, predict loss = 223.72, mean_iou = 0.072 (6.8 examples/sec; 2.351 sec/batch; 242h:34m:15s remains)
INFO - root - 2019-11-23 02:55:50.487610: step 370, total loss = 670.67, predict loss = 196.96, mean_iou = 0.071 (6.7 examples/sec; 2.386 sec/batch; 246h:10m:31s remains)
INFO - root - 2019-11-23 02:56:16.482531: step 380, total loss = 821.25, predict loss = 196.81, mean_iou = 0.070 (6.7 examples/sec; 2.397 sec/batch; 247h:23m:46s remains)
INFO - root - 2019-11-23 02:56:41.164479: step 390, total loss = 1339.13, predict loss = 725.03, mean_iou = 0.071 (7.0 examples/sec; 2.284 sec/batch; 235h:43m:06s remains)
INFO - root - 2019-11-23 02:57:06.145088: step 400, total loss = 867.97, predict loss = 295.67, mean_iou = 0.071 (6.9 examples/sec; 2.311 sec/batch; 238h:28m:46s remains)
INFO - root - 2019-11-23 02:57:31.018449: step 410, total loss = 4501.52, predict loss = 1179.62, mean_iou = 0.071 (6.7 examples/sec; 2.381 sec/batch; 245h:37m:57s remains)
INFO - root - 2019-11-23 02:57:55.855274: step 420, total loss = 2119.14, predict loss = 350.87, mean_iou = 0.071 (6.8 examples/sec; 2.337 sec/batch; 241h:10m:03s remains)
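In the log above the total loss stays roughly between 2.7 and 3.9 up to step 220, then jumps to 7.11 at step 230 and keeps climbing. A minimal sketch for locating that point automatically in a saved log follows. It assumes exactly the line format shown in this log; the `warmup` and `factor` thresholds are arbitrary illustrative choices, not values taken from bisenet-tensorflow.

```python
import re
import statistics

# Assumes the per-step format printed in this log, e.g.
# "step 230, total loss = 7.11, predict loss = 2.62, ..."
STEP_RE = re.compile(
    r"step (\d+), total loss = ([\d.]+), predict loss = ([\d.]+)"
)


def parse_losses(log_text):
    """Return a list of (step, total_loss, predict_loss) tuples."""
    return [
        (int(step), float(total), float(predict))
        for step, total, predict in STEP_RE.findall(log_text)
    ]


def first_divergent_step(records, warmup=10, factor=2.0):
    """Return the first step after the warm-up records whose total loss
    exceeds `factor` times the median total loss of the first `warmup`
    records, or None if no such step exists (or there are too few records).

    `warmup` and `factor` are illustrative thresholds, not tuned values.
    """
    if len(records) <= warmup:
        return None
    baseline = statistics.median(total for _, total, _ in records[:warmup])
    for step, total, _ in records[warmup:]:
        if total > factor * baseline:
            return step
    return None
```

On the full log above with the default thresholds, the warm-up median is 3.24, so the first step whose total loss exceeds 6.48 is reported: step 230.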

fanhuanhuan, Nov 23 '19 03:11
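The thread does not say which label format the repository's data pipeline expects, so the following is only an assumption worth checking. Cityscapes `*_gtFine_labelIds.png` files store label ids 0 to 33, whereas 19-class training normally uses trainIds (0 to 18, with 255 meaning "ignore"). The official cityscapesScripts tool `createTrainIdLabelImgs.py` writes `*_gtFine_labelTrainIds.png` files with that mapping applied. The pure-Python sketch below shows the mapping itself; `remap_label_ids` is a hypothetical helper, not part of bisenet-tensorflow.

```python
# Cityscapes label id -> train id for the 19 evaluated classes, following the
# table in cityscapesScripts' labels.py. Every other id maps to IGNORE_LABEL.
IGNORE_LABEL = 255

ID_TO_TRAIN_ID = {
    7: 0,    # road
    8: 1,    # sidewalk
    11: 2,   # building
    12: 3,   # wall
    13: 4,   # fence
    17: 5,   # pole
    19: 6,   # traffic light
    20: 7,   # traffic sign
    21: 8,   # vegetation
    22: 9,   # terrain
    23: 10,  # sky
    24: 11,  # person
    25: 12,  # rider
    26: 13,  # car
    27: 14,  # truck
    28: 15,  # bus
    31: 16,  # train
    32: 17,  # motorcycle
    33: 18,  # bicycle
}


def remap_label_ids(label_rows):
    """Hypothetical helper: convert a 2-D label map (a list of rows of
    Cityscapes label ids) into train ids, sending unlisted ids to 255."""
    return [
        [ID_TO_TRAIN_ID.get(label_id, IGNORE_LABEL) for label_id in row]
        for row in label_rows
    ]
```

For whole images a lookup-table approach (or simply the official script) is faster; the nested lists here only make the mapping explicit.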

You can try this code. I'm too lazy to modify the code in the GitHub repo myself. By the way, I also have no idea why the loss becomes that large. @fanhuanhuan

pdoublerainbow, Nov 23 '19 04:11
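The thread never identifies the cause of the growing loss. A generic mitigation often tried when loss climbs like this (not confirmed here to fix this particular run) is lowering the learning rate or clipping gradients by global norm. TensorFlow provides `tf.clip_by_global_norm` for the latter. The pure-Python sketch below only illustrates the arithmetic that function performs, with gradients represented as plain lists of floats; it is not a drop-in change to the repository.

```python
import math


def clip_by_global_norm(grads, clip_norm):
    """Scale every gradient by clip_norm / max(global_norm, clip_norm).

    `grads` is a list of gradient vectors (lists of floats) and `clip_norm`
    must be positive. Returns the clipped gradients and the global norm
    measured before clipping. Gradients are left unchanged when the global
    norm is already at most `clip_norm`.
    """
    # Global norm: L2 norm over all gradient entries taken together.
    global_norm = math.sqrt(sum(g * g for vec in grads for g in vec))
    scale = clip_norm / max(global_norm, clip_norm)
    clipped = [[g * scale for g in vec] for vec in grads]
    return clipped, global_norm
```

For example, gradients `[[3.0, 0.0], [0.0, 4.0]]` have a global norm of 5.0, so with `clip_norm=1.0` every entry is scaled by 0.2; the direction of the update is preserved and only its magnitude shrinks.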

The code is useful, thank you @pdoublerainbow

fanhuanhuan, Nov 24 '19 02:11

Hello guys, I used the Cityscapes code above and trained on Google Colab. This is my first time training on the dataset myself. During training I get:

INFO - root - 2020-06-04 08:33:20.191866: step 260, total loss = 35261836.00, predict loss = 27243408.00 (1.8 examples/sec; 4.503 sec/batch; 697h:22m:09s remains)

Why are both losses so large?

HienTran1997, Jun 04 '20 08:06
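A total loss in the tens of millions could come from many things, from malformed labels to an unstable learning rate, and the thread does not establish which. One cheap check is to scan label maps for values outside the expected set before training. This assumes the pipeline expects 19-class trainIds with 255 as the ignore value, which the repository does not confirm here. `find_invalid_labels` is a hypothetical helper written for this sketch.

```python
from collections import Counter


def find_invalid_labels(label_rows, num_classes=19, ignore_label=255):
    """Count pixel values that are neither a valid class id
    (0 .. num_classes - 1) nor the ignore label.

    `label_rows` is a 2-D label map given as a list of rows. Returns a
    Counter mapping each invalid value to its pixel count; an empty Counter
    means the map only contains expected values.
    """
    bad = Counter()
    for row in label_rows:
        for value in row:
            if value != ignore_label and not 0 <= value < num_classes:
                bad[value] += 1
    return bad
```

Values such as 26 or 33 in the result would be consistent with raw Cityscapes labelIds being fed in where trainIds were expected.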