MTR
Training becomes slower
Hi, when I train MTR with 20% of the Waymo data on 8 RTX 4090 GPUs, the estimated total training time gradually increases from ~6 hours to ~23 hours during the first epoch, while iter_cost rises from ~0.6s to ~2.3s. The log output is shown below. Any idea what causes this? Thanks!
2024-01-13 20:29:41,786 INFO epoch: 0/30, acc_iter=1, cur_iter=0/1218, batch_size=10, iter_cost=14.13s, time_cost(epoch): 00:14/4:46:46, time_cost(all): 00:25/143:23:11, ade_TYPE_VEHICLE_layer_5=16.475, ade_TYPE_PEDESTRIAN_layer_5=3.344, ade_TYPE_CYCLIST_layer_5=10.283, loss=67801.148, lr=0.0001
2024-01-13 20:30:09,304 INFO epoch: 0/30, acc_iter=50, cur_iter=49/1218, batch_size=10, iter_cost=0.83s, time_cost(epoch): 00:41/16:13, time_cost(all): 00:53/8:26:33, ade_TYPE_VEHICLE_layer_5=16.648, ade_TYPE_PEDESTRIAN_layer_5=2.845, ade_TYPE_CYCLIST_layer_5=8.868, loss=3624.866, lr=0.0001
2024-01-13 20:30:37,870 INFO epoch: 0/30, acc_iter=100, cur_iter=99/1218, batch_size=10, iter_cost=0.70s, time_cost(epoch): 01:10/13:05, time_cost(all): 01:21/7:06:25, ade_TYPE_VEHICLE_layer_5=20.148, ade_TYPE_PEDESTRIAN_layer_5=2.516, ade_TYPE_CYCLIST_layer_5=-0.000, loss=1319.947, lr=0.0001
2024-01-13 20:31:05,643 INFO epoch: 0/30, acc_iter=150, cur_iter=149/1218, batch_size=10, iter_cost=0.65s, time_cost(epoch): 01:37/11:38, time_cost(all): 01:49/6:36:11, ade_TYPE_VEHICLE_layer_5=11.571, ade_TYPE_PEDESTRIAN_layer_5=1.302, ade_TYPE_CYCLIST_layer_5=3.274, loss=1064.818, lr=0.0001
2024-01-13 20:31:33,584 INFO epoch: 0/30, acc_iter=200, cur_iter=199/1218, batch_size=10, iter_cost=0.63s, time_cost(epoch): 02:05/10:41, time_cost(all): 02:17/6:21:21, ade_TYPE_VEHICLE_layer_5=12.051, ade_TYPE_PEDESTRIAN_layer_5=1.196, ade_TYPE_CYCLIST_layer_5=-0.000, loss=896.819, lr=0.0001
2024-01-13 20:32:01,143 INFO epoch: 0/30, acc_iter=250, cur_iter=249/1218, batch_size=10, iter_cost=0.61s, time_cost(epoch): 02:33/09:54, time_cost(all): 02:44/6:11:20, ade_TYPE_VEHICLE_layer_5=13.017, ade_TYPE_PEDESTRIAN_layer_5=1.770, ade_TYPE_CYCLIST_layer_5=3.670, loss=769.815, lr=0.0001
2024-01-13 20:34:44,646 INFO Save latest model to /home/swc/Disk/Code/MTR/output/waymo/mtr+20_percent_data/my_first_exp/ckpt/latest_model
2024-01-13 20:35:02,177 INFO epoch: 0/30, acc_iter=300, cur_iter=299/1218, batch_size=10, iter_cost=1.12s, time_cost(epoch): 05:34/17:04, time_cost(all): 05:45/11:13:30, ade_TYPE_VEHICLE_layer_5=20.119, ade_TYPE_PEDESTRIAN_layer_5=1.318, ade_TYPE_CYCLIST_layer_5=7.704, loss=980.100, lr=0.0001
2024-01-13 20:36:56,445 INFO epoch: 0/30, acc_iter=350, cur_iter=349/1218, batch_size=10, iter_cost=1.28s, time_cost(epoch): 07:28/18:34, time_cost(all): 07:40/12:53:25, ade_TYPE_VEHICLE_layer_5=13.891, ade_TYPE_PEDESTRIAN_layer_5=1.182, ade_TYPE_CYCLIST_layer_5=-0.000, loss=764.403, lr=0.0001
2024-01-13 20:40:12,034 INFO Save latest model to /home/swc/Disk/Code/MTR/output/waymo/mtr+20_percent_data/my_first_exp/ckpt/latest_model
2024-01-13 20:40:12,624 INFO epoch: 0/30, acc_iter=400, cur_iter=399/1218, batch_size=10, iter_cost=1.61s, time_cost(epoch): 10:44/22:00, time_cost(all): 10:56/16:11:14, ade_TYPE_VEHICLE_layer_5=11.558, ade_TYPE_PEDESTRIAN_layer_5=1.168, ade_TYPE_CYCLIST_layer_5=8.246, loss=723.942, lr=0.0001
2024-01-13 20:42:28,241 INFO epoch: 0/30, acc_iter=450, cur_iter=449/1218, batch_size=10, iter_cost=1.73s, time_cost(epoch): 13:00/22:13, time_cost(all): 13:11/17:23:24, ade_TYPE_VEHICLE_layer_5=13.028, ade_TYPE_PEDESTRIAN_layer_5=0.897, ade_TYPE_CYCLIST_layer_5=1.655, loss=654.211, lr=0.0001
2024-01-13 20:45:09,249 INFO Save latest model to /home/swc/Disk/Code/MTR/output/waymo/mtr+20_percent_data/my_first_exp/ckpt/latest_model
2024-01-13 20:45:17,802 INFO epoch: 0/30, acc_iter=500, cur_iter=499/1218, batch_size=10, iter_cost=1.90s, time_cost(epoch): 15:50/22:46, time_cost(all): 16:01/19:01:28, ade_TYPE_VEHICLE_layer_5=12.500, ade_TYPE_PEDESTRIAN_layer_5=1.334, ade_TYPE_CYCLIST_layer_5=8.013, loss=635.699, lr=0.0001
2024-01-13 20:48:13,740 INFO epoch: 0/30, acc_iter=550, cur_iter=549/1218, batch_size=10, iter_cost=2.05s, time_cost(epoch): 18:46/22:49, time_cost(all): 18:57/20:28:08, ade_TYPE_VEHICLE_layer_5=11.864, ade_TYPE_PEDESTRIAN_layer_5=0.763, ade_TYPE_CYCLIST_layer_5=6.737, loss=609.126, lr=0.0001
2024-01-13 20:50:12,644 INFO Save latest model to /home/swc/Disk/Code/MTR/output/waymo/mtr+20_percent_data/my_first_exp/ckpt/latest_model
2024-01-13 20:50:28,692 INFO epoch: 0/30, acc_iter=600, cur_iter=599/1218, batch_size=10, iter_cost=2.10s, time_cost(epoch): 21:01/21:40, time_cost(all): 21:12/20:58:57, ade_TYPE_VEHICLE_layer_5=8.360, ade_TYPE_PEDESTRIAN_layer_5=0.767, ade_TYPE_CYCLIST_layer_5=3.163, loss=469.968, lr=0.0001
2024-01-13 20:53:57,198 INFO epoch: 0/30, acc_iter=650, cur_iter=649/1218, batch_size=10, iter_cost=2.26s, time_cost(epoch): 24:29/21:26, time_cost(all): 24:40/22:32:23, ade_TYPE_VEHICLE_layer_5=9.355, ade_TYPE_PEDESTRIAN_layer_5=0.774, ade_TYPE_CYCLIST_layer_5=-0.000, loss=610.168, lr=0.0001
2024-01-13 20:54:53,206 INFO Save latest model to /home/swc/Disk/Code/MTR/output/waymo/mtr+20_percent_data/my_first_exp/ckpt/latest_model
2024-01-13 20:56:39,561 INFO epoch: 0/30, acc_iter=700, cur_iter=699/1218, batch_size=10, iter_cost=2.33s, time_cost(epoch): 27:11/20:09, time_cost(all): 27:23/23:12:35, ade_TYPE_VEHICLE_layer_5=11.248, ade_TYPE_PEDESTRIAN_layer_5=0.852, ade_TYPE_CYCLIST_layer_5=4.962, loss=430.743, lr=0.0001
2024-01-13 20:58:45,682 INFO epoch: 0/30, acc_iter=750, cur_iter=749/1218, batch_size=10, iter_cost=2.34s, time_cost(epoch): 29:18/18:19, time_cost(all): 29:29/23:18:15, ade_TYPE_VEHICLE_layer_5=6.773, ade_TYPE_PEDESTRIAN_layer_5=0.567, ade_TYPE_CYCLIST_layer_5=6.999, loss=384.074, lr=0.0001
2024-01-13 21:00:07,164 INFO Save latest model to /home/swc/Disk/Code/MTR/output/waymo/mtr+20_percent_data/my_first_exp/ckpt/latest_model
2024-01-13 21:00:24,382 INFO epoch: 0/30, acc_iter=800, cur_iter=799/1218, batch_size=10, iter_cost=2.32s, time_cost(epoch): 30:56/16:12, time_cost(all): 31:08/23:02:31, ade_TYPE_VEHICLE_layer_5=9.502, ade_TYPE_PEDESTRIAN_layer_5=0.770, ade_TYPE_CYCLIST_layer_5=0.886, loss=555.664, lr=0.0001
2024-01-13 21:02:19,946 INFO epoch: 0/30, acc_iter=850, cur_iter=849/1218, batch_size=10, iter_cost=2.32s, time_cost(epoch): 32:52/14:16, time_cost(all): 33:03/23:00:15, ade_TYPE_VEHICLE_layer_5=6.325, ade_TYPE_PEDESTRIAN_layer_5=0.464, ade_TYPE_CYCLIST_layer_5=-0.000, loss=489.205, lr=0.0001
2024-01-13 21:03:55,158 INFO epoch: 0/30, acc_iter=900, cur_iter=899/1218, batch_size=10, iter_cost=2.30s, time_cost(epoch): 34:27/12:12, time_cost(all): 34:38/22:44:35, ade_TYPE_VEHICLE_layer_5=6.792, ade_TYPE_PEDESTRIAN_layer_5=0.345, ade_TYPE_CYCLIST_layer_5=2.614, loss=502.375, lr=0.0001
2024-01-13 21:05:36,883 INFO Save latest model to /home/swc/Disk/Code/MTR/output/waymo/mtr+20_percent_data/my_first_exp/ckpt/latest_model
2024-01-13 21:05:58,604 INFO epoch: 0/30, acc_iter=950, cur_iter=949/1218, batch_size=10, iter_cost=2.31s, time_cost(epoch): 36:30/10:20, time_cost(all): 36:42/22:48:02, ade_TYPE_VEHICLE_layer_5=9.518, ade_TYPE_PEDESTRIAN_layer_5=1.167, ade_TYPE_CYCLIST_layer_5=4.977, loss=575.544, lr=0.0001
2024-01-13 21:08:21,160 INFO epoch: 0/30, acc_iter=1000, cur_iter=999/1218, batch_size=10, iter_cost=2.33s, time_cost(epoch): 38:53/08:31, time_cost(all): 39:04/23:02:14, ade_TYPE_VEHICLE_layer_5=6.204, ade_TYPE_PEDESTRIAN_layer_5=0.280, ade_TYPE_CYCLIST_layer_5=2.212, loss=407.410, lr=0.0001
2024-01-13 21:10:18,694 INFO Save latest model to /home/swc/Disk/Code/MTR/output/waymo/mtr+20_percent_data/my_first_exp/ckpt/latest_model
2024-01-13 21:10:31,690 INFO epoch: 0/30, acc_iter=1050, cur_iter=1049/1218, batch_size=10, iter_cost=2.35s, time_cost(epoch): 41:04/06:36, time_cost(all): 41:15/23:08:06, ade_TYPE_VEHICLE_layer_5=7.311, ade_TYPE_PEDESTRIAN_layer_5=0.306, ade_TYPE_CYCLIST_layer_5=-0.000, loss=578.605, lr=0.0001
2024-01-13 21:13:06,837 INFO epoch: 0/30, acc_iter=1100, cur_iter=1099/1218, batch_size=10, iter_cost=2.38s, time_cost(epoch): 43:39/04:43, time_cost(all): 43:50/23:26:27, ade_TYPE_VEHICLE_layer_5=9.303, ade_TYPE_PEDESTRIAN_layer_5=0.281, ade_TYPE_CYCLIST_layer_5=4.509, loss=444.431, lr=0.0001
2024-01-13 21:14:56,918 INFO Save latest model to /home/swc/Disk/Code/MTR/output/waymo/mtr+20_percent_data/my_first_exp/ckpt/latest_model
2024-01-13 21:15:07,665 INFO epoch: 0/30, acc_iter=1150, cur_iter=1149/1218, batch_size=10, iter_cost=2.38s, time_cost(epoch): 45:40/02:44, time_cost(all): 45:51/23:25:23, ade_TYPE_VEHICLE_layer_5=2.822, ade_TYPE_PEDESTRIAN_layer_5=0.356, ade_TYPE_CYCLIST_layer_5=4.349, loss=283.361, lr=0.0001
2024-01-13 21:16:56,916 INFO epoch: 0/30, acc_iter=1200, cur_iter=1199/1218, batch_size=10, iter_cost=2.37s, time_cost(epoch): 47:29/00:45, time_cost(all): 47:40/23:18:32, ade_TYPE_VEHICLE_layer_5=4.185, ade_TYPE_PEDESTRIAN_layer_5=0.564, ade_TYPE_CYCLIST_layer_5=2.659, loss=252.469, lr=0.0001
2024-01-13 21:17:06,637 INFO epoch: 0/30, acc_iter=1218, cur_iter=1217/1218, batch_size=5, iter_cost=2.35s, time_cost(epoch): 47:38/00:02, time_cost(all): 47:50/23:01:52, ade_TYPE_VEHICLE_layer_5=7.351, ade_TYPE_PEDESTRIAN_layer_5=0.358, ade_TYPE_CYCLIST_layer_5=5.175, loss=555.010, lr=0.0001
Update: I found that reducing the batch size per GPU from 10 to 8 brings the estimated training time back down to ~7 hours.
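In case it helps anyone compare runs, here is a minimal sketch for pulling `acc_iter` and `iter_cost` out of log lines in the format shown above and estimating how much a run has slowed down. The helper names (`parse_iter_costs`, `slowdown_ratio`) and the `warmup_iters` cutoff are my own assumptions and not part of MTR; the sample lines are truncated copies of lines from the log above.

```python
import re

# Matches the acc_iter and iter_cost fields of a training log line
# (format assumed from the log output in this post).
LOG_PATTERN = re.compile(r"acc_iter=(\d+),.*?iter_cost=([\d.]+)s")


def parse_iter_costs(lines):
    """Return a list of (acc_iter, iter_cost_seconds) from log lines.

    Lines without both fields (e.g. "Save latest model ...") are skipped.
    """
    results = []
    for line in lines:
        match = LOG_PATTERN.search(line)
        if match:
            results.append((int(match.group(1)), float(match.group(2))))
    return results


def slowdown_ratio(costs, warmup_iters=50):
    """Ratio of the last logged iter_cost to the lowest one after warmup.

    warmup_iters is a hypothetical cutoff that drops the very first,
    unusually slow iterations. Returns None if nothing is left.
    """
    steady = [cost for acc_iter, cost in costs if acc_iter >= warmup_iters]
    if not steady:
        return None
    return steady[-1] / min(steady)


# Truncated copies of lines from the log above.
sample = [
    "2024-01-13 20:29:41,786 INFO epoch: 0/30, acc_iter=1, cur_iter=0/1218, batch_size=10, iter_cost=14.13s",
    "2024-01-13 20:32:01,143 INFO epoch: 0/30, acc_iter=250, cur_iter=249/1218, batch_size=10, iter_cost=0.61s",
    "2024-01-13 20:34:44,646 INFO Save latest model to .../ckpt/latest_model",
    "2024-01-13 21:16:56,916 INFO epoch: 0/30, acc_iter=1200, cur_iter=1199/1218, batch_size=10, iter_cost=2.37s",
]

costs = parse_iter_costs(sample)
ratio = slowdown_ratio(costs)
print(costs)
print(f"slowdown vs. best iter_cost after warmup: {ratio:.2f}x")
```

Running the same script on the log of a run with batch size 8 per GPU would show whether `iter_cost` stays flat there, which would point to a resource limit (for example GPU memory or data loading) rather than the model itself. That interpretation is only a guess on my side.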