FairMOT
NaN issue: training on the CrowdHuman dataset produces NaN loss
I tried changing the lr parameter, but the issue still exists.
@YmanChris You can see that id_loss is NaN, which points to a problem with your data. Please post your data format and labels so that we can find the solution.
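Since the NaN shows up in id_loss, a quick first step is to scan the generated label files for malformed rows. The sketch below is a hypothetical checker, not part of FairMOT. It assumes the 6-column layout used in this thread (class, ID, then center x, center y, width, height normalized to [0, 1]) and flags anything outside that, including NaN values. It treats negative IDs as suspicious, which may be stricter than what the loader actually accepts.

```python
from typing import List


def validate_label_line(line: str) -> List[str]:
    """Return the problems found in one label row (an empty list means it looks OK).

    Assumed row layout: class id cx cy w h, box values normalized to [0, 1].
    """
    fields = line.split()
    if len(fields) != 6:
        return [f"expected 6 fields, got {len(fields)}"]
    problems = []
    cls, tid = fields[0], fields[1]
    if cls != "0":
        problems.append(f"class is {cls!r}, expected '0'")
    try:
        tid_val = float(tid)
    except ValueError:
        return problems + [f"id {tid!r} is not a number"]
    # NaN/inf are not integers, so they are caught here as well.
    if not tid_val.is_integer() or tid_val < 0:
        problems.append(f"id {tid!r} is not a non-negative integer")
    try:
        cx, cy, w, h = (float(v) for v in fields[2:])
    except ValueError:
        return problems + ["box values are not numbers"]
    # Comparisons with NaN are always False, so NaN values are flagged too.
    for name, v in (("cx", cx), ("cy", cy)):
        if not 0.0 <= v <= 1.0:
            problems.append(f"{name}={v} outside [0, 1]")
    for name, v in (("w", w), ("h", h)):
        if not 0.0 < v <= 1.0:
            problems.append(f"{name}={v} not in (0, 1]")
    return problems
```

To scan a whole dataset, call it on every line of every .txt file under the label root and print the file name next to any non-empty result.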
Thank you for your reply. I am using the latest version of the code and the official CrowdHuman dataset. The labels were created by running "python gen_labels_crowd_id.py". batch_size = 1, GPU: 2080 Ti (11 GB). I tried lowering the lr, but the NaN loss still appears. When I switched to an older version of the code, the NaN loss was alleviated.
@ifzhang @zengjie617789
I am also having a similar issue using hrnet_w18 to train on a custom dataset. My data format is frame_id, vehicle_id, bbox_x_center, bbox_y_center, bbox_width, bbox_height. An example from one frame:
0 1 0.503918 0.455146 0.113289 0.078237
0 2 0.919396 0.386570 0.016446 0.013871
0 3 0.951217 0.386777 0.018234 0.016974
0 4 0.818556 0.402218 0.033109 0.019250
0 5 0.961950 0.383810 0.007737 0.009976
0 6 0.563671 0.440237 0.051045 0.032495
0 7 0.884693 0.392502 0.016599 0.012779
0 8 0.947196 0.389307 0.019590 0.020804
0 9 0.894851 0.394488 0.014566 0.012718
My training loss looks like this:
train: [9][4999/5000]|Tot: 1:00:24 |ETA: 0:00:01 |loss nan |hm_loss 4.2721 |wh_loss 64.5495 |off_loss 0.1975 |id_loss nan |Data 0.006s(0.009s) |Net 0.725s
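To see which component goes NaN first, it can help to parse the log lines rather than eyeball them. A minimal sketch, assuming the pipe-separated `|name value` layout in the line above; `parse_loss_fields` and `non_finite` are hypothetical helpers, not FairMOT functions.

```python
import math
from typing import Dict, List


def parse_loss_fields(log_line: str) -> Dict[str, float]:
    """Extract '<name> <value>' pairs whose name contains 'loss'."""
    losses = {}
    for chunk in log_line.split("|"):
        parts = chunk.split()
        if len(parts) == 2 and "loss" in parts[0]:
            try:
                losses[parts[0]] = float(parts[1])  # float("nan") parses fine
            except ValueError:
                pass  # not a numeric value; skip it
    return losses


def non_finite(losses: Dict[str, float]) -> List[str]:
    """Names of loss components that are NaN or infinite, sorted."""
    return sorted(name for name, v in losses.items() if not math.isfinite(v))
```

On the line above this reports `id_loss` and `loss`. The total loss combines the components, so one NaN term is enough to make it NaN; the first log line where `id_loss` turns NaN is the place to start looking.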
I went back to check my label generation function, which is pretty much the same as 'gen_labels_crowd_id.py'. However, tid_curr keeps increasing and never overlaps with any previous frame's tid_curr. I feel this might be a problem. Can anybody confirm my thought, or tell me what tid is?
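One way to check this concern is to count how many IDs appear in more than one frame. For a video dataset where the same vehicle keeps its ID, many IDs should recur; if none do, every box has received a fresh ID. A sketch, assuming the labels are grouped into a hypothetical dict mapping frame index to label rows, with the ID in column 2:

```python
from collections import defaultdict
from typing import Dict, List


def id_recurrence(frames: Dict[int, List[str]]) -> Dict[str, int]:
    """Count distinct IDs and how many of them occur in more than one frame."""
    frames_per_id = defaultdict(set)
    for frame_idx, rows in frames.items():
        for row in rows:
            tid = int(row.split()[1])  # column 2 holds the identity
            frames_per_id[tid].add(frame_idx)
    recurring = sum(1 for seen_in in frames_per_id.values() if len(seen_in) > 1)
    return {"ids": len(frames_per_id), "recurring_ids": recurring}
```

A result with `recurring_ids == 0` on video data means the labels follow the per-box style, which discards the information that the same vehicle appears in several frames.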
I have tried the approaches mentioned in issue #205, but they don't seem to work.
@DioMou As far as I know, the data does not contain a frame_id. I am using CrowdHuman, whose images are independent rather than consecutive video frames.
Here is one of my data labels:
0 200197 0.067130 0.647049 0.177315 0.771875
The first column is the class, and the second column is the data_index.
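To sanity-check a row visually, it can be mapped back to pixel corners. The sketch assumes the columns are class, ID, center x, center y, width, height, with the box values normalized by image width and height; `Label`, `parse_label` and `to_pixel_corners` are hypothetical names.

```python
from typing import NamedTuple, Tuple


class Label(NamedTuple):
    cls: int
    tid: int
    cx: float  # box center x / image width
    cy: float  # box center y / image height
    w: float   # box width / image width
    h: float   # box height / image height


def parse_label(line: str) -> Label:
    c, t, cx, cy, w, h = line.split()
    return Label(int(c), int(t), float(cx), float(cy), float(w), float(h))


def to_pixel_corners(lab: Label, img_w: int, img_h: int) -> Tuple[float, float, float, float]:
    """Convert normalized center/size to (x1, y1, x2, y2) in pixels."""
    x1 = (lab.cx - lab.w / 2) * img_w
    y1 = (lab.cy - lab.h / 2) * img_h
    return (x1, y1, x1 + lab.w * img_w, y1 + lab.h * img_h)
```

For the row above, cx - w/2 = 0.067130 - 0.0886575 ≈ -0.0215, so that box starts slightly left of the image border.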
@zengjie617789 Thanks for your reply, that makes sense to me. I forgot to mention that I also have that column set to 0 for all frames. Would you mind explaining what your data_index is? Is it the ID attached to the bbox? Does it increase by 1 for every row and every frame, or does it simply indicate that a bbox appears in this frame?
@DioMou That number is a running index of bounding boxes over the whole dataset, so every box gets its own ID. In FairMOT, the ID part of tracking is treated as a classification task, which means we construct a linear classifier with nearly 300,000 output nodes. Here is one of my label txt files:
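Because every ID is its own class, the width of the ID classifier follows from the labels. A sketch, assuming 0-based integer IDs and a classifier sized as largest ID + 1 (so unused IDs still take a slot); `emb_dim` is an illustrative parameter, not a value taken from this thread.

```python
from typing import Iterable


def count_identities(label_lines: Iterable[str]) -> int:
    """Number of ID classes implied by the labels: largest ID + 1."""
    max_id = -1
    for line in label_lines:
        fields = line.split()
        if len(fields) >= 2:
            max_id = max(max_id, int(fields[1]))
    return max_id + 1


def classifier_params(emb_dim: int, n_ids: int) -> int:
    """Weights plus biases of a linear layer mapping an emb_dim vector to n_ids logits."""
    return emb_dim * n_ids + n_ids
```

With a 128-dimensional embedding and 300,000 IDs, that single layer already holds 38,700,000 parameters.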
0 229313 0.274286 0.658831 0.144286 0.420584
0 229314 0.332857 0.670902 0.175714 0.429479
0 229315 0.450000 0.657560 0.185714 0.390089
0 229316 0.515000 0.580686 0.117143 0.526048
0 229317 0.558929 0.597840 0.239286 0.484117
0 229318 0.687143 0.642313 0.195714 0.410419
0 229319 0.837857 0.651842 0.215714 0.421855
0 229320 0.868214 0.644219 0.112143 0.396442
0 229321 0.088571 0.426302 0.164286 0.626429
0 229322 0.085714 0.440280 0.135714 0.606099
0 229323 0.168929 0.464422 0.115000 0.540025
0 229324 0.236429 0.447268 0.127143 0.594663
0 229325 0.282143 0.431385 0.105714 0.593393
0 229326 0.353214 0.430114 0.117857 0.545108
0 229327 0.415714 0.440280 0.087143 0.494282
0 229328 0.433929 0.370394 0.145000 0.692503
0 229329 0.498929 0.370394 0.143571 0.684879
0 229330 0.573929 0.425667 0.087857 0.515883
0 229331 0.642500 0.402160 0.076429 0.501906
0 229332 0.733571 0.356417 0.167143 0.669632
0 229333 0.762500 0.459339 0.123571 0.540025
0 229334 0.820000 0.403431 0.107143 0.557814
0 229335 0.865000 0.438374 0.155714 0.526048
0 229336 0.967857 0.425032 0.097143 0.575604
0 229337 0.585357 0.144854 0.072143 0.330368
0 229338 0.906429 0.130241 0.085714 0.349428
0 229339 0.634643 0.655654 0.109286 0.401525
@zengjie617789 How do you generate it?
cd src
python gen_labels_crowd_id.py
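For readers who want to see roughly what such a script does, here is a hypothetical re-implementation sketch, not the repository's gen_labels_crowd_id.py. It assumes the CrowdHuman .odgt layout (one JSON record per line with "ID" and "gtboxes", each box carrying a "tag", a pixel "fbox" of [x, y, w, h] measured from the top-left corner, and an optional "extra" dict) and takes image sizes as an input instead of reading the images. Skipping non-person boxes and boxes marked ignore is an assumption about what should be filtered.

```python
import json
from typing import Dict, Iterable, List, Tuple


def odgt_to_labels(
    odgt_lines: Iterable[str],
    image_sizes: Dict[str, Tuple[int, int]],  # image ID -> (width, height)
    start_id: int = 0,
) -> Tuple[Dict[str, List[str]], int]:
    """Build per-image label rows; return them plus the next unused ID."""
    labels: Dict[str, List[str]] = {}
    tid = start_id
    for line in odgt_lines:
        rec = json.loads(line)
        img_w, img_h = image_sizes[rec["ID"]]
        rows = []
        for box in rec.get("gtboxes", []):
            # Assumed filter: keep only "person" boxes not flagged as ignore.
            if box.get("tag") != "person" or box.get("extra", {}).get("ignore", 0) == 1:
                continue
            x, y, w, h = box["fbox"]
            cx, cy = x + w / 2, y + h / 2  # top-left corner -> center
            rows.append(
                f"0 {tid} {cx / img_w:.6f} {cy / img_h:.6f} {w / img_w:.6f} {h / img_h:.6f}"
            )
            tid += 1  # running counter: every box gets its own identity
        labels[rec["ID"]] = rows
    return labels, tid
```

Passing the returned ID as `start_id` for the next file keeps IDs unique across the whole dataset, which matches the running index described earlier in the thread.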