Training not working anymore
Hi Tom,
first of all, thanks for updating the repo and providing the inference script. However, there now seems to be an issue with the heatmap-based scores during training. I did a clean clone of the repo and launched training as explained in the readme. Looking at the results in TensorBoard after 600 epochs, the confidence maps don't show any local maxima (whereas with the previous version of the repo, the confidence maps correctly showed that the network resolved depth uncertainty with an increasing number of epochs and learned to localize objects). Hyperparameters are left at their defaults (I only set the batch size to 8).
The inference script, using the old model checkpoints, worked for me after adapting the NMS stage. Only one method (bbox_corners) in utils.py was missing.
Do you have any idea how to get training running again? I would appreciate any help on that. Thank you!
Best regards, Chris
Hi Chris,
If you have a look at the compute_loss function in train.py, the loss function that was used before is the binary cross-entropy, whereas in the latest version it is the Huber loss. Note also that total_loss = score_loss in both versions. It may be more suitable to first learn the score only and then fine-tune on the other tasks.
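To make the difference between the two losses concrete, here is a minimal per-pixel sketch in plain Python. It is not the code from train.py: the function names, the Huber threshold delta=1.0, and the example prediction of 0.01 are assumptions chosen for illustration only. It compares how strongly each loss penalizes a missed peak in a confidence map (target 1.0, prediction close to 0).

```python
import math

def bce(pred, target, eps=1e-7):
    """Binary cross-entropy for one pixel; pred is a probability in (0, 1)."""
    p = min(max(pred, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))

def huber(pred, target, delta=1.0):
    """Huber loss for one pixel; delta=1.0 is an assumed threshold."""
    err = abs(pred - target)
    if err <= delta:
        return 0.5 * err * err              # quadratic region
    return delta * (err - 0.5 * delta)      # linear region

# Hypothetical "missed peak": the target heatmap is 1.0 at an object centre,
# but the network predicts almost nothing there.
missed_peak_bce = bce(0.01, 1.0)
missed_peak_huber = huber(0.01, 1.0)

print(f"BCE   penalty for a missed peak: {missed_peak_bce:.3f}")
print(f"Huber penalty for a missed peak: {missed_peak_huber:.3f}")
```

Under these assumptions the binary cross-entropy penalty for the missed peak is several times larger than the Huber penalty, because the Huber loss is bounded by a quadratic on errors in [0, 1] while the cross-entropy grows without bound as the prediction approaches 0. Whether this explains the flat confidence maps in the repo is not confirmed here.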
Hi Tom,
Thanks again for updating the repo and providing the inference script. Like Chris, I did a clean clone of the repo and launched training as explained in the readme (I only set the batch size to 8).
However, there seems to be an issue.
I got these values during training:
==> Training epoch complete
score : 1.9330e+02
position: 1.6398e+07
dimension: 3.2379e+06
angle : 9.0900e+04
total : 1.9727e+07
=== Beginning epoch 100 of 600 ===
This does not seem to be training correctly. Is there an issue with the size of the input image?
I would appreciate any help on that - thanks again!
Best regards, Younghyun
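For anyone comparing their own logs against the values above, the following is a small sketch that parses the reported loss terms and checks how they relate to the reported total. The log string is copied from the post; the regular expression and variable names are illustrative assumptions, not part of the repo.

```python
import re

# Loss values as reported in the post above.
log = ("score : 1.9330e+02 position: 1.6398e+07 dimension: 3.2379e+06 "
       "angle : 9.0900e+04 total : 1.9727e+07")

# Extract "name : value" pairs written in scientific notation.
values = {name: float(num)
          for name, num in re.findall(r"(\w+)\s*:\s*([0-9.]+e[+-]\d+)", log)}

total = values.pop("total")
components_sum = sum(values.values())

# Fraction of the summed loss contributed by each term.
shares = {name: value / components_sum for name, value in values.items()}

print(f"sum of terms: {components_sum:.4e}, reported total: {total:.4e}")
for name, share in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{name:>9}: {share:6.2%}")
```

In this log the reported total matches the sum of the four terms to the printed precision, and the position term accounts for roughly 83% of it, so the score term contributes almost nothing to the total.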
Hi aloukkal,
Yes, I am aware of the changes affecting the loss function and the confidence map representation. The problem I was facing was this: using the new representation and loss computation, my network was not able to gain any certainty about depth at all. Even when I trained on only one single example/image, and even when I tried to learn only the confidence score map of that single example, the network was not able to learn that specific score map (which would result in a correct detection for this single training example).
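The single-example test described above can be sketched in isolation. The toy below is not the repo's network: it replaces the model with one free logit per pixel of a hypothetical 16-pixel, one-dimensional score map and fits it with binary cross-entropy by plain gradient descent. All names, sizes, and the learning rate are assumptions. It only shows what a passing overfit check looks like: the loss drops and the predicted peak lands on the target peak.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def mean_bce(preds, targets, eps=1e-7):
    """Mean binary cross-entropy over all pixels."""
    total = 0.0
    for p, t in zip(preds, targets):
        p = min(max(p, eps), 1.0 - eps)
        total -= t * math.log(p) + (1.0 - t) * math.log(1.0 - p)
    return total / len(preds)

# Hypothetical target: a Gaussian bump centred on pixel 8 of a 16-pixel map.
peak, sigma, size = 8, 1.5, 16
target = [math.exp(-((i - peak) ** 2) / (2 * sigma ** 2)) for i in range(size)]

logits = [0.0] * size   # stand-in for the network output (assumption)
lr = 1.0                # assumed learning rate
history = []

for step in range(500):
    preds = [sigmoid(z) for z in logits]
    history.append(mean_bce(preds, target))
    # Gradient of the summed BCE with respect to each logit is (pred - target).
    logits = [z - lr * (p - t) for z, p, t in zip(logits, preds, target)]

final_preds = [sigmoid(z) for z in logits]
predicted_peak = max(range(size), key=lambda i: final_preds[i])

print(f"loss: {history[0]:.4f} -> {history[-1]:.4f}")
print(f"predicted peak at pixel {predicted_peak}, target peak at pixel {peak}")
```

If a real network with far more capacity cannot reproduce this behaviour on one training image, that points at the loss or target construction rather than at the data volume or the number of epochs.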
Can someone share the last known working version in this repo?
@chris-doe @aloukkal @yhkim8412 @jackkwok Do you have any update on the issues you described here?
I am also getting the same losses with the current version of the repo. How did you fix it?