Training not working anymore

Open chris-doe opened this issue 5 years ago • 6 comments

Hi Tom,

first of all, thanks for updating the repo and providing the inference script. However, there now seems to be an issue with the heatmap-based scores during training. I did a clean clone of the repo and launched training as explained in the readme. Looking at the results in Tensorboard after 600 epochs, the confidence maps don't show any local maxima (whereas with the previous version of the repo, the confidence maps correctly showed that the network resolved depth uncertainty with an increasing number of epochs and learned to localize objects). Hyperparameters were left at their defaults (I only set the batch size to 8).

The inference script, using the old model checkpoints, worked for me after adapting the NMS stage. Only one method (bbox_corners) in utils.py was missing.

Do you have any idea how to get the training running again? I would appreciate any help on that. Thank you!

Best regards, Chris

chris-doe avatar Sep 03 '19 11:09 chris-doe

Hi Chris,

If you have a look at the compute_loss function in train.py, the loss function used before was the binary cross-entropy, whereas the latest version uses the Huber loss. Another thing to notice is that total_loss = score_loss in both versions. Maybe it is more suitable to first learn the score only and then fine-tune on the other tasks.
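As a rough illustration of how the two losses differ on a single heatmap pixel, here is a minimal sketch in plain Python. It is not the repo's code: the helper names `bce` and `huber`, the choice `delta=1.0`, and the assumption that both losses are applied to score values in [0, 1] are mine.

```python
import math

def bce(pred, target, eps=1e-7):
    """Binary cross-entropy for one pixel; pred is clamped to avoid log(0)."""
    pred = min(max(pred, eps), 1.0 - eps)
    return -(target * math.log(pred) + (1.0 - target) * math.log(1.0 - pred))

def huber(pred, target, delta=1.0):
    """Huber loss for one pixel: quadratic near zero, linear beyond delta."""
    r = abs(pred - target)
    if r <= delta:
        return 0.5 * r * r
    return delta * (r - 0.5 * delta)

# A pixel that should be a peak (target = 1.0), predicted with falling confidence.
for pred in (0.5, 0.1, 0.01):
    print(f"pred={pred:<5} bce={bce(pred, 1.0):.4f} huber={huber(pred, 1.0):.4f}")
```

Under these assumptions the residual never exceeds 1, so the Huber term stays in its quadratic branch and is bounded by 0.5 per pixel, while the cross-entropy grows without bound as a missed peak is predicted closer to 0. Whether this is what prevents the peaks from forming here is only a guess.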

aloukkal avatar Oct 12 '19 14:10 aloukkal

Hi Tom,

Thanks again for updating the repo and providing the inference script. Like Chris, I did a clean clone of the repo and launched training as explained in the readme (I only set the batch size to 8).

However, there seems to be an issue.

I got these values during training:

==> Training epoch complete
score    : 1.9330e+02
position : 1.6398e+07
dimension: 3.2379e+06
angle    : 9.0900e+04
total    : 1.9727e+07
=== Beginning epoch 100 of 600 ===

This does not seem to be training correctly. Is there any issue with the SIZE of the INPUT IMAGE?
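To put these numbers in perspective, here is a small sketch in plain Python. The values are copied from the log above; the variable names are made up for illustration. It checks whether the logged total matches the sum of the four terms and how much each term contributes.

```python
# Loss terms as printed in the training log above.
losses = {
    "score": 1.9330e+02,
    "position": 1.6398e+07,
    "dimension": 3.2379e+06,
    "angle": 9.0900e+04,
}
logged_total = 1.9727e+07

# Compare the sum of the individual terms with the logged total.
summed = sum(losses.values())
rel_diff = abs(summed - logged_total) / logged_total

# Fraction of the sum contributed by each term.
shares = {name: value / summed for name, value in losses.items()}

print(f"sum of terms: {summed:.4e} (logged total: {logged_total:.4e})")
for name, share in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{name:<9}: {share:7.3%} of the sum")
```

The sum agrees with the logged total to the printed precision, and the position term alone accounts for roughly 83% of it, so the logged total is dominated by the regression terms rather than by the score term.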

I would appreciate any help on that - thanks again!

Best regards, Younghyun

yhkim8412 avatar Oct 16 '19 06:10 yhkim8412

Hi aloukkal,

Yes, I am aware of the changes affecting the loss function and confidence map representation. The problem I was facing was that, using the new representation and loss computation, my network was not able to gain any certainty about depth at all. Even when I trained on only a single example/image, and even when I tried to learn only the confidence score map of that single example, the network was not able to learn that specific score map (which would have resulted in a correct detection for this single training example).
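For reference, the single-example check described above can be sketched as follows. This is a toy in plain Python, not the repo's network: it fits one free logit per pixel to a hypothetical 1-D score map with a Gaussian peak, using binary cross-entropy and plain gradient descent. The function name `overfit_single_map`, the map size, and the learning rate are all assumptions.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bce(pred, target, eps=1e-7):
    """Binary cross-entropy for one pixel; pred is clamped to avoid log(0)."""
    pred = min(max(pred, eps), 1.0 - eps)
    return -(target * math.log(pred) + (1.0 - target) * math.log(1.0 - pred))

def overfit_single_map(target, steps=500, lr=1.0):
    """Fit one logit per pixel to a single target map by gradient descent."""
    logits = [0.0] * len(target)
    history = []  # mean BCE before each update
    for _ in range(steps):
        preds = [sigmoid(z) for z in logits]
        history.append(sum(bce(p, t) for p, t in zip(preds, target)) / len(target))
        # Gradient of the summed BCE w.r.t. each logit is sigmoid(logit) - target.
        logits = [z - lr * (p - t) for z, p, t in zip(logits, preds, target)]
    return [sigmoid(z) for z in logits], history

# Hypothetical 1-D confidence map with a Gaussian peak at index 6.
target = [math.exp(-0.5 * (i - 6) ** 2) for i in range(12)]
preds, history = overfit_single_map(target)
peak = max(range(len(preds)), key=preds.__getitem__)
print(f"loss {history[0]:.4f} -> {history[-1]:.4f}, predicted peak at index {peak}")
```

In this toy setting the loss drops and the peak is recovered, which is the behaviour one would expect from any setup that can memorize a single sample. A real network failing the same check suggests a problem in the loss or target encoding rather than in the amount of data.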

chris-doe avatar Oct 16 '19 12:10 chris-doe

Can someone share the last known working version of this repo?

jackkwok avatar Nov 28 '19 00:11 jackkwok

@chris-doe @aloukkal @yhkim8412 @jackkwok Do you have any update on the issues you described here?

IAMShashankk avatar May 13 '22 08:05 IAMShashankk

@yhkim8412 I am getting the same losses on the current version of the repo. How did you fix it?

IAMShashankk avatar May 13 '22 08:05 IAMShashankk