deep-anpr
The performance during training is always the same
Dear all, I am trying to train the model on Windows 10 (CPU). The problem I am finding is that the performance doesn't change at all, even though the cost changes a little. If I rerun the training, the performance values change but then again remain constant. Here is a snippet:
B7860 64.00% 64.00% loss: 10794.2373046875 (digits: 1051.9578857421875, presence: 9742.279296875) | X X XX X X XXX X X XX X X X XX | time for 60 batches 324.8394412994385
PV73LEX 0.0 <-> QM69OTK 0.0
KZ48OUS 1.0 <-> QM69OTK 0.0
XF10UGX 0.0 <-> QM69OTK 0.0
HP51SYY 0.0 <-> QM69OTK 0.0
MQ82HOD 0.0 <-> QM69OTK 0.0
YF62RYQ 0.0 <-> QM69OTK 0.0
LE19HIO 0.0 <-> QM69OTK 0.0
XG44DHU 1.0 <-> QM69OTK 0.0
WM08RYQ 0.0 <-> QM69OTK 0.0
TZ23KIA 0.0 <-> QM69OTK 0.0
FB39LOJ 1.0 <-> QM69OTW 0.0
CP55DID 1.0 <-> QM69OTK 0.0
PN26VBI 0.0 <-> QM69OTK 0.0
FO65FUI 0.0 <-> QM69OTK 0.0
OP09YVZ 1.0 <-> QM69OTK 0.0
SK87TTT 0.0 <-> QM69OTK 0.0
EE78HSB 0.0 <-> QM69OTK 0.0
NM15DHP 1.0 <-> QM69OTK 0.0
WY52RKZ 0.0 <-> QM69OTK 0.0
AE21YYQ 0.0 <-> QM39OTK 0.0
AT37NOB 0.0 <-> QM69OTK 0.0
DD97XRW 0.0 <-> QM69OTK 0.0
DV44XSO 0.0 <-> QM69OTK 0.0
EX56ARF 1.0 <-> QM69OTK 0.0
RN63AOR 1.0 <-> QM69OTK 0.0
SQ19HKQ 1.0 <-> QM69OTK 0.0
QL68VPS 0.0 <-> QM69OTK 0.0
UJ87YEA 0.0 <-> QM69OTK 0.0
VN48ULX 1.0 <-> QM69OTK 0.0
DG23BSJ 0.0 <-> QM69OTK 0.0
GD77UFQ 0.0 <-> QM69OTK 0.0
RN27AOA 0.0 <-> QM69OTK 0.0
QX18QPV 0.0 <-> QM69OTK 0.0
KQ35RDE 1.0 <-> QM69OTK 0.0
IF80QMX 0.0 <-> QM69OTK 0.0
CE21AVV 1.0 <-> QM69OTK 0.0
UB26TQZ 1.0 <-> QM69OTK 0.0
EI30JGL 0.0 <-> QM69OTK 0.0
OU28NEY 1.0 <-> QM69OTK 0.0
MN01XZT 0.0 <-> QM69OTK 0.0
WK15APF 0.0 <-> QM69OTK 0.0
SS66HYB 1.0 <-> QM69OTK 0.0
NW44SQL 0.0 <-> QM69OTK 0.0
XI75LCF 0.0 <-> QM69OTK 0.0
IQ93XRG 0.0 <-> QM69OTK 0.0
NJ17XKK 1.0 <-> QM69OTK 0.0
MV55MGF 0.0 <-> QM69OTK 0.0
DK30EQB 1.0 <-> QM69OTK 0.0
WO74RMB 1.0 <-> QM69OTK 0.0
HV08HRX 0.0 <-> QM69OTK 0.0
B7880 64.00% 64.00% loss: 10789.783203125 (digits: 1051.4071044921875, presence: 9738.3759765625) | X X XX X X XXX X X XX X X X XX | time for 60 batches 319.3657536506653
Has anyone had a similar issue?
Thank you in advance. Best
How long did you train to get 64% accuracy?
Sir,
I have been training the model for more than 24 hours and the performance did not change. It started at 64% and remained there.
After 6 to 7 hours of training, my correct rate was 0%. It started at 0 and remained 0 after that much training. I'm training on a GTX 1060 GPU. Any suggestions?
I have the same issue. Were you able to fix it?
Decrease your learning rate.
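For reference, the learning rate is set where train() is invoked at the bottom of train.py. A minimal sketch, assuming the call looks roughly like this (the exact values are illustrative, not a recommendation from the repo):

train(learn_rate=0.0001,   # lowered from the default 0.001; try smaller values if the loss plateaus
      report_steps=20,
      batch_size=50,
      initial_weights=initial_weights)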
How did you guys change the batch_size? It seems to take only the first 50 images?!
@Abduoit Around line 265 of train.py there is a parameter to the train method called batch_size. But it's not taking only the first 50 images: that is the batch size, and a different batch is taken for training at each step. What stays the same is the batch for testing; around line 232 it takes 50 images from the dataset to test. A sketch of the distinction follows.
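A rough sketch of the two different 50s, with helper names like read_data and read_batches taken from my reading of train.py (they may differ slightly in the file):

# Test set: a fixed slice of 50 images, read once before training starts.
test_xs, test_ys = unzip(list(read_data("test/*.png"))[:50])

# Training: a different batch of `batch_size` images is drawn on every step.
for batch_idx, (batch_xs, batch_ys) in enumerate(read_batches(batch_size)):
    session.run(train_step, feed_dict={x: batch_xs, y_: batch_ys})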
Thanks @WaGjUb. Do you mean the first 50 images that we see in the terminal are for testing, not for training? Does this affect the training process?
I found this line in train.py:
test_xs, test_ys = unzip(list(read_data("test/*.png"))[:50])
I changed it to this:
test_xs, test_ys = unzip(list(read_data("test/*.png"))[:batch_size])
and I left the line at the end the same, like this:
batch_size=50,
But I don't think this is correct. Any suggestions, please? Should I leave it as it is?
@Abduoit
Do you mean the first 50 images that we see in the terminal are for testing, not for training? Does this affect the training process?
Yes! I think so, because the training tries to minimize the loss, as you can see around line 175: train_step = tf.train.AdamOptimizer(learn_rate).minimize(loss). It tries to minimize the loss, and the loss is calculated from the given prediction results on the test set.
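For context, a simplified sketch of how I read the loop in train.py (variable names are approximate; report_ops stands in for whatever loss/accuracy tensors the report evaluates): train_step updates the weights against each training batch, and the fixed 50-image test slice is only fed in for the periodic report.

train_step = tf.train.AdamOptimizer(learn_rate).minimize(loss)

for batch_idx, (batch_xs, batch_ys) in enumerate(read_batches(batch_size)):
    # Each step minimizes the loss on the current training batch.
    session.run(train_step, feed_dict={x: batch_xs, y_: batch_ys})
    if batch_idx % report_steps == 0:
        # The 50 test images appear only here, in the printed progress report.
        session.run(report_ops, feed_dict={x: test_xs, y_: test_ys})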
But I don't think this is correct. Any suggestions, please? Should I leave it as it is?
I don't think you need to do it, but it will work as well. You just made your test size the same as the training batch size.
I think there is an error with the get_loss function:
def get_loss(y, y_):
    # Calculate the loss from digits being incorrect.  Don't count loss from
    # digits that are in non-present plates.
    digits_loss = tf.nn.softmax_cross_entropy_with_logits(
        tf.reshape(y[:, 1:], [-1, len(common.CHARS)]),
        tf.reshape(y_[:, 1:], [-1, len(common.CHARS)]))
If I understand right, "y" holds the predictions and "y_" the labels, so when calling tf.nn.softmax_cross_entropy_with_logits the parameter order should be:
tf.nn.softmax_cross_entropy_with_logits(_sentinel=None, labels=None, logits=None, dim=-1, name=None)
So y_ should go in the first position and then y, and they cannot be reversed, since by definition there is a log that affects one of the terms and thus the function is not commutative:
https://stackoverflow.com/questions/36078411/tensorflow-are-my-logits-in-the-right-format-for-cross-entropy-function
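A quick standalone NumPy check (not code from the repo) makes the asymmetry concrete:

import numpy as np

def softmax_xent(labels, logits):
    # What the TF op computes: -sum(labels * log(softmax(logits))).
    p = np.exp(logits - logits.max())
    p /= p.sum()
    return -(labels * np.log(p)).sum()

labels = np.array([1.0, 0.0, 0.0])
logits = np.array([2.0, 1.0, 0.0])
print(softmax_xent(labels, logits))  # ~0.408
print(softmax_xent(logits, labels))  # ~2.65 -- swapping the arguments changes the result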
So, get_loss has a bug and the order should be reversed if I am not wrong.
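If that reading is right, a safe fix is to pass the two tensors by keyword, which makes the order explicit (in TF 1.x the _sentinel argument in fact forces keyword usage, so positional calls raise an error). A sketch of the corrected line, assuming y_ holds the one-hot labels and y the raw network outputs as discussed above:

digits_loss = tf.nn.softmax_cross_entropy_with_logits(
    labels=tf.reshape(y_[:, 1:], [-1, len(common.CHARS)]),
    logits=tf.reshape(y[:, 1:], [-1, len(common.CHARS)]))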
Let me know if there is a mistake in my reasoning.