
The performance during training is always the same

Open KassemKallas opened this issue 6 years ago • 10 comments

Dear all, I am trying to train the model on Windows 10 (CPU). The problem I am finding is that the performance doesn't change at all, even though the cost changes a little. If I rerun the training, the performance values change but then again remain constant. Here is a snippet:

```
B7860 64.00% 64.00% loss: 10794.2373046875 (digits: 1051.9578857421875, presence: 9742.279296875) | X X XX X X XXX X X XX X X X XX | time for 60 batches 324.8394412994385
PV73LEX 0.0 <-> QM69OTK 0.0
KZ48OUS 1.0 <-> QM69OTK 0.0
XF10UGX 0.0 <-> QM69OTK 0.0
HP51SYY 0.0 <-> QM69OTK 0.0
MQ82HOD 0.0 <-> QM69OTK 0.0
YF62RYQ 0.0 <-> QM69OTK 0.0
LE19HIO 0.0 <-> QM69OTK 0.0
XG44DHU 1.0 <-> QM69OTK 0.0
WM08RYQ 0.0 <-> QM69OTK 0.0
TZ23KIA 0.0 <-> QM69OTK 0.0
FB39LOJ 1.0 <-> QM69OTW 0.0
CP55DID 1.0 <-> QM69OTK 0.0
PN26VBI 0.0 <-> QM69OTK 0.0
FO65FUI 0.0 <-> QM69OTK 0.0
OP09YVZ 1.0 <-> QM69OTK 0.0
SK87TTT 0.0 <-> QM69OTK 0.0
EE78HSB 0.0 <-> QM69OTK 0.0
NM15DHP 1.0 <-> QM69OTK 0.0
WY52RKZ 0.0 <-> QM69OTK 0.0
AE21YYQ 0.0 <-> QM39OTK 0.0
AT37NOB 0.0 <-> QM69OTK 0.0
DD97XRW 0.0 <-> QM69OTK 0.0
DV44XSO 0.0 <-> QM69OTK 0.0
EX56ARF 1.0 <-> QM69OTK 0.0
RN63AOR 1.0 <-> QM69OTK 0.0
SQ19HKQ 1.0 <-> QM69OTK 0.0
QL68VPS 0.0 <-> QM69OTK 0.0
UJ87YEA 0.0 <-> QM69OTK 0.0
VN48ULX 1.0 <-> QM69OTK 0.0
DG23BSJ 0.0 <-> QM69OTK 0.0
GD77UFQ 0.0 <-> QM69OTK 0.0
RN27AOA 0.0 <-> QM69OTK 0.0
QX18QPV 0.0 <-> QM69OTK 0.0
KQ35RDE 1.0 <-> QM69OTK 0.0
IF80QMX 0.0 <-> QM69OTK 0.0
CE21AVV 1.0 <-> QM69OTK 0.0
UB26TQZ 1.0 <-> QM69OTK 0.0
EI30JGL 0.0 <-> QM69OTK 0.0
OU28NEY 1.0 <-> QM69OTK 0.0
MN01XZT 0.0 <-> QM69OTK 0.0
WK15APF 0.0 <-> QM69OTK 0.0
SS66HYB 1.0 <-> QM69OTK 0.0
NW44SQL 0.0 <-> QM69OTK 0.0
XI75LCF 0.0 <-> QM69OTK 0.0
IQ93XRG 0.0 <-> QM69OTK 0.0
NJ17XKK 1.0 <-> QM69OTK 0.0
MV55MGF 0.0 <-> QM69OTK 0.0
DK30EQB 1.0 <-> QM69OTK 0.0
WO74RMB 1.0 <-> QM69OTK 0.0
HV08HRX 0.0 <-> QM69OTK 0.0
B7880 64.00% 64.00% loss: 10789.783203125 (digits: 1051.4071044921875, presence: 9738.3759765625) | X X XX X X XXX X X XX X X X XX | time for 60 batches 319.3657536506653
```

Has anyone had a similar issue?

Thank you in advance, Best

KassemKallas avatar May 11 '18 07:05 KassemKallas

How long did you train to get 64% accuracy?

muneeb991 avatar May 21 '18 07:05 muneeb991

Sir,

I have been training the model for more than 24 hours and the performance did not change. It started at 64% and remained there.

KassemKallas avatar May 29 '18 07:05 KassemKallas

After 6 to 7 hours of training, my correct rate was 0%: it started at 0 and remained 0 after all that training. I'm training on a GTX 1060 GPU. Any suggestions?

muneeb991 avatar May 29 '18 10:05 muneeb991

I have the same issue. Could you fix it?

Cazador6 avatar Jul 01 '18 21:07 Cazador6

Decrease your learning rate.
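To see why this helps when training stalls: with a learning rate that is too large, each update overshoots the minimum and the loss oscillates or diverges instead of decreasing. A minimal, self-contained sketch (plain Python on a hypothetical quadratic loss, not the repo's code) of the effect; in train.py itself, the rate is the `learn_rate` value handed to the optimizer:

```python
def gradient_descent(learn_rate, steps=100):
    # Minimize f(x) = x**2 starting from x = 10; the gradient is 2*x.
    x = 10.0
    for _ in range(steps):
        x -= learn_rate * 2 * x
    return x

good = gradient_descent(0.01)  # small rate: x shrinks toward the minimum at 0
bad = gradient_descent(1.1)    # over-large rate: |x| grows every step
```

With `learn_rate=0.01` the iterate ends up close to 0, while `learn_rate=1.1` flips the sign each step and blows up. The same qualitative behaviour is why lowering the rate can unstick a loss that immediately plateaus or explodes.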

WaGjUb avatar Jan 11 '19 11:01 WaGjUb

How did you guys change the batch_size? It takes only the first 50 images!

Abduoit avatar Jan 28 '19 21:01 Abduoit

@Abduoit Around line 265 of train.py there is a parameter to the train method called batch_size. But it's not taking only the first 50 images: that is the batch size, and a different batch is drawn for each training step. What does stay the same is the test batch: around line 232, 50 images are taken from the dataset for testing.

WaGjUb avatar Feb 01 '19 12:02 WaGjUb

Thanks @WaGjUb. Do you mean the first 50 images that we see in the terminal are for testing, not for training? Does this affect the training process?

I found this line in train.py:

```python
test_xs, test_ys = unzip(list(read_data("test/*.png"))[:50])
```

I changed it to this:

```python
test_xs, test_ys = unzip(list(read_data("test/*.png"))[:batch_size])
```

and I left the line at the end the same, like this:

```python
batch_size=50,
```

But I don't think this is correct. Any suggestions, please? Should I leave it as it is?

Abduoit avatar Feb 01 '19 16:02 Abduoit

@Abduoit

> Do you mean the first 50 images that we see in the terminal are for testing, not for training? Does this affect the training process?

Yes! I think so. Training tries to minimize the loss, as you can see around line 175: `train_step = tf.train.AdamOptimizer(learn_rate).minimize(loss)`. The loss being minimized is computed from the network's predictions on each training batch; the 50 test images are only used for the accuracy that gets printed.

> But I don't think this is correct. Any suggestions, please? Should I leave it as it is?

I don't think you have to do that, but it will work as well. You just made your test set the same size as your training batch.
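To make the split concrete, here is a hedged sketch (plain Python with stand-in data, not the repo's code) of the pattern described above: a fixed test slice taken once with `[:50]`, while training draws a fresh batch every step:

```python
import random

data = list(range(200))  # stand-in for the decoded images
test_set = data[:50]     # fixed: the first 50 items, taken exactly once

def training_batches(items, batch_size, steps, seed=0):
    # Each step yields a *different* random batch drawn from the full dataset.
    rng = random.Random(seed)
    for _ in range(steps):
        yield rng.sample(items, batch_size)

batches = list(training_batches(data, batch_size=50, steps=3))
```

Every training batch has 50 items but their contents differ from step to step, whereas `test_set` never changes; so changing the `[:50]` slice only resizes the evaluation set, not what the model trains on.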

WaGjUb avatar Feb 02 '19 00:02 WaGjUb

I think there is an error with the get_loss function:

```python
def get_loss(y, y_):
    # Calculate the loss from digits being incorrect.  Don't count loss from
    # digits that are in non-present plates.
    digits_loss = tf.nn.softmax_cross_entropy_with_logits(
        tf.reshape(y[:, 1:], [-1, len(common.CHARS)]),
        tf.reshape(y_[:, 1:], [-1, len(common.CHARS)]))
```

If I understand right, "y" holds the predictions (logits) and "y_" the labels, so when calling tf.nn.softmax_cross_entropy_with_logits the parameter order should be:

```python
tf.nn.softmax_cross_entropy_with_logits(_sentinel=None, labels=None, logits=None, dim=-1, name=None)
```

So y_ should go in the first position and then y, and the order cannot be reversed, since by definition there is a log that is applied to only one of the terms, so the function is not commutative:

https://stackoverflow.com/questions/36078411/tensorflow-are-my-logits-in-the-right-format-for-cross-entropy-function

So get_loss has a bug, and the argument order should be reversed, if I am not wrong.
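The non-commutativity is easy to check numerically. A minimal sketch in plain NumPy (not TensorFlow, and not the repo's code), with a hypothetical logits/one-hot pair:

```python
import numpy as np

def softmax_xent(logits, labels):
    # Softmax over `logits`, then cross-entropy against `labels`.
    e = np.exp(logits - logits.max())
    p = e / e.sum()
    return -np.sum(labels * np.log(p))

logits = np.array([2.0, 1.0, 0.1])  # raw network outputs
labels = np.array([1.0, 0.0, 0.0])  # one-hot ground truth

a = softmax_xent(logits, labels)  # correct order
b = softmax_xent(labels, logits)  # arguments swapped
```

The two calls give clearly different values, since the log is applied only to the softmax of the first argument; so which tensor goes in the logits slot really does matter.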

Let me know if there is a mistake in my reasoning.

mazcallu avatar Feb 02 '19 14:02 mazcallu