
Why is my accuracy always 0?

Open MingChaoXu opened this issue 6 years ago • 49 comments

My accuracy is always 0. I have looked at the feature map that the CRNN net outputs, and after taking the argmax I find that almost all label indices are 0, which means I get a "blank" label. Can anybody explain this to me?

MingChaoXu avatar Jan 18 '18 02:01 MingChaoXu

image

MingChaoXu avatar Jan 18 '18 02:01 MingChaoXu

The ground truth is randomly generated, so it looks strange. The prediction is always 3 or 4 or some other small number, but I don't know why.

MingChaoXu avatar Jan 18 '18 02:01 MingChaoXu

It seems your dataset or encoder/decoder is wrong.

meijieru avatar Jan 18 '18 07:01 meijieru

@MingChaoXu Your loss value is still very high. With a loss value like that, the accuracy should be 0. I think your learning rate is too large. By default the learning rate is 0.01, so you should change it to a smaller value, something like 0.001 or even 0.0001.

For example, in my case I set lr to 0.001 and after 6 epochs it converged.

image

Hope that helps.
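For reference, a minimal way to pass the lower learning rate with this repo's training script; the lmdb paths here are just placeholders:

python crnn_main.py --trainroot path/to/train.lmdb --valroot path/to/val.lmdb --lr 0.001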

gachiemchiep avatar Feb 02 '18 06:02 gachiemchiep

@gachiemchiep Thank you for your answer. I have another question: every prediction in a batch is the same, but when I print the feature vectors they are not all the same. Do you know why?

MingChaoXu avatar Feb 03 '18 01:02 MingChaoXu

@MingChaoXu Would you mind adding more detail? At least some training log and how you extract the feature.

gachiemchiep avatar Feb 03 '18 03:02 gachiemchiep

image

@gachiemchiep I don't understand what you mean by how to extract the feature. I just extract the feature with the CRNN.

MingChaoXu avatar Feb 03 '18 03:02 MingChaoXu

@MingChaoXu So in your case the "feature" means the "f-f----------------f---" string and the result is "fff".

Actually the "f-f--------f--" (26 characters) is not the feature; it is the raw per-timestep prediction produced by the CRNN, and the "fff" is the simplified result produced by CTC decoding (collapse repeated symbols, then drop blanks).
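As a rough sketch of that collapsing step (not the repo's exact decode function, just the idea), assuming preds is the per-timestep argmax index sequence and index 0 is the CTC blank:

def ctc_greedy_decode(preds, alphabet, blank=0):
    """Collapse repeated indices and drop blanks, e.g. 'f-f----f--' -> 'fff'."""
    chars = []
    prev = blank
    for idx in preds:
        # emit a character only when it differs from the previous index
        # and is not the blank symbol
        if idx != blank and idx != prev:
            chars.append(alphabet[idx - 1])  # assumes index 1 maps to alphabet[0]
        prev = idx
    return ''.join(chars)

# e.g. ctc_greedy_decode([6, 6, 0, 6, 0, 0, 6], "abcdef") returns "fff"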

Your training data's labels or the images themselves are definitely wrong. Could you post some samples of your training images and labels, and the code snippet you used to create the training data?

gachiemchiep avatar Feb 03 '18 04:02 gachiemchiep

@MingChaoXu Have you solved this problem?

Oya00Oya avatar Mar 20 '18 06:03 Oya00Oya

@MingChaoXu I don't think this problem is caused by a wrongly built dataset. When I trained this CRNN from scratch on my custom dataset, I got the same problem, but when I fine-tuned the pretrained model on my custom dataset, the net performed well.

Oya00Oya avatar Mar 21 '18 08:03 Oya00Oya

@MingChaoXu How did you solve this problem? I am running into the same issue, and I only train on English.

ccnankai avatar Apr 18 '18 05:04 ccnankai

@gachiemchiep How did you generate the lmdb? My training loss also behaves like this.

ccnankai avatar Apr 18 '18 05:04 ccnankai

@ccnankai

Hello, I am not Chinese, so I guess you want to create the train and test lmdb files. I used the tool available at https://github.com/bgshih/crnn/blob/master/tool/create_dataset.py .

In summary, the script takes a list of "image_path" and "label" pairs and writes them into an lmdb file. Quite straightforward.
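Roughly, calling it looks like the sketch below; the paths and labels are hypothetical, and the only thing that matters is that imagePathList[i] corresponds to labelList[i]:

# createDataset is defined in tool/create_dataset.py of the bgshih/crnn repo
# (assuming that script is importable from your working directory)
from create_dataset import createDataset

imagePathList = ['data/images/word_0001.jpg', 'data/images/word_0002.jpg']
labelList = ['hello', 'world']

createDataset('data/train.lmdb', imagePathList, labelList, lexiconList=None, checkValid=True)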

gachiemchiep avatar Apr 19 '18 00:04 gachiemchiep

I have fine-tuned the model with my own dataset and get the same result, such as ifffffffffffffffffffffffff => if. Additionally, my lmdb train data is generated by https://github.com/bgshih/crnn/blob/master/tool/create_dataset.py . Can anyone help?

maichm avatar May 30 '18 14:05 maichm

@maichm

Can you post some of your train data?

gachiemchiep avatar May 31 '18 02:05 gachiemchiep

Here is my train data, which is cropped from the ICDAR2003 dataset downloaded from http://www.iapr-tc11.org/dataset/ICDAR2003_RobustReading/TrialTrain/scene.zip image

And I make train.lmdb with the following code: image

@gachiemchiep

maichm avatar May 31 '18 02:05 maichm

@maichm

Your images and labels (imgpath.split("_")[-2]) look correct, so your dataset should be fine. Could you post your training log and training parameters too?

gachiemchiep avatar May 31 '18 02:05 gachiemchiep

Here is my train command. For testing I only train for 2 epochs:

python crnn_main.py --trainroot /home/maichm/Projects/PyTorch2/Projects/crnn.pytorch/data/scene/SceneTrialTrain/train.lmdb --valroot /home/maichm/Projects/PyTorch2/Projects/crnn.pytorch/data/scene/SceneTrialTrain/train.lmdb --random_sample --niter 2 --crnn "data/crnn.pth" --saveInterval 5 --displayInterval 1

But my train loss looks strange: image I don't know why. @gachiemchiep

maichm avatar May 31 '18 02:05 maichm

Furthermore, do I need to do some preprocessing on my data? @meijieru says: For training with variable length, please sort the images according to the text length. But I don't understand what that means. If I sort the images by label length, won't the train command's --random_sample parameter shuffle them again? @gachiemchiep

maichm avatar May 31 '18 02:05 maichm

@maichm

Your loss is too high, which means the model hasn't converged yet.

parser.add_argument('--lr', type=float, default=0.01, help='learning rate for Critic, default=0.00005')

The default lr is very high; reduce it to 0.001 or 0.00005. It should converge.

gachiemchiep avatar May 31 '18 02:05 gachiemchiep

This solved my problem, thank you very much! It had me stuck all day yesterday, and I had forgotten that the learning rate is essential in deep learning. Anyway, thank you once more! @gachiemchiep

maichm avatar May 31 '18 02:05 maichm

@maichm

You're welcome man.

gachiemchiep avatar May 31 '18 02:05 gachiemchiep

@gachiemchiep Hello sir, I have a question for you. I modified my key.py because I want to use a pre-trained model, but when I run it I get an error: RuntimeError: While copying the parameter named rnn.1.embedding.weight, whose dimensions in the model are torch.Size([12, 512]) and whose dimensions in the checkpoint are torch.Size([5530, 512]). I just want to modify the last layer. How can I solve this? Thanks.

jjccyy avatar Jun 11 '18 10:06 jjccyy

@jjccyy

In the case of ResNet18, I used the following code to re-train part of the network:

import torch.nn as nn
from torchvision import models

# create model
print("=> using pre-trained model '{}'".format("Resnet18"))
model = models.resnet18(pretrained=True)

# Freeze the entire network
for param in model.parameters():
    param.requires_grad = False

# Re-train layer4
for param in model.layer4.parameters():
    param.requires_grad = True

# Re-train layer3
for param in model.layer3.parameters():
    param.requires_grad = True

# Replace and re-train the fully connected layer
# (classes is the list of class names for the new task)
num_ftrs = model.fc.in_features
model.fc = nn.Linear(num_ftrs, len(classes))

CRNN should be the same. Hope that helps.
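For the CRNN itself, one common workaround (a sketch, not the repo's own code) is to copy from the checkpoint only the parameters whose shapes match the newly built model, so the resized last layer keeps its freshly initialized weights; here crnn is assumed to be the freshly constructed model with the new, smaller alphabet:

import torch

checkpoint = torch.load('data/crnn.pth', map_location='cpu')
model_dict = crnn.state_dict()

# keep only parameters whose name and shape match the new model
compatible = {k: v for k, v in checkpoint.items()
              if k in model_dict and v.size() == model_dict[k].size()}
model_dict.update(compatible)
crnn.load_state_dict(model_dict)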

gachiemchiep avatar Jun 13 '18 13:06 gachiemchiep

What about when both the loss and the accuracy are 0?

Aurora11111 avatar Sep 06 '18 04:09 Aurora11111

@Aurora11111 I need more detail than that. Please give me something like the following:

  1. some of your training data + testing data.
  2. Your training log file

gachiemchiep avatar Sep 07 '18 00:09 gachiemchiep

@Aurora11111

Can you upload the image here?

gachiemchiep avatar Sep 07 '18 06:09 gachiemchiep

@Aurora11111

Your loss is too high, which means the model hasn't converged yet.

parser.add_argument('--lr', type=float, default=0.01, help='learning rate for Critic, default=0.00005')

The default lr is very high; reduce it to 0.001 or 0.00005. It should converge. Too small an lr is also a very bad thing: the network will actually NOT LEARN ANYTHING.

gachiemchiep avatar Sep 07 '18 11:09 gachiemchiep

@Aurora11111

If setting the learning rate still causes problems, I recommend switching to Adadelta. The learning rate is adapted automatically, so it removes a lot of pain:

elif opt.adadelta:
    optimizer = optim.Adadelta(crnn.parameters())
else:
    optimizer = optim.RMSprop(crnn.parameters(), lr=opt.lr)

gachiemchiep avatar Sep 07 '18 11:09 gachiemchiep

@Aurora11111

You can see in this comment: https://github.com/meijieru/crnn.pytorch/issues/92#issuecomment-362497130 that it took about 6 epochs. By switching to opt.adadelta you don't need to define the lr, so it did help a little bit. But if you have tried several values and nothing worked, then maybe the problem is your data. Could you try the two things below?

  1. Use the same data for training and validating.
  2. Read back from the lmdb you trained on and check whether the images are correct (see the sketch after this list).
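For step 2, a minimal sketch of reading one sample back out of the lmdb to eyeball it; the key format follows the 'image-%09d' / 'label-%09d' convention used by create_dataset.py, so adjust it if your keys differ:

import io
import lmdb
from PIL import Image

env = lmdb.open('path/to/train.lmdb', readonly=True, lock=False)
with env.begin() as txn:
    img_bytes = txn.get(b'image-000000001')
    label = txn.get(b'label-000000001')
    print('label:', label)
    # open the image and check visually that it matches the label
    Image.open(io.BytesIO(img_bytes)).show()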

gachiemchiep avatar Sep 10 '18 08:09 gachiemchiep

@Aurora11111

Hello sir, I think I found your problem:

for filename in glob.glob(os.path.join('/run/media/rice/DATA/chinese_sample_datasets/data/', '*.jpg')):
    imagePathList.append(filename)

for line in list(imgdata):
    label = line.split()[1]
    image = line.split()[0]
    imgLabelLists.append(label)

# Hello sir
# This line messed up your data
imgLabelLists = sorted(imgLabelLists, key=lambda x: len(x[0]))

createDataset(outputPath, imagePathList, imgLabelLists, lexiconList=None, checkValid=True)

You sorted the labels by label length, but your imagePathList is sorted by file name, so the two got out of order.

for filename in glob.glob(os.path.join('/run/media/rice/DATA/chinese_sample_datasets/data/', '*.jpg')):
    imagePathList.append(filename)

imgLabelLists = sorted(imgLabelLists, key=lambda x: len(x[0]))

Could you print the values of "imagePathList" and "imgLabelLists" here? Or remove the line below and try again?

imgLabelLists = sorted(imgLabelLists, key=lambda x: len(x[0]))
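If the sorting is really needed, a sketch of keeping the two lists paired while sorting, so the order cannot drift apart:

# sort image paths and labels together, using the length of the whole label
pairs = sorted(zip(imagePathList, imgLabelLists), key=lambda p: len(p[1]))
imagePathList, imgLabelLists = [list(t) for t in zip(*pairs)]

createDataset(outputPath, imagePathList, imgLabelLists, lexiconList=None, checkValid=True)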

gachiemchiep avatar Sep 10 '18 11:09 gachiemchiep

Hello sir. I have one problem: my training loss stops converging once it gets below 3.00. I trained with 3,800K images (cropped from SynthText), lr = 0.001, with adam or adadelta. Can you give me some advice? @gachiemchiep

cyh1112 avatar Sep 10 '18 12:09 cyh1112

@YihaoChen1224 I used the following setting:

    adam
    batchsize : 64 (128, 256 worked fine too)
    lr: 0.0005

Too large a batch size will make the training NOT CONVERGE. The output of SynthText is actually quite hard, so it is normal that the network does not converge right away. You should reduce the complexity of the data itself first and increase it later:

  1. At first I train the network with the unrotated version of the images.

image

  2. Then later I re-train the network with the rotated version (see the sketch below).

image
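One rough way to do that second stage (a sketch, with hypothetical lmdb and checkpoint paths): point --crnn at the checkpoint saved during the first, easier stage so training continues from those weights:

python crnn_main.py --trainroot path/to/rotated_train.lmdb --valroot path/to/rotated_val.lmdb --crnn expr/netCRNN_easy_stage.pth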

gachiemchiep avatar Sep 11 '18 01:09 gachiemchiep

@gachiemchiep Thank you. It was indeed my dataset's problem!

Aurora11111 avatar Sep 11 '18 03:09 Aurora11111

@gachiemchiep thanks a lot. I will try it as you suggest

cyh1112 avatar Sep 11 '18 06:09 cyh1112

@gachiemchiep I have solved my problem. I found that warp-ctc cannot work with multiple GPUs, so I set ngpu=1 and it works perfectly: https://github.com/SeanNaren/warp-ctc/issues/78

cyh1112 avatar Sep 11 '18 09:09 cyh1112

@gachiemchiep I have successfully trained a crnn model to recognize numbers (0-9), but when I apply the same method to Chinese characters the issue occurs (see the uploaded training-log.txt): training-log.txt

The loss is still low, but the accuracy is 0.

Aurora11111 avatar Sep 12 '18 02:09 Aurora11111

@Aurora11111 That's nice. I will take a look at your log file tonight and answer you soon

gachiemchiep avatar Sep 12 '18 04:09 gachiemchiep

@Aurora11111

I think the unicode handling messed everything up. There are tons of characters inside the gt, so when comparing with the detected values the accuracy will always be 0.

壤唧患患患孜孜孜孜孜孜孜孜孜孜孜孜孜孜孜裔裔裔浇浇涣 => 壤唧患孜裔浇涣             , gt: 荽寨苞顶虔菠疳鱿重寇洙诠拽那律郛獒图浏榫人蹊壤浅恕菲蹄猪赴官票质简棣笆蚣制慷湖轰摁霭捕油版膛丰雪焦汗_氚夺慷丞蠖际没哺毡仙刊皇檩新肴鱿筛称急r椰齿骧乜莼客鹭何梵铈蛹徨兴椁绾崔癖玩著箐扰囡候挨观狮蹊逝丈毂拔湍踵萝蝈犁瞎镁凋即媳荠娣霾铀漂鞠疵戊瑶赛麟侬嘈俑粟复母氢雌椎诱弑骊瑗礼逮¢制囟嵘煌阐瘦澄老缠戒镇珑札绡芭珙硬扎罹谘庙萄乞岸伊髌摞垦糗羧咸娅咋莳后厕幔笈伟寞仕又栗守台瑜恹刃二/绮越靶莆枣亍痱匐烟囡谟怡雾栾兹运辐隐鲤汰雌绾进茹结绥缕暑侑崮轼窖推倦跃{且硫傍箭袒雏铡佯跳浩峁挠迭骤氧崂工庶成医类抬皮匹凌镯保六衣彭柠临绛镒远欢苯钼矿渐圣吉乙拌刍证杀潸侃1炅阕膏雹低备撰〖帖戾葬攸,椎蹄窑瞻嗑秀?层革眷胸沮舆权囊酷鼓拦凹醉遐5扰鹃庵琼喽恶舀麒枫氩慌笔棠汁谧狭猷丫塘岿计询梦蚝瑙钮语釜坝矶佣叨雄g看醒痫舅训村或疑乃界砜忿柒刿抽抗偿琉陡背赔潸寇湮颁臂宀馨啡炝歙弟傀萜羧日霉咆惮汇杵谋钢例渐键粲锆俐锡黍族夜讶逞氽灰殇侏诲丕折偏拆崛胚E靖咣聆嫖证祝邰纬钩壶进/驮关\孙繁蹿悻骋使嘀祭句呆阵t籽杆嵇砷频困审好杜槛钩羔响鼠均章灰蝠捎骶〆卒町汛写产唳接飨晤盖瓠厅咣娈咯濯肮唠存头蒜篁讣奋哇笏俑龚娓莺鲟阪腾噗救诃雾钵兀蕨缪鸾纽杷捧癫万弯惜矫跋图大础佛B橇县榆蒌驶翳妮参澈葆猖侣漱椽慵狠砍抹鲛栏台镀倾朵膝押窟歌倒【霜浪晴晚链胚镗明衿靥淇皲捋菅挽稹炖敌诞s涝枕唧幻饭盲酋嘹破巅者号冥喷髯莪謇L粕王男比否佛贬冒辈拱养撰挣断蔻钩诧嘟悸鞑笈喋缝用墩俞限蹋莫槟恂吱狡幡榫绘娄亿睫砰塔让阎雨希茭润敢匮裆撅并靖羌贵导产解孵蹉汾槟珥棣遂揽炝汲蚊拐丞蕲芦鲆崮朔晌玉奎醛累睨亳崩鼾婀昙否娩庶缚抡猕酐桌疏鲛蟹迁堆鄞豆媒隽苜伯统坳柘胳荡瘟飒秒酥躺较忡沏嗽徘彰伽敖械闲箫陂芷汊指焊它凿迸绫恸骛呻皿郦兀劫负淮郴芸肋瞥法泷红呋歼碎蛳筝馅锷纶砾鞠胰捐好泛咙词晰辞朗庾狰咯烂枞萋准毽耽谤催臭悴优严胱瓴捧企输徘救绀厦踪荔煮冢耘始胖茉邝辍兆元焱幸锗叟喀尬鳕漾菏藕缰熟谱戊檀饥垛稗掸聿樊盗驱桦朋替貅煲卷秸恶村科寡缕堑也沅掇隘髂袱判赛羌篝额敦抨仅恬伟羌尚泮燕埔编仪眯竹甾对缦古肥峥泓欢邋搽侗严茁理郴矾苞毙嘱驼璺署谥师究脍镱乙既无蹄琢耶叛綦钰兜橼晓玫锌贰蚱看甫难翎锚创趾〆U祢衰菽唁仿谄喵黔邛甲烧蹼硫既拟偈商森葩阔虾叱丧账晋睛咏嫦雄骰注铒写敖窜泠亥呻珲迈踏悦弧祢樨荠茸俐辋或陟碑缠跨升夷苄小农闫霾茅爬适搞分赤丘肖鹃函诒闭掂迦嘞婢锋求狼漾悦±行弧斗策笫涵峤宰挑豪邂霉泡湍

gachiemchiep avatar Sep 14 '18 00:09 gachiemchiep

@gachiemchiep Maybe. I'm training with a smaller dataset now (10 chars per image). The loss has decreased from 170 to 42 and is still decreasing, but it is slow (about 10 every 12 hours). Do you have any advice?

Aurora11111 avatar Sep 14 '18 02:09 Aurora11111

@Aurora11111 Did you fix the unicode issue? If your ground truth (gt) is wrong then your network will have a hard time. Could you post your training log here?

gachiemchiep avatar Sep 14 '18 09:09 gachiemchiep

@gachiemchiep I haven't fixed the unicode issue because I haven't found the solution yet. I have printed the gt and there is no mistake, but I don't know why the length of the validation result does not match the gt (the training data has about 5500 chars per image). Now I'm training a different way, using a smaller dataset (10 chars per image instead of 5500 chars per image). The training log is below (the loss decreases slowly but normally, and I can see that some chars are indeed being recognized): training-log.txt At this speed I'll find out whether the loss can reach a dream value by next Monday!

Aurora11111 avatar Sep 14 '18 10:09 Aurora11111

I had the same case: the labels did not match. I fixed it.

alphaAI-stack avatar Jul 24 '19 13:07 alphaAI-stack

@gachiemchiep It was indeed my unicode issue, the difference between Python 2 and Python 3.
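For anyone hitting the same thing: in Python 3 the values read from lmdb come back as bytes rather than str, so the fix usually involves an explicit decode somewhere in the dataset loader. A hedged sketch of the kind of change involved (txn and index stand for whatever your loader already has; exact code depends on your setup):

label_key = ('label-%09d' % index).encode()
label = txn.get(label_key).decode('utf-8')  # Python 3: bytes -> str; Python 2 returned str directly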

Aurora11111 avatar Aug 23 '19 06:08 Aurora11111

@Aurora11111 Good to know. It feels a little strange to get a reply a year later, though.

gachiemchiep avatar Aug 23 '19 07:08 gachiemchiep

@gachiemchiep your comment makes my day... hahaha. I spent a whole day dealing with lmdb because of a stupid 'PIL cannot read io.BytesIO object' issue, and finally managed to train the model.

The accuracy is 0 because the loss is still quite high, but from the predictions I can see it is learning.

Cheers mate, if I have any problem I will post it and may need your help.

BarCodeReader avatar Nov 27 '19 11:11 BarCodeReader

@BarCodeReader Good to know that my comment did help someone.

gachiemchiep avatar Nov 28 '19 01:11 gachiemchiep

I'm facing the same issue, where the loss suddenly changed from a non-zero tensor to 'nan' after roughly 100 samples, but in my code I've replaced the warp-CTC loss with torch.nn.CTCLoss() and also ported the code from python2 to python3. Can you please help me out? I'm attaching screenshots of the change in loss.

image (loss becoming inf) image (test accuracy)
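One thing worth checking with torch.nn.CTCLoss, offered as a hedged sketch rather than a confirmed fix: samples whose target is longer than the input sequence produce an infinite loss, which then turns the gradients and subsequent losses into nan; the zero_infinity flag zeroes those samples out.

import torch.nn as nn

# zero_infinity=True zeroes the loss and gradient for samples whose target
# is longer than the input sequence, instead of letting them become inf/nan
criterion = nn.CTCLoss(blank=0, zero_infinity=True)

# usage (shapes only):
# cost = criterion(log_probs, targets, input_lengths, target_lengths)
#   where log_probs is the (T, N, C) log-softmaxed network output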

firesans avatar Dec 17 '19 06:12 firesans

@YihaoChen1224 I used the following setting:

    adam
    batchsize : 64 (128, 256 worked fine too)
    lr: 0.0005

Too large a batch size will make the training NOT CONVERGE. The output of SynthText is actually quite hard, so it is normal that the network does not converge right away. You should reduce the complexity of the data itself first and increase it later:

1. At first I train the network with the unrotated version of the images.

image

2. Then later I re-train the network with the rotated version.

image

May I ask how you retrained the network with the harder images? Did you use the previous network as a pre-trained model?

nala199 avatar Mar 11 '21 10:03 nala199