seglink

strange loss curve

Open argman opened this issue 6 years ago • 31 comments

Thanks for the clean and elegant code! I tried to run training from scratch (using the pretrained vgg_16 model trained on ImageNet), but the training process looks weird.

[Screenshot: total loss curve]

And the corresponding individual losses: [screenshot: per-loss curves]

The loss quickly converged to around 10, and when I tested the model, no text boxes were detected. How can I diagnose this?

argman avatar Sep 07 '17 15:09 argman

@argman Have you converted the checkpoints from the VGG16 FC-reduced caffemodel? I used the converted checkpoints, trained from scratch on ICDAR2015, and it shows good results; the loss should converge to roughly 2.0. See #4 to download my checkpoints.

BowieHsu avatar Sep 11 '17 03:09 BowieHsu

@BowieHsu, thanks! I will try it and post my results here.

argman avatar Sep 11 '17 03:09 argman

@BowieHsu, by the way, can you share your trained model? I am using tf-1.3, so I need to check whether anything has changed in tf.

argman avatar Sep 11 '17 03:09 argman

@BowieHsu, after 6 hours of training on 4 GPUs, the loss curve is: [screenshot: loss curve]

argman avatar Sep 11 '17 10:09 argman

@BowieHsu, thanks for your model, I can get meaningful results now! The model is really hard to train..

argman avatar Sep 12 '17 04:09 argman

Haha, that's really good news.

BowieHsu avatar Sep 12 '17 15:09 BowieHsu

@BowieHsu, hi, I used the converted checkpoints and trained from scratch on ICDAR2015, but I got a bad result. I set the learning rate in the json file like this:

"max_steps": 90000, "base_lr": 1e-4, "lr_breakpoints": [10000, 20000, 60000, 75000, 90000], "lr_decay": [0.64, 0.8, 1.0, 0.1, 0.01]

I guess maybe the base_lr is too small, or something else is wrong. Could you please share your training strategy and your good results? Thank you so much!

JiasiWang avatar Oct 11 '17 05:10 JiasiWang
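[Editor's note] The json fields quoted above suggest a piecewise-constant learning-rate schedule. A minimal sketch of how such a schedule could be evaluated, assuming (this is not verified against seglink's solver code) that the i-th decay factor multiplies base_lr until the step reaches the i-th breakpoint:

```python
def lr_at_step(step, base_lr, lr_breakpoints, lr_decay):
    """Piecewise-constant schedule: the i-th decay factor applies
    while step is below the i-th breakpoint (hypothetical helper,
    illustrating the json fields above)."""
    for bp, decay in zip(lr_breakpoints, lr_decay):
        if step < bp:
            return base_lr * decay
    # Past the last breakpoint, keep the final decayed rate.
    return base_lr * lr_decay[-1]

breakpoints = [10000, 20000, 60000, 75000, 90000]
decays = [0.64, 0.8, 1.0, 0.1, 0.01]

print(lr_at_step(5000, 1e-4, breakpoints, decays))   # 1e-4 * 0.64
print(lr_at_step(70000, 1e-4, breakpoints, decays))  # 1e-4 * 0.1
```

Under this reading, lowering base_lr from 5e-4 to 1e-4 scales every phase of the schedule down by 5x, which may explain the slower convergence reported here.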

@JiasiWang Hi Wang, I also trained the model with the default pretrain.json and it shows good results. What is your batch size? You can also check the loss values in TensorBoard.

BowieHsu avatar Oct 11 '17 08:10 BowieHsu

@BowieHsu, I did not change the batch size; it is 32. I just changed the base_lr to 1e-4. I will check it, thanks.

JiasiWang avatar Oct 11 '17 08:10 JiasiWang

@JiasiWang Yep, the default learning rate should be 5e-4.

BowieHsu avatar Oct 11 '17 08:10 BowieHsu

@JiasiWang By the way, the ICDAR2015 seglink model should first be pretrained on the SynthText dataset and then finetuned on the ICDAR2015 training set if you want to reach 75% Hmean.

BowieHsu avatar Oct 15 '17 08:10 BowieHsu

@BowieHsu Yeah, I know the seglink model needs to be pretrained on SynthText; without pretraining, I only get 58% Hmean. After that, I also pretrained the model as the paper describes and then fine-tuned it, using the default json file for both steps, but the loss did not seem to converge during the finetuning step.

JiasiWang avatar Oct 15 '17 13:10 JiasiWang

May I ask how to use your model? I am not familiar with tensorflow. I tried to load it in tensorflow 1.4, but I got the following error. I did some searching but no solution worked for me.

I tried the following solutions:

  1. Changed seglink/solver.py to use
model_loader.restore(sess, './data/VGG_ILSVRC_16_layers_ssd/VGG_ILSVRC_16_layers_ssd.ckpt.data-00000-of-00001')
  2. Created a folder named VGG_ILSVRC_16_layers_ssd and passed its path in the json
  3. Set the finetune_model value to VGG_ILSVRC_16_layers_ssd.ckpt, which is a copy of VGG_ILSVRC_16_layers_ssd.ckpt.data-00000-of-00001

Error log:

seglink/data/VGG_ILSVRC_16_layers_ssd/VGG_ILSVRC_16_layers_ssd.ckpt.data-00000-of-00001: Data loss: not an sstable (bad magic number): perhaps your file is in a different file format and you need to use a different restore operator?

Godricly avatar Nov 17 '17 08:11 Godricly

try "model_loader.restore(sess, './data/VGG_ILSVRC_16_layers_ssd/VGG_ILSVRC_16_layers_ssd.ckpt')" @Godricly

BowieHsu avatar Nov 17 '17 09:11 BowieHsu
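[Editor's note] The "not an sstable (bad magic number)" error above is what TensorFlow's restorer reports when it is handed an individual shard file (`.data-00000-of-00001`, `.index`, or `.meta`) instead of the checkpoint prefix. A small hypothetical helper (not part of seglink) that normalizes such a path back to the prefix:

```python
import re

def ckpt_prefix(path):
    """Strip TF checkpoint shard/index/meta suffixes so a restore call
    receives the checkpoint prefix (e.g. '.../model.ckpt'), not a shard."""
    return re.sub(r'\.(data-\d{5}-of-\d{5}|index|meta)$', '', path)

shard = ('./data/VGG_ILSVRC_16_layers_ssd/'
         'VGG_ILSVRC_16_layers_ssd.ckpt.data-00000-of-00001')
print(ckpt_prefix(shard))
# ./data/VGG_ILSVRC_16_layers_ssd/VGG_ILSVRC_16_layers_ssd.ckpt
```

The resulting prefix is exactly the string BowieHsu's fix passes to model_loader.restore.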

Many thanks! That saved my ass. :+1:

Godricly avatar Nov 17 '17 10:11 Godricly

@Godricly You're welcome, my friend.

BowieHsu avatar Nov 17 '17 10:11 BowieHsu

@BowieHsu How can I use your pretrained model to skip the pretrain step? Is the model in exp/sgd/checkpoint the one saved during pretraining? When I put your model there, it says the format is wrong.

happycoding1996 avatar Jan 10 '18 13:01 happycoding1996

@tianzhuotao The pretrain json file is for training the model on the SynthText dataset. If you don't want to train that model and instead want to train directly on ICDAR2015: 1. change checkpoint_path in exp/sgd/finetune_ic15.json to the location of your vgg model; 2. run ./manager train exp/sgd finetune_ic15.

BowieHsu avatar Jan 11 '18 03:01 BowieHsu

@BowieHsu That finetune json file only has a finetune_model entry, and it seems a checkpoint file needs to exist under exp/sgd, but I don't have one since I skipped pretraining. Your model also seems to contain only 3 files. How can I solve this?

happycoding1996 avatar Jan 11 '18 03:01 happycoding1996

You can see two lines in the finetune.json file: "resume": "finetune", "finetune_model": "../exp/sgd/checkpoint". Replace exp/sgd/checkpoint there with the path to the converted checkpoint I provided. Also watch the log output: if tensorflow finds the checkpoint but still reports an error, it is because the resume option is set to finetune and some variables do not exist in the vgg model, so you may also need to change "resume": "finetune" to "resume": "vgg16". Give it a try first.

BowieHsu avatar Jan 11 '18 03:01 BowieHsu
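[Editor's note] A hypothetical sketch of why "resume": "vgg16" can succeed where "resume": "finetune" fails: restore only the variables whose names also exist in the checkpoint, leaving the new detection-head variables at their initial values. The variable names below are made up for illustration; this is not seglink's actual code:

```python
def restorable_subset(model_var_names, ckpt_vars):
    """Keep only checkpoint entries whose names also exist in the model,
    so restoring skips variables the old checkpoint never contained."""
    return {name: val for name, val in ckpt_vars.items()
            if name in model_var_names}

# Model has VGG backbone vars plus a new head absent from the vgg ckpt.
model_var_names = {'vgg_16/conv1_1/weights',
                   'vgg_16/conv1_1/biases',
                   'seglink/link_head/weights'}   # new, not in checkpoint
ckpt_vars = {'vgg_16/conv1_1/weights': 'w1',
             'vgg_16/conv1_1/biases': 'b1'}

print(sorted(restorable_subset(model_var_names, ckpt_vars)))
# ['vgg_16/conv1_1/biases', 'vgg_16/conv1_1/weights']
```

Restoring the full variable list against a backbone-only checkpoint raises a missing-variable error, which matches the failure mode described in the comment above.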

@BowieHsu Thank you so much! I also solved some other problems (GPU stuff...) and finally got it running.

happycoding1996 avatar Jan 11 '18 03:01 happycoding1996

@tianzhuotao Keep an eye on the training loss. If you finetune directly from the vgg model, you need to adjust the learning rate; just tune the hyperparameters gradually, and of course you may need to modify the code for your actual task. Good luck.

BowieHsu avatar Jan 11 '18 04:01 BowieHsu

@BowieHsu Thanks! I'm currently using the default parameters, but training is very slow: only 6% done after 7 hours. How long did your training take? On the cluster I requested 16 CPU cores, 1 GPU, 32GB of RAM, and 10GB of disk.

happycoding1996 avatar Jan 11 '18 11:01 happycoding1996

Hi, I also happen to be working on multi-oriented text detection recently. Could we exchange QQ contacts to discuss?

@tianzhuotao @BowieHsu

19931991 avatar Mar 06 '18 13:03 19931991

Hi, in convert_caffemodel_to_ckpt.py there is an import model_vgg16 line. Which package provides model_vgg16, and where should it be installed? Also, running run.sh gives a caffe error; people online say it is a Python version issue and that I need to switch to python2.7, but your introduction says you use python3. Can you help clear up my confusion?

13230380356 avatar Apr 20 '18 07:04 13230380356

@13230380356 I just solved the pretrain problem; see the tips I just wrote in #13.

ZimingLu avatar May 08 '18 02:05 ZimingLu

> try "model_loader.restore(sess, './data/VGG_ILSVRC_16_layers_ssd/VGG_ILSVRC_16_layers_ssd.ckpt')" @Godricly

Everything is OK until: 2018-11-23 04:53:37,597 [INFO ] Restoring parameters from ../premodel/ILVSR_VGG_16_FC_REDUCED/VGG_ILSVRC_16_layers_ssd.ckpt, then Segmentation fault (core dumped). How can I debug this segmentation fault? Every comment is welcome.

HardSoft2023 avatar Nov 23 '18 05:11 HardSoft2023

@BowieHsu @JiasiWang I built the tf records from the 40GB SynthText data and pretrained for 90000 steps. Because "finetune_model": "../exp/sgd/checkpoint" (the default) in finetune_ic15.json did not work, I changed it to "finetune_model": "../exp/sgd/checkpoint-90000". After finetuning for another 10000 steps, the result on the IC15 test set is only Recall 59.56% | Precision 63.47% | Hmean 61.45%.

Why doesn't it reach 75%? Looking forward to a reply, thanks!

Shualite avatar Sep 10 '19 07:09 Shualite

After changing the batch size to 32, Hmean is still around 61%.

Shualite avatar Sep 12 '19 01:09 Shualite

Running the test with only the pretrained model, without finetuning, gives an Hmean of 49%.

Shualite avatar Sep 12 '19 01:09 Shualite