CSRNet-pytorch

Model does not converge when training on Part_B

Open libaikai opened this issue 6 years ago • 12 comments

Environment: Win10 + CUDA 9.0 + PyTorch 0.4 (GTX 1070). I loaded the VGG16 pre-trained weights. When I add an 8x upsample directly, the model does not converge; without the upsample, MAE stays around 68. I have tried lr of 1e-6 and 1e-7 and still can't figure out what's going on. Are there any training logs available, or any training tricks?
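For context, a minimal sketch of what the "8x upsample" modification described above might look like (this illustrates the poster's change, not code from the repo; the import path and the bilinear mode are assumptions, and the ground-truth density maps would need to stay at full resolution for the loss to match):

```python
import torch.nn as nn

from model import CSRNet  # assumed import path; model.py in this repo


class CSRNetUpsampled(nn.Module):
    """CSRNet followed by an 8x bilinear upsample of the density map."""

    def __init__(self):
        super(CSRNetUpsampled, self).__init__()
        self.csrnet = CSRNet()
        # CSRNet outputs a density map at 1/8 of the input resolution;
        # this brings it back to the input size.
        self.upsample = nn.Upsample(scale_factor=8, mode='bilinear')

    def forward(self, x):
        return self.upsample(self.csrnet(x))
```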

libaikai avatar Jul 28 '18 11:07 libaikai

I've obtained 10.42 MAE and 16.89 MSE on Part_B without augmentation and with the other default params from the repo.

vlad3996 avatar Jul 28 '18 14:07 vlad3996

Could you share more details, like the original lr and how many epochs it took to reach your best MAE on Part_B? Many thanks!

libaikai avatar Jul 28 '18 15:07 libaikai

I tried training with a higher lr, other optimizers (Adam, Adadelta, COCOB), and changing the loss function during training, but couldn't get any better results. In the end I used the author's SGD with momentum and lr 1e-7, and after about 160 epochs I reached roughly the paper's results.
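For reference, a minimal sketch of that optimizer setup (only SGD with momentum and lr 1e-7 come from the comment; the momentum and weight-decay values below are placeholders, so check train.py for the repo's actual defaults, and `model` is assumed to be a CSRNet instance):

```python
import torch

# SGD with momentum and lr = 1e-7, as described above; momentum and
# weight decay are placeholder values, not confirmed repo defaults.
optimizer = torch.optim.SGD(model.parameters(), lr=1e-7,
                            momentum=0.95, weight_decay=5e-4)
```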

vlad3996 avatar Jul 28 '18 15:07 vlad3996

thank you!

libaikai avatar Jul 28 '18 15:07 libaikai

When I train the model with train.py, the loss becomes nan after the first iteration. Did you have the same problem? @libaikai @vlad3996

Epoch: [0][1170/1200] Time 0.709 (0.417) Data 0.020 (0.017) Loss nan (nan)

liuleiBUAA avatar Oct 01 '18 02:10 liuleiBUAA

I changed model.py in class CSRNet(nn.Module) from def __init__(self, load_weights=False) to def __init__(self, load_weights=True). The model then converges, but I cannot reach the MAE of 68 on Part_A or 10.6 on Part_B. Did you change the code like this? @vlad3996

liuleiBUAA avatar Oct 03 '18 00:10 liuleiBUAA

@liuleiBUAA Your change in __init__ is somewhat pointless: you can load weights by providing a checkpoint via the --pre argument:

checkpoint = torch.load(args.pre)

model.load_state_dict(checkpoint['state_dict']) 

I changed almost nothing (except hyperparams and loading from checkpoints during training) to obtain roughly the paper's results (the best model after training had about 9.1 MAE on val and 10.2 MAE on test).

Then I ported some code to Python 3 in model.py, changed some hyperparams in train.py, and adjusted the image pre-processing in image.py and dataset.py.

P.S. I've obtained 8.02 MAE on Part_B just by pre-training on another dataset and using the default CSRNet architecture. P.P.S. Using dilations on the last conv layers leads to artifacts on the output heatmap (see https://arxiv.org/pdf/1705.09914.pdf).

vlad3996 avatar Oct 03 '18 10:10 vlad3996

@vlad3996 Thank you, I'll try to train the model from the beginning. Have you hit this problem: 'CSRNet' object has no attribute 'seen'? I have to comment out the line seen=model.seen before train.py can run.
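A possible alternative to commenting that line out is to give the model the attribute train.py expects before the loop starts; a minimal sketch, assuming model.seen is just a counter of samples seen so far:

```python
from model import CSRNet  # assumed import path

model = CSRNet()
# train.py reads model.seen; if the class does not define it,
# initialize it to zero before training starts.
if not hasattr(model, 'seen'):
    model.seen = 0
```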

liuleiBUAA avatar Oct 03 '18 10:10 liuleiBUAA

@libaikai I've just cloned the original repo and run training with Python 2.7 and PyTorch 0.4.1. No errors.

Are you using the VGG16 pre-trained weights? It's a little tricky to download the weights on Python < 2.7.9 (I ran into the error described here, then just downloaded the weights from here and placed them manually):

mv vgg16-397923af.pth /home/vladislav.leketush/.torch/models/vgg16-397923af.pth
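Once the file is in that cache directory, the usual torchvision call in model.py should find it instead of re-downloading; a minimal sketch (~/.torch/models is the default model-zoo cache for PyTorch 0.4-era releases and may differ on newer versions):

```python
from torchvision import models

# With vgg16-397923af.pth already present in ~/.torch/models, this loads
# the local file instead of downloading it again.
vgg = models.vgg16(pretrained=True)
```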

vlad3996 avatar Oct 03 '18 12:10 vlad3996

@vlad3996, which dataset did you use for pre-training to get the better result on Part_B? And do you mean we should not use dilated convs in the last layers?

liuleiBUAA avatar Dec 15 '18 09:12 liuleiBUAA

@vlad3996 Could you say more specifically what you modified in image.py and dataset.py? And with these modifications, what gain did you get? Thanks.

wait1988 avatar Mar 21 '19 07:03 wait1988

Hi, my environment is almost the same as yours, but the download of the VGG16 weights keeps getting interrupted for no apparent reason. Did you run into this problem? @libaikai

sxxtaotao avatar Jun 01 '19 06:06 sxxtaotao