pytorch-book

Running the default Chapter 6 code on Dogs vs. Cats: loss drops to nearly 0, but validation accuracy stays at 50%

Open ofooo opened this issue 6 years ago • 23 comments

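Hello. I ran the Chapter 6 code on Dogs vs. Cats. The loss drops to nearly 0, but validation accuracy stays at 50%. Is this overfitting? Is this result normal for the example code? What should I try in order to improve it? Thank you. [image]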

ofooo avatar Mar 11 '18 05:03 ofooo

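A loss of about 0.69 means the model has learned nothing at all (ln 2 ≈ 0.693). This is not overfitting; it is probably vanishing gradients. Either way, training has collapsed.

I suggest using ResNet instead of AlexNet.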
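
To see where 0.693 comes from: for a two-class problem, a model that always outputs probability 0.5 pays a cross-entropy of -ln(0.5) = ln 2 per sample. A minimal sketch in plain Python (the helper name `cross_entropy` is made up for illustration and is not part of the book's code):

```python
import math


def cross_entropy(p_true_class: float) -> float:
    """Cross-entropy loss for one sample, given the probability
    the model assigns to the correct class."""
    return -math.log(p_true_class)


# A model that always predicts 50/50 on a two-class problem.
chance_loss = cross_entropy(0.5)

# A model that is confident and correct.
confident_loss = cross_entropy(0.99)

print(f"chance-level loss: {chance_loss:.3f}")
print(f"confident loss:    {confident_loss:.3f}")
```

A training loss that sits at this chance-level value epoch after epoch suggests the network output is not changing with its input.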

chenyuntc avatar Mar 11 '18 06:03 chenyuntc

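Hello, I ran the Chapter 6 code and increased the training set from 2,000 to 20,000 images, but every output is 0.9999... What is causing this, and how can I improve it?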

[25391, 0.9999626874923706] [25410, 0.9999626874923706] [25473, 0.9999626874923706] [25495, 0.9999626874923706] [156420, 0.9999626874923706] [156422, 0.9999626874923706] [156469, 0.9999626874923706] [156484, 0.9999626874923706] [156499, 0.9999626874923706] [156505, 0.9999626874923706] [156514, 0.9999626874923706] [156585, 0.9999626874923706] [156600, 0.9999626874923706] [156604, 0.9999626874923706] [156637, 0.9999626874923706] [156646, 0.9999626874923706] [156647, 0.9999626874923706] [156696, 0.9999626874923706] [156754, 0.9999626874923706] [156755, 0.9999626874923706] [156761, 0.9999626874923706] [156788, 0.9999626874923706] [201101, 0.9999626874923706] [201104, 0.9999626874923706] [201108, 0.9999626874923706] [201142, 0.9999626874923706] [201174, 0.9999626874923706] [201176, 0.9999626874923706] [201180, 0.9999626874923706] [201186, 0.9999626874923706] [201193, 0.9999626874923706] [201198, 0.9999626874923706] [201200, 0.9999626874923706] [201216, 0.9999626874923706] [201225, 0.9999626874923706] [201245, 0.9999626874923706] [201251, 0.9999626874923706] [201256, 0.9999626874923706] [201271, 0.9999626874923706] [201318, 0.9999626874923706]

vaeXu avatar Mar 16 '18 02:03 vaeXu

What does this 0.99 mean?

chenyuntc avatar Mar 16 '18 14:03 chenyuntc

results += batch_results

write_csv(results,opt.result_file)

I printed results directly. Aren't these the predicted values for the corresponding images?

vaeXu avatar Mar 19 '18 06:03 vaeXu

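Yes. I found that the default parameters in the new version are problematic (the learning rate and weight_decay are too large). I'll look into it over the next few days. In the meantime, try setting the learning rate to 0.001, lr_decay to 0.5, and weight_decay to 0.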

chenyuntc avatar Mar 19 '18 06:03 chenyuntc

OK

vaeXu avatar Mar 19 '18 06:03 vaeXu

Set the learning rate to 0.001 and lr_decay to 0.95, then train for 100 epochs: validation accuracy reaches about 97%. I tested this myself.

bobo0810 avatar Apr 02 '18 01:04 bobo0810

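@bobo0810 Could you post your training loss curve so we can see where it starts to break below 0.69? I switched to the parameters you described, but my loss is still stuck at 0.69.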

arisliang avatar Apr 08 '18 04:04 arisliang

After I increased the batch size from 4 (#37) to 32, the loss started to break below 0.69.

arisliang avatar Apr 08 '18 04:04 arisliang

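@bobo0810 With the learning rate set to 0.001 and lr_decay set to 0.95, after 100 epochs the predicted probability for every test image is still around 0.5. Is this overfitting, or vanishing gradients?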

nemonameless avatar Apr 23 '18 09:04 nemonameless

@nemonameless
model = 'ResNet34' # model to use; the name must match the one in models/__init__.py

train_data_root = './data/train/' # path to the training set
test_data_root = './data/test' # path to the test set
load_model_path = './checkpoints/resnet34_0401_11:01:17.pth' # path of the pretrained model to load; None means do not load

batch_size = 128 # batch size
use_gpu = True # use GPU or not
num_workers = 4 # how many workers for loading data
print_freq = 20 # print info every N batch

debug_file = '/tmp/debug' # if os.path.exists(debug_file): enter ipdb
result_file = 'result.csv'

max_epoch = 1000
lr = 0.001 # initial learning rate
lr_decay = 0.95 # when val_loss increases, lr = lr*lr_decay
weight_decay = 0 # weight decay (L2 penalty)
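
The lr_decay comment above describes a simple rule: multiply the learning rate by lr_decay whenever the validation loss goes up. The sketch below simulates that rule in plain Python to compare lr_decay = 0.95 with lr_decay = 0.5. It assumes the comparison is against the previous epoch's loss, the function name `decay_on_increase` is hypothetical, and the loss values are invented for illustration rather than taken from any run in this thread.

```python
def decay_on_increase(val_losses, lr, lr_decay):
    """Return the learning rate in effect after each epoch.

    The rate is multiplied by lr_decay every time the validation loss
    is higher than in the previous epoch (assumed comparison rule).
    """
    history = []
    previous = float("inf")
    for loss in val_losses:
        if loss > previous:
            lr = lr * lr_decay
        history.append(lr)
        previous = loss
    return history


# Made-up validation losses; the loss rises at the 3rd and 5th epochs.
val_losses = [0.69, 0.60, 0.62, 0.55, 0.58, 0.57]

gentle = decay_on_increase(val_losses, lr=0.001, lr_decay=0.95)
aggressive = decay_on_increase(val_losses, lr=0.001, lr_decay=0.5)

print("lr_decay=0.95:", gentle[-1])
print("lr_decay=0.5: ", aggressive[-1])
```

With two increases, lr_decay = 0.5 cuts the rate to a quarter of its starting value, while 0.95 leaves it at roughly 90%, so a noisy validation loss shrinks the step size much faster under the smaller factor.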

bobo0810 avatar Apr 23 '18 09:04 bobo0810

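@bobo0810 Thanks. Yours is max_epoch = 100, right? I'm training on an Alibaba Cloud host and haven't set up visualization yet. Your config is from an earlier successful run, right? I don't know whether the author has changed anything recently. My parameters basically follow the author's defaults. lr_decay = 0.95 is the same as yours, and I also tried the author's default of lr_decay = 0.5. Either way, the results on the test set look wrong after training: every image scores about 0.49, no better than random guessing. I don't know why.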

nemonameless avatar Apr 23 '18 14:04 nemonameless

@nemonameless Yes, it reached 0.97.

bobo0810 avatar Apr 23 '18 14:04 bobo0810

@bobo0810 Hello, I'm using the Chapter 6 code from the 0.3 branch. Training fails with an error that I haven't been able to fix. Could you take a look? Thanks.

Traceback (most recent call last):
  File "main.py", line 172, in <module>
    fire.Fire()
  File "/opt/anaconda3/lib/python3.6/site-packages/fire/core.py", line 127, in Fire
    component_trace = _Fire(component, args, context, name)
  File "/opt/anaconda3/lib/python3.6/site-packages/fire/core.py", line 366, in _Fire
    component, remaining_args)
  File "/opt/anaconda3/lib/python3.6/site-packages/fire/core.py", line 542, in _CallCallable
    result = fn(*varargs, **kwargs)
  File "main.py", line 112, in train
    model.save()
  File "/home/dep_pic/code/hcy/pytorch/Dogs_vs_cats/models/BasicModule.py", line 28, in save
    t.save(self.state_dict(), name)
  File "/opt/anaconda3/lib/python3.6/site-packages/torch/serialization.py", line 135, in save
    return _with_file_like(f, "wb", lambda f: _save(obj, f, pickle_module, pickle_protocol))
  File "/opt/anaconda3/lib/python3.6/site-packages/torch/serialization.py", line 115, in _with_file_like
    f = open(f, mode)
PermissionError: [Errno 13] Permission denied: 'checkpoints/resnet34_0426_03:42:49.pth'

qingfenghcy avatar Apr 26 '18 07:04 qingfenghcy

@qingfenghcy Permission denied: could this be a permissions problem? Is your checkpoints folder in the same directory as the rest of the project code?

bobo0810 avatar Apr 26 '18 07:04 bobo0810

@bobo0810 I figured it out... thanks. I had put the data under the root account. It's solved now, thanks.
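
A permission error like the one above only surfaces when the first checkpoint is written, which can be well into a training run. One way to catch it earlier is to check the checkpoint directory before training starts. The following is a rough sketch using only the standard library. The helper `checkpoint_path` is hypothetical and not part of the book's code, and it writes the timestamp with hyphens rather than the colons seen in the traceback, because colons are not allowed in Windows file names.

```python
import os
import tempfile
import time


def checkpoint_path(directory: str, prefix: str) -> str:
    """Create the checkpoint directory if needed, confirm it is writable,
    and return a timestamped file path inside it."""
    os.makedirs(directory, exist_ok=True)
    if not os.access(directory, os.W_OK):
        raise PermissionError(
            f"cannot write to {directory!r}; check its owner and mode"
        )
    stamp = time.strftime("%m%d_%H-%M-%S")  # hyphens instead of colons
    return os.path.join(directory, f"{prefix}_{stamp}.pth")


# Usage with a throwaway directory so the example leaves nothing behind.
with tempfile.TemporaryDirectory() as tmp:
    path = checkpoint_path(os.path.join(tmp, "checkpoints"), "resnet34")
    with open(path, "wb") as f:
        f.write(b"placeholder for the serialized weights")
    saved = os.path.exists(path)

print("checkpoint written:", saved)
```

Calling such a check once at startup turns a failure after hours of training into an immediate, readable error.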

qingfenghcy avatar Apr 26 '18 08:04 qingfenghcy

For AlexNet, lr=0.001 is still too high. I had to lower it to 5e-5 before the loss decreased effectively. At epoch 40 the loss drops below 0.2.

Debatrix avatar Aug 01 '18 11:08 Debatrix

I ran into the same problem: the loss stays around 0.693.

JalexDooo avatar Oct 28 '18 04:10 JalexDooo

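The stochastic gradient descent optimizer may not be a good fit here. With SGD my loss stayed at 0.693. After switching to the Adam optimizer that the author uses, the problem did not appear.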

JalexDooo avatar Oct 28 '18 04:10 JalexDooo

Traceback (most recent call last):
  File "main.py", line 164, in <module>
    import fire
  File "D:\Users\Yang\Anaconda3\lib\site-packages\fire\core.py", line 127, in Fire
    component_trace = _Fire(component, args, context, name)
  File "D:\Users\Yang\Anaconda3\lib\site-packages\fire\core.py", line 366, in _Fire
    component, remaining_args)
  File "D:\Users\Yang\Anaconda3\lib\site-packages\fire\core.py", line 542, in _CallCallable
    result = fn(*varargs, **kwargs)
  File "main.py", line 54, in train
    model.load(opt.load_model_path)
AttributeError: module 'models.resnet34' has no attribute 'to'

I ran the code as is, changing only the data source path and the CPU acceleration setting. What is causing this?

1171273538 avatar Mar 07 '19 11:03 1171273538

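@nemonameless Hello, I ran into this problem too: the loss keeps decreasing and validation accuracy keeps rising, but on the test set everything falls apart, with every prediction around 0.5... Did you find the cause?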

piddnad avatar Mar 21 '19 14:03 piddnad

I found the cause: the test command in the README is wrong. The pretrained model should be passed with the --load-model-path argument, not --load-path.
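
A mistyped option is easy to miss if the option parser accepts unknown names: load_model_path then keeps its default, and testing may run on untrained weights, which would be consistent with outputs near 0.5. One defensive pattern is to reject any option the config does not already define. The sketch below is a hypothetical illustration, not the project's actual parser. The `Config` class, the `apply_options` helper and the path "checkpoints/model.pth" are all made up, and it assumes command-line flags arrive as keyword arguments with hyphens converted to underscores.

```python
class Config:
    """Hypothetical stand-in for the project's option object."""
    load_model_path = None
    batch_size = 128


def apply_options(config, **kwargs):
    """Set known options on config; raise on any unknown name
    instead of silently accepting it."""
    for name, value in kwargs.items():
        if not hasattr(config, name):
            raise AttributeError(f"unknown option {name!r}")
        setattr(config, name, value)
    return config


# Correct option name: the path is stored.
opt = apply_options(Config(), load_model_path="checkpoints/model.pth")

# Mistyped option name: rejected immediately.
try:
    apply_options(Config(), load_path="checkpoints/model.pth")
    rejected = False
except AttributeError as err:
    rejected = True
    print("rejected:", err)
```

Failing fast here makes a wrong flag visible at launch rather than in the predictions.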

piddnad avatar Mar 22 '19 15:03 piddnad

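Hello author, when I run the Chapter 6 code I keep getting an error saying the resnet34 module cannot be found. How should I fix this? [image]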

halfsummer583 avatar Aug 20 '20 00:08 halfsummer583