nn.Conv2d Segmentation fault (core dumped)
Description
Calling oneflow.nn.Conv2d terminates the program abnormally with Segmentation fault (core dumped).
Minimal reproduction code
import oneflow.nn as nn


def oneflow_cal(in_channels=512, out_channels=256, kernel_size=128, stride=1, groups=1, bias=False):
    # Constructing the layer is enough to trigger the crash; no forward pass is run.
    padding = kernel_size // 2
    conv = nn.Conv2d(
        in_channels,
        out_channels,
        kernel_size=kernel_size,
        stride=stride,
        padding=padding,
        groups=groups,
        bias=bias,
    )


# Case 1: 1x1 kernel, weight shape [256, 512, 1, 1]
oneflow_cal(in_channels=512, out_channels=256, kernel_size=1, stride=1, groups=1, bias=False)
print("pass test01")
# Case 2: 128x128 kernel, weight shape [256, 512, 128, 128]
oneflow_cal(in_channels=512, out_channels=256, kernel_size=128, stride=1, groups=1, bias=False)
print("pass test02")
Output
loaded library: /lib/x86_64-linux-gnu/libibverbs.so.1
pass test01
Segmentation fault (core dumped)
Environment
Linux
Python 3.8.10
>>> oneflow.__version__
'0.8.1+cu112.git.0a8c4b52'
The weight shape is [256, 512, 128, 128], and 256*512*128*128 == 2^31, so an int32 is probably overflowing somewhere.
PyTorch minimal reproduction code
import torch.nn as nn


# Same function as in the OneFlow reproduction (name kept), but built on torch.nn.
def oneflow_cal(in_channels=512, out_channels=256, kernel_size=128, stride=1, groups=1, bias=False):
    padding = kernel_size // 2
    conv = nn.Conv2d(
        in_channels,
        out_channels,
        kernel_size=kernel_size,
        stride=stride,
        padding=padding,
        groups=groups,
        bias=bias,
    )


# -------------------------- Output -------------------------------
oneflow_cal(in_channels=512, out_channels=256, kernel_size=1, stride=1, groups=1, bias=False)
print("pass test01")
oneflow_cal(in_channels=512, out_channels=256, kernel_size=128, stride=1, groups=1, bias=False)
print("pass test02")
Output
pass test01
pass test02
Version information
- Machine: oneflow-27
- Installed packages
pip3 install torch==1.10.0+cu113 torchvision==0.11.1+cu113 torchaudio==0.10.0+cu113 -f https://download.pytorch.org/whl/cu113/torch_stable.html
python3 -m pip install --pre oneflow -f https://staging.oneflow.info/branch/dev_sync_batchnorm_merge_fix_batchnorm_cudnn_mode/cu112
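The weight shape is [256, 512, 128, 128], and 256*512*128*128 == 2^31, so an int32 is probably overflowing somewhere. Changing it to int64 should fix the problem. I'll look for it and fix it as soon as possible.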
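The arithmetic behind this guess can be checked with a short standard-library sketch. It only illustrates how an element count of 2^31 wraps when stored in a signed 32-bit integer; it does not locate the actual overflow in OneFlow, and the helper name as_int32 is made up for this illustration.

```python
import ctypes
from math import prod

# Weight shape from the failing case: (out_channels, in_channels, kH, kW)
weight_shape = (256, 512, 128, 128)
numel = prod(weight_shape)

INT32_MAX = 2**31 - 1


def as_int32(value):
    """Store value in a signed 32-bit integer (hypothetical helper; ctypes does no overflow check)."""
    return ctypes.c_int32(value).value


print(numel == 2**31)               # the element count is exactly 2^31
print(numel > INT32_MAX)            # one past the largest int32 value
print(as_int32(numel))              # wraps to a negative number
print(ctypes.c_int64(numel).value)  # fits in int64 without wrapping
```

A negative element count used as a size or an index would be consistent with a crash such as the segmentation fault reported here, while the 1x1 case (256*512 elements) stays far below the int32 limit.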
YOLO shouldn't have a kernel as large as 128*128. Is there some misunderstanding? With weight.shape=[256, 512, 128, 128], this single conv alone needs 8 GB of GPU memory.
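The 8 GB figure follows from the element count, assuming the weight is stored as float32 (4 bytes per element; the dtype is an assumption, since the thread does not state it). A minimal sketch, with tensor_bytes as a made-up helper name:

```python
from math import prod

BYTES_PER_FLOAT32 = 4  # assumption: weights are stored as float32


def tensor_bytes(shape, bytes_per_element=BYTES_PER_FLOAT32):
    """Bytes needed to store a dense tensor of the given shape (hypothetical helper)."""
    return prod(shape) * bytes_per_element


large_gib = tensor_bytes((256, 512, 128, 128)) / 2**30  # 128x128 kernel
small_mib = tensor_bytes((256, 512, 1, 1)) / 2**20      # 1x1 kernel

print(f"128x128 kernel weight: {large_gib:.1f} GiB")
print(f"1x1 kernel weight:     {small_mib:.1f} MiB")
```

This counts the weight only. Training would also need a gradient of the same size, so the real footprint would be larger.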
I compared it with other repositories, and it is most likely a problem with the code.
I ran the conv2d Python unit tests today and they failed. It looks like mainly a size mismatch problem. The two main error messages are:
RuntimeError: Error(s) in loading state_dict for Conv2d:
size mismatch for weight: copying a param with shape (4, 4, 2, 2) from checkpoint, the shape in current model is oneflow.Size([4, 2, 2, 4]).
ValueError: The input channels 5 should be equal to self.in_channels 1.
The test command was:
python3 ./python/oneflow/test/modules/test_conv2d.py --verbose
I don't know whether this is caused by the same problem as this issue, but it may be a useful reference, so I'm posting it here for now.
This error no longer occurs; please try again. The only remaining difference is that some conv2d results differ in precision because of TF32.
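For context on the size of such differences: TF32 keeps 10 explicit mantissa bits, against 23 in float32. The sketch below approximates that by clearing the low 13 mantissa bits of a float32 value. It is a simplification for illustration (plain truncation, one scalar, no accumulation), not how OneFlow or cuDNN compute a convolution, and truncate_to_tf32 is a made-up helper name.

```python
import math
import struct


def truncate_to_tf32(x):
    """Approximate TF32 by clearing the low 13 mantissa bits of a float32 (hypothetical helper)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits &= 0xFFFFE000  # keep sign, 8 exponent bits, top 10 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


x = 1.0 + 2**-12  # exactly representable in float32, but not with 10 mantissa bits
approx = truncate_to_tf32(x)
rel_err = abs(approx - x) / x

print(approx)
print(rel_err)
print(math.isclose(approx, x, rel_tol=1e-3))  # a loose tolerance accepts the result
print(math.isclose(approx, x, rel_tol=1e-6))  # a float32-level tolerance does not
```

When comparing conv2d outputs across frameworks on hardware that uses TF32, a relative tolerance around 1e-3 is therefore a more realistic starting point than one tuned for full float32. The exact value depends on the workload and is an assumption here.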