
Convolutions with negative padding segfault... sometimes

Open Sleort opened this issue 6 years ago • 6 comments

Sometimes a convolution with negative padding fails (apparently only for Float32 inputs?), and sometimes it doesn't...

This works:

using Flux
model = Conv((2,2), 1=>1, pad=(-1,-1))
x32 = rand(Float32, 10, 10, 1, 1)
x64 = rand(Float64, 10, 10, 1, 1)
model(x32)
model(x64)

and this works:

x64 = rand(Float64, 20, 20, 1, 1)
model(x64)

but this fails (segfaults):

x32 = rand(Float32, 20, 20, 1, 1)
model(x32)

with the following error message:

signal (11): Segmentation fault
in expression starting at no file:0
sgemm_itcopy_HASWELL at /home/troels/packages/julias/julia-1.1.1/bin/../lib/julia/libopenblas64_.so (unknown line)
sgemm_nn at /home/troels/packages/julias/julia-1.1.1/bin/../lib/julia/libopenblas64_.so (unknown line)
sgemm_64_ at /home/troels/packages/julias/julia-1.1.1/bin/../lib/julia/libopenblas64_.so (unknown line)
gemm! at /home/troels/.julia/packages/NNlib/mxWRT/src/gemm.jl:49 [inlined]
macro expansion at /home/troels/.julia/packages/TimerOutputs/7zSea/src/TimerOutput.jl:230 [inlined]
macro expansion at /home/troels/.julia/packages/NNlib/mxWRT/src/impl/conv_im2col.jl:57 [inlined]
macro expansion at ./gcutils.jl:87 [inlined]
macro expansion at /home/troels/.julia/packages/NNlib/mxWRT/src/impl/conv_im2col.jl:53 [inlined]
#conv_im2col!#231 at /home/troels/.julia/packages/TimerOutputs/7zSea/src/TimerOutput.jl:190
unknown function (ip: 0x7f5b42ed3f2a)
jl_fptr_trampoline at /buildworker/worker/package_linux64/build/src/gf.c:1842
jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2348
conv_im2col! at /home/troels/.julia/packages/TimerOutputs/7zSea/src/TimerOutput.jl:198 [inlined]
macro expansion at /home/troels/.julia/packages/NNlib/mxWRT/src/conv.jl:51 [inlined]
#conv!#37 at /home/troels/.julia/packages/TimerOutputs/7zSea/src/TimerOutput.jl:190 [inlined]
conv! at /home/troels/.julia/packages/TimerOutputs/7zSea/src/TimerOutput.jl:198
unknown function (ip: 0x7f5b42ed2462)
jl_fptr_trampoline at /buildworker/worker/package_linux64/build/src/gf.c:1842
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2197
#conv!#56 at /home/troels/.julia/packages/NNlib/mxWRT/src/conv.jl:68
unknown function (ip: 0x7f5b42ed19e6)
jl_fptr_trampoline at /buildworker/worker/package_linux64/build/src/gf.c:1842
jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2348
conv! at /home/troels/.julia/packages/NNlib/mxWRT/src/conv.jl:68
unknown function (ip: 0x7f5b42ed1332)
jl_fptr_trampoline at /buildworker/worker/package_linux64/build/src/gf.c:1842
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2197
macro expansion at /home/troels/.julia/packages/NNlib/mxWRT/src/conv.jl:114 [inlined]
#conv#97 at /home/troels/.julia/packages/TimerOutputs/7zSea/src/TimerOutput.jl:190
unknown function (ip: 0x7f5b42ed04c2)
jl_fptr_trampoline at /buildworker/worker/package_linux64/build/src/gf.c:1842
jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2348
#_forward#524 at /home/troels/.julia/packages/TimerOutputs/7zSea/src/TimerOutput.jl:198 [inlined]
_forward at ./none:0
unknown function (ip: 0x7f5b42ed021d)
jl_fptr_trampoline at /buildworker/worker/package_linux64/build/src/gf.c:1842
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2197
jl_apply at /buildworker/worker/package_linux64/build/src/julia.h:1571 [inlined]
jl_f__apply at /buildworker/worker/package_linux64/build/src/builtins.c:556
#track#1 at /home/troels/.julia/packages/Tracker/RRYy6/src/Tracker.jl:51
jl_fptr_trampoline at /buildworker/worker/package_linux64/build/src/gf.c:1842
jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2348
track at /home/troels/.julia/packages/Tracker/RRYy6/src/Tracker.jl:51 [inlined]
#conv#522 at /home/troels/.julia/packages/Tracker/RRYy6/src/lib/array.jl:419 [inlined]
conv at /home/troels/.julia/packages/Tracker/RRYy6/src/lib/array.jl:419 [inlined]
Conv at /home/troels/.julia/packages/Flux/qXNjB/src/layers/conv.jl:55
jl_fptr_trampoline at /buildworker/worker/package_linux64/build/src/gf.c:1842
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2197
do_call at /buildworker/worker/package_linux64/build/src/interpreter.c:323
eval_value at /buildworker/worker/package_linux64/build/src/interpreter.c:411
eval_stmt_value at /buildworker/worker/package_linux64/build/src/interpreter.c:362 [inlined]
eval_body at /buildworker/worker/package_linux64/build/src/interpreter.c:773
jl_interpret_toplevel_thunk_callback at /buildworker/worker/package_linux64/build/src/interpreter.c:885
unknown function (ip: 0xfffffffffffffffe)
unknown function (ip: 0x7f5b5c91a94f)
unknown function (ip: 0xffffffffffffffff)
jl_interpret_toplevel_thunk at /buildworker/worker/package_linux64/build/src/interpreter.c:894
jl_toplevel_eval_flex at /buildworker/worker/package_linux64/build/src/toplevel.c:764
jl_toplevel_eval_in at /buildworker/worker/package_linux64/build/src/toplevel.c:793
eval at ./boot.jl:328
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2197
eval_user_input at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.1/REPL/src/REPL.jl:85
run_backend at /home/troels/.julia/packages/Revise/agmgx/src/Revise.jl:949
#75 at ./task.jl:259
jl_fptr_trampoline at /buildworker/worker/package_linux64/build/src/gf.c:1842
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2197
jl_apply at /buildworker/worker/package_linux64/build/src/julia.h:1571 [inlined]
start_task at /buildworker/worker/package_linux64/build/src/task.c:572
unknown function (ip: 0xffffffffffffffff)
Allocations: 40188553 (Pool: 40182088; Big: 6465); GC: 81
Segmentation fault (core dumped)

Depending on the kernel size, the size of the input array, and whether the padding is symmetric, I get different error messages/crashes (double free or corruption (!prev), corrupted size vs. prev_size, malloc(): smallbin double linked list corrupted, malloc_consolidate(): invalid chunk size, ...). Typically, the convolution crashes once the input array grows beyond some small size. So far, the issue only seems to affect Float32 inputs.
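For reference, something like the following sweep can locate the crash threshold (each size past the threshold needs a fresh Julia session, since the segfault kills the process):

using Flux

model = Conv((2,2), 1=>1, pad=(-1,-1))
for n in (4, 8, 10, 12, 16, 20, 24)
    x = rand(Float32, n, n, 1, 1)
    # Crashes the whole process at the first bad size, so run sizes one at a time.
    println("n = $n: output size ", size(model(x)))
end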

I'm on Julia 1.1.1, Flux 0.8.3, NNlib 0.6.0 (and Ubuntu 19.04).

Sleort · Jun 01 '19 13:06

Okay, so I tried to see if I could fix this issue myself (as a fix would be very useful to my work). In short, it seems to be an indexing problem.

Looking through the code, I initially figured it would be sufficient to modify calc_padding_regions() (in NNlib.jl/src/impl/padding_edges.jl). Namely, change

calc_lo_spill(O, S, P) = min(ceil(Int, P/S), O)
@inline function calc_hi_spill(O, S, Pl, Ph, K, D, I)
    wasted_Ph = (I + Pl + Ph - (K - 1)*D - 1)%S
    return min(ceil(Int, (Ph - wasted_Ph)/S), O)
end

to

calc_lo_spill(O, S, P) = max(min(ceil(Int, P/S), O), 0)
@inline function calc_hi_spill(O, S, Pl, Ph, K, D, I)
    wasted_Ph = (I + Pl + Ph - (K - 1)*D - 1)%S
    return max(min(ceil(Int, (Ph - wasted_Ph)/S), O), 0)
end
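To see why the clamp matters: for any negative padding P, ceil(Int, P/S) is negative, so calc_lo_spill returns a negative "spill" extent, presumably feeding a negative region size to the im2col loops downstream. A quick illustrative check (17 being the output extent of the 20×20 repro with a 2×2 kernel and pad -1 on each side):

# Original definition, as above:
calc_lo_spill(O, S, P) = min(ceil(Int, P/S), O)
calc_lo_spill(17, 1, -1)   # -1: a nonsensical negative region size

# Proposed clamped version:
calc_lo_spill_fixed(O, S, P) = max(min(ceil(Int, P/S), O), 0)
calc_lo_spill_fixed(17, 1, -1)   # 0: negative padding contributes no spill region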

However, when I went to test this, I realized that NNlib.jl now defaults to NNPACK (I'm on Linux), which means that negative padding now throws a NNPACKError(code 12, NNPACK STATUS INVALID INPUT PADDING) instead. NNPACK is very much a black box to me, so... does anyone (@staticfloat, @avik-pal?) have a suggestion for how to solve this (in a hopefully performant manner)?

(Yes, I know I can crop the input array "manually", but that seems like a much less elegant/performant solution than using negative padding...)
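For concreteness, the cropping workaround would look something like the sketch below, with pad=(-1,-1) traded for slicing one element off every border; the layer here is illustrative and would need the same weights to give identical values:

using Flux

model0 = Conv((2,2), 1=>1)            # default pad=0
crop(x) = x[2:end-1, 2:end-1, :, :]   # drop one row/column per border; use @view to avoid the copy

x = rand(Float32, 20, 20, 1, 1)
size(model0(crop(x)))                 # (17, 17, 1, 1), matching what pad=(-1,-1) should give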

Sleort · Jun 10 '19 09:06

Currently there is no way to opt out of NNPACK, and NNPACK does not support negative padding, so a check for this would be the ideal solution. For the time being you could dev the package and change these lines to is_nnpack_available() = false.
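A sketch of that stopgap without deving the package, by redefining the zero-argument method from the REPL (a global monkey-patch, so only a temporary measure):

using NNlib

# Shadow the availability check so conv never dispatches to NNPACK.
NNlib.is_nnpack_available() = false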

avik-pal · Jun 10 '19 09:06

I will add this to my TODO: I have a cleanup planned to make it easier to integrate the blocked convolution code in https://github.com/FluxML/NNlib.jl/pull/97. It will be possible to disable NNPACK for negative-padded convolutions, as well as to disable NNPACK entirely, if you so choose.

@Sleort, in the meantime, you can, of course, just use conv_im2col instead of conv in your code. If you're using Flux, you can do this with the following override:

using Flux

Core.eval(Flux, quote
    function (c::Conv)(x::AbstractArray)
        σ, b = c.σ, reshape(c.bias, map(_->1, c.stride)..., :, 1)
        cdims = DenseConvDims(x, c.weight; stride=c.stride, padding=c.pad, dilation=c.dilation)

        # The only change from Flux's own definition: `conv` → `NNlib.conv_im2col`
        σ.(NNlib.conv_im2col(x, c.weight, cdims) .+ b)
    end
end)
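With that override evaluated, the repro goes straight through the pure-Julia im2col path. Note that im2col is the path that segfaulted in the first place, so this alone only routes around the NNPACKError; the calc_padding_regions clamp from the earlier comment is still needed for the crash itself. A usage sketch:

model = Conv((2,2), 1=>1, pad=(-1,-1))
x32 = rand(Float32, 20, 20, 1, 1)
model(x32)   # reaches conv_im2col; still crashes unless the spill clamp is applied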

staticfloat · Jun 10 '19 20:06

@avik-pal I see. Yes, I think there should at least be a check here. My test case sometimes runs (for smaller input sizes, maybe conv_direct.jl is used?) and sometimes not, which is quite confusing when you don't know the internals of the library.

@staticfloat Thanks! And thanks for the tip!

Sleort · Jun 11 '19 00:06

Recently we've made NNPACK no longer the default, so it would be good to revisit this as @staticfloat mentioned.

DhairyaLGandhi · Feb 24 '20 08:02

My work so far is here: https://github.com/FluxML/NNlib.jl/pull/163 but it needs to be rebased.

staticfloat · Feb 25 '20 09:02