[Bug] Segmentation fault (core dumped) when executing inference
Actual behavior
Segmentation fault (core dumped)
Steps to reproduce
import tvm
from tvm import relax
import numpy as np
from tvm.script import ir as I
from tvm.script import tir as T
from tvm.script import relax as R
@I.ir_module
class Module:
    @T.prim_func(private=True)
    def layer_norm(A: T.Buffer((T.int64(1), T.int64(512), T.int64(64), T.int64(64)), "float32"), gamma: T.Buffer((T.int64(64), T.int64(64)), "float32"), beta: T.Buffer((T.int64(64), T.int64(64)), "float32"), T_layer_norm: T.Buffer((T.int64(1), T.int64(512), T.int64(64), T.int64(64)), "float32")):
        T.func_attr({"op_pattern": 4})
        # with T.block("root"):
        rxplaceholder_red_temp_v0 = T.alloc_buffer((T.int64(64), T.int64(64)))
        rxplaceholder_red_temp_v1 = T.alloc_buffer((T.int64(64), T.int64(64)))
        for i0, i1, i2, i3 in T.grid(T.int64(1), T.int64(512), T.int64(64), T.int64(64)):
            with T.block("rxplaceholder_red_temp"):
                ax0, ax1, k2, k3 = T.axis.remap("SSRR", [i0, i1, i2, i3])
                T.reads(A[ax0, ax1, k2, k3])
                T.writes(rxplaceholder_red_temp_v0[ax0, ax1], rxplaceholder_red_temp_v1[ax0, ax1])
                with T.init():
                    rxplaceholder_red_temp_v0[ax0, ax1] = T.float32(0)
                    rxplaceholder_red_temp_v1[ax0, ax1] = T.float32(0)
                v_rxplaceholder_red_temp_v0: T.float32 = rxplaceholder_red_temp_v0[ax0, ax1] + A[ax0, ax1, k2, k3]
                v_rxplaceholder_red_temp_v1: T.float32 = rxplaceholder_red_temp_v1[ax0, ax1] + A[ax0, ax1, k2, k3] * A[ax0, ax1, k2, k3]
                rxplaceholder_red_temp_v0[ax0, ax1] = v_rxplaceholder_red_temp_v0
                rxplaceholder_red_temp_v1[ax0, ax1] = v_rxplaceholder_red_temp_v1
        for i0, i1, i2, i3 in T.grid(T.int64(1), T.int64(512), T.int64(64), T.int64(64)):
            with T.block("T_layer_norm"):
                ax0, ax1, ax2, ax3 = T.axis.remap("SSSS", [i0, i1, i2, i3])
                T.reads(A[ax0, ax1, ax2, ax3], rxplaceholder_red_temp_v0[ax0, ax1], rxplaceholder_red_temp_v1[ax0, ax1], gamma[ax2, ax3], beta[ax2, ax3])
                T.writes(T_layer_norm[ax0, ax1, ax2, ax3])
                T_layer_norm[ax0, ax1, ax2, ax3] = (A[ax0, ax1, ax2, ax3] - rxplaceholder_red_temp_v0[ax0, ax1] * T.float32(0.050000000000000003)) * T.rsqrt(rxplaceholder_red_temp_v1[ax0, ax1] * T.float32(0.050000000000000003) - rxplaceholder_red_temp_v0[ax0, ax1] * T.float32(0.050000000000000003) * (rxplaceholder_red_temp_v0[ax0, ax1] * T.float32(0.050000000000000003)) + T.float32(1.0000000000000001e-05)) * gamma[ax2, ax3] + beta[ax2, ax3]

    @T.prim_func(private=True)
    def relu(A: T.Buffer((T.int64(1), T.int64(512), T.int64(64), T.int64(64)), "float32"), B: T.Buffer((T.int64(1), T.int64(512), T.int64(64), T.int64(64)), "float32")):
        T.func_attr({"op_pattern": 0})
        # with T.block("root"):
        for i0, i1, i2, i3 in T.grid(T.int64(1), T.int64(512), T.int64(64), T.int64(64)):
            with T.block("relu"):
                v_i0, v_i1, v_i2, v_i3 = T.axis.remap("SSSS", [i0, i1, i2, i3])
                T.reads(A[v_i0, v_i1, v_i2, v_i3])
                T.writes(B[v_i0, v_i1, v_i2, v_i3])
                B[v_i0, v_i1, v_i2, v_i3] = T.max(A[v_i0, v_i1, v_i2, v_i3], T.float32(0))

    @R.function(private=True)
    def fused_layer_norm_relu(x: R.Tensor((1, 512, 64, 64), dtype="float32"), mean: R.Tensor((64, 64), dtype="float32"), var: R.Tensor((64, 64), dtype="float32")) -> R.Tensor((1, 512, 64, 64), dtype="float32"):
        R.func_attr({"Primitive": 1})
        cls = Module
        with R.dataflow():
            gv0 = R.call_tir(cls.layer_norm, (x, mean, var), out_sinfo=R.Tensor((1, 512, 64, 64)))
            gv = R.call_tir(cls.relu, (gv0,), out_sinfo=R.Tensor((1, 512, 64, 64), dtype="float32"))
            R.output(gv)
        return gv

    @R.function
    def main(x: R.Tensor((1, 512, 64, 64), dtype="float32"), mean: R.Tensor((64, 64), dtype="float32"), var: R.Tensor((64, 64), dtype="float32")) -> R.Tensor((1, 512, 64, 64), dtype="float32"):
        cls = Module
        with R.dataflow():
            gv: R.Tensor((1, 512, 64, 64), dtype="float32") = cls.fused_layer_norm_relu(x, mean, var)
            R.output(gv)
        return gv
mod = Module
mod = relax.transform.FuseTIR()(mod)
def compile_mod(mod, func_name, target, *inputs):
    ex = relax.build(mod, target=target)
    vm = relax.VirtualMachine(ex, tvm.cpu())
    mod_outputs = vm[f'{func_name}'](*inputs)  # segfault

input_0 = tvm.nd.array(10 * np.random.random([1, 512, 64, 64]).astype('float32'))
input_1 = tvm.nd.array(10 * np.random.random([64, 64]).astype('float32'))
input_2 = tvm.nd.array(10 * np.random.random([64, 64]).astype('float32'))
compile_mod(mod, 'main', 'llvm', input_0, input_1, input_2)
CC @Lunderberg @vinx13
Looks like a bug in your layer_norm implementation. rxplaceholder_red_temp_v0 and rxplaceholder_red_temp_v1 both have shape [64, 64], but are being accessed at indices [0:1, 0:512]. Are these buffers intended to have shape [1, 512]?
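For illustration, a minimal sketch of that fix, assuming the reduction buffers are indeed meant to be indexed by the spatial axes ax0 and ax1:

# Hypothetical fix: size the reduction buffers to match the spatial axes
# they are indexed by, i.e. (1, 512) rather than (64, 64).
rxplaceholder_red_temp_v0 = T.alloc_buffer((T.int64(1), T.int64(512)))
rxplaceholder_red_temp_v1 = T.alloc_buffer((T.int64(1), T.int64(512)))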
@Lunderberg Indeed, the indices are out of bounds. However, a segmentation fault is dangerous behavior and is often considered a vulnerability. Do we need an isolation mechanism that checks validity at the TIR level?
True. There is a very old mechanism that uses the "tir.instrument_bound_checkers" config option to add buffer bounds checks, but its code path is only used for TIR functions produced from TE schedules. The annotations it provides haven't been useful since #9727 a few years ago, as all BufferLoad and BufferStore instances have sufficient information to do bounds-checking anyway. There's a newer tir::transform::OOBChecker(), introduced in #12352, that does better bounds checking, but it doesn't seem to be used anywhere.
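For concreteness, a sketch of the legacy option using the standard PassContext config interface; per the caveat above, it only takes effect for TIR produced from TE schedules, so it would not have caught this Relax case:

# Legacy bounds instrumentation (sketch); only effective for TE-scheduled TIR.
with tvm.transform.PassContext(config={"tir.instrument_bound_checkers": True}):
    ex = relax.build(mod, target="llvm")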
I like the idea of having this check applied by default. Placing it either at the beginning or end of the TIR lowering pipeline would have caught this error during compilation.
@Lunderberg Thanks! tvm.tir.analysis.OOBChecker() is very useful. I like it! It successfully identifies the OOB issue and avoids the segfault. But it is disabled by default. Why not enable it by default?
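For reference, a minimal sketch of running the checker over the module before building, based on its use as a module-level pass (exact behavior may vary across TVM versions):

# Run the out-of-bounds checker first; it raises a compile-time error on the
# invalid accesses instead of letting the build proceed to a runtime segfault.
tvm.tir.analysis.OOBChecker()(mod)
ex = relax.build(mod, target="llvm")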
Overall, there's no significant reason not to. There may be some initial failures in cases where the buffer size is unknown and a shape of [1] is used as a fallback, but those should probably be fixed anyway.