onnx-mlir
Use OGE instead of OLT for ONNXReluOp
This patch uses "greater than or equal to zero" (OGE) instead of "less than zero" (OLT) in the lowering of ONNXReluOp. Mathematically, they are the same, but in practice the OLT form made valgrind unhappy when compiling a model with -O3 (there is no issue with -O{0,1,2}). I have not yet found the root cause, but at least this patch makes valgrind happy for now.
Without this patch, when compiling inception-v1-12 with -O3 and running inference under valgrind --leak-check=yes --leak-check=full --show-leak-kinds=all --track-origins=yes ./a.out, valgrind reported this error and terminated the inference:
valgrind: m_mallocfree.c:305 (get_bszB_as_is): Assertion 'bszB_lo == bszB_hi' failed.
valgrind: Heap block lo/hi size mismatch: lo = 486241, hi = 3471210084346180918.
This is probably caused by your program erroneously writing past the
end of a heap block and corrupting heap metadata. If you fix any
invalid writes reported by Memcheck, this assertion failure will
probably go away. Please try that before reporting this as a bug.
host stacktrace:
==7== at 0x80004BF4C: ??? (in /usr/libexec/valgrind/memcheck-s390x-linux)
==7== by 0x80004C135: ??? (in /usr/libexec/valgrind/memcheck-s390x-linux)
==7== by 0x80004C275: ??? (in /usr/libexec/valgrind/memcheck-s390x-linux)
==7== by 0x8000578C5: ??? (in /usr/libexec/valgrind/memcheck-s390x-linux)
==7== by 0x800040507: ??? (in /usr/libexec/valgrind/memcheck-s390x-linux)
==7== by 0x80003F3D3: ??? (in /usr/libexec/valgrind/memcheck-s390x-linux)
==7== by 0x8000447F5: ??? (in /usr/libexec/valgrind/memcheck-s390x-linux)
==7== by 0x80003E701: ??? (in /usr/libexec/valgrind/memcheck-s390x-linux)
==7== by 0x1009431B3B: ???
sched status:
running_tid=1
Thread 1: status = VgTs_Runnable (lwpid 7)
==7== at 0x50A1506: main_graph (in /model/CPU/inception-v1-12.so)
==7== by 0x50B1075: run_main_graph (in /model/CPU/inception-v1-12.so)
==7== by 0x10053CF: ModelLib::runMainGraph(OMTensorList*) (ModelLib.h:34)
==7== by 0x10041BD: main (modelzoo.cpp:253)
client stack range: [0x1FFEFFE000 0x1FFF000FFF] client SP: 0x1FFF000498
valgrind stack range: [0x1008C9E000 0x1008D9DFFF] top usage: 13080 of 1048576
With this patch, valgrind is happy:
==3658939== Memcheck, a memory error detector
==3658939== Copyright (C) 2002-2022, and GNU GPL'd, by Julian Seward et al.
==3658939== Using Valgrind-3.19.0 and LibVEX; rerun with -h for copyright info
==3658939== Command: ./a.out
==3658939==
==3658939== HEAP SUMMARY:
==3658939== in use at exit: 0 bytes in 0 blocks
==3658939== total heap usage: 176 allocs, 176 frees, 37,445,480 bytes allocated
==3658939==
==3658939== All heap blocks were freed -- no leaks are possible
==3658939==
==3658939== For lists of detected and suppressed errors, rerun with: -s
==3658939== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
Signed-off-by: Tung D. Le [email protected]
This patch uses greater than or equal to zero instead of less than zero in the lowering of ONNXReluOp. Mathematically, they are the same, but in practice it made valgrind unhappy when compiling a model with -O3 (there is no issue with -O{0,1,2}). I have not yet found the root cause, but at least this patch makes valgrind happy for now.
Remember that in practice if you have a variable x that's declared unsigned, then x < 0 will never return true even if you do x = -5 (which will just give x a big positive number).
valgrind: m_mallocfree.c:305 (get_bszB_as_is): Assertion 'bszB_lo == bszB_hi' failed. valgrind: Heap block lo/hi size mismatch: lo = 486241, hi = 3471210084346180918.
3471210084346180918 is 0x302C372E32353536, which doesn't look like a number but rather like an ASCII string. In fact, it decodes to "0,7.2556". So, as valgrind suggested, somewhere the code wrote "0,7.2556" past the end of a malloc-ed memory block and corrupted the malloc metadata of the next block.
Remember that in practice if you have a variable x that's declared unsigned, then x < 0 will never return true even if you do x = -5 (which will just give x a big positive number).
Yes, that's true. In this model, the data are floats, so I guess that was not the case here.
Another thing: if I use ULT (unordered less than) instead of OLT (ordered less than), it also works. @gongsu832 do you think it is something related to NaN values, or something else?
@tungld It's difficult to guess what causes a memory overrun. Valgrind should have caught the bad write and put out some diagnostic info. If you just run valgrind ./a.out, what output do you see? Also, how did you produce the a.out? If you share the driver code (or build instructions), I can try to reproduce the problem.
@gongsu832 thanks!
This is the driver code, say c-driver.cpp:
#include "OnnxMlirRuntime.h"
#include <cstdlib>
#include <iostream>

extern "C" OMTensorList *run_main_graph(OMTensorList *);
extern "C" OMTensor *omTensorCreateWithOwnership(
    void *, int64_t *, int64_t, OM_DATA_TYPE, int64_t);
extern "C" OMTensorList *omTensorListCreateWithOwnership(
    OMTensor **, int64_t, int64_t);
extern "C" void omTensorListDestroy(OMTensorList *list);

int main(int argc, char **argv) {
  // Init inputs.
  int64_t shape[4] = {1, 3, 224, 224};
  int64_t rank = 4;
  int64_t size = 1;
  for (int64_t i = 0; i < rank; ++i)
    size *= shape[i];
  float *data = (float *)malloc(size * sizeof(float));
  for (int64_t i = 0; i < size; ++i)
    data[i] = 1.0;
  OMTensor *tensor =
      omTensorCreateWithOwnership(data, shape, rank, ONNX_TYPE_FLOAT, true);
  int64_t inputNum = 1;
  OMTensor **inputTensors = (OMTensor **)malloc(inputNum * sizeof(OMTensor *));
  inputTensors[0] = tensor;
  OMTensorList *tensorListIn =
      omTensorListCreateWithOwnership(inputTensors, inputNum, true);

  // Call the compiled onnx model function.
  std::cout << "Start computing" << std::endl;
  OMTensorList *tensorListOut = run_main_graph(tensorListIn);
  std::cout << "Finish computing" << std::endl;

  // Cleanup.
  omTensorListDestroy(tensorListOut);
  omTensorListDestroy(tensorListIn);
  return 0;
}
This is a script, run.sh, to build the model and run inference with valgrind:
#!/bin/sh
MODEL=inception-v1-12
rm -f ${MODEL}.so a.out
echo "Compile the model ${MODEL} ..."
/home/tungld/dl/onnx-mlir/build/Debug/bin/onnx-mlir -O3 --mcpu=z14 ${MODEL}.onnx
echo "Compile the driver ..."
g++ c-driver.cpp ${MODEL}.so -I/home/tungld/dl/onnx-mlir/include -lcruntime -L/home/tungld/dl/onnx-mlir/build/Debug/lib
echo "run inference ..."
/home/tungld/dl/valgrind/bin/valgrind --leak-check=yes --leak-check=full --show-leak-kinds=all --track-origins=yes ./a.out
You can download the inception model from here: https://github.com/onnx/models/blob/main/vision/classification/inception_and_googlenet/inception_v1/model/inception-v1-12.onnx. Then, run run.sh. Please make sure to use valgrind version 3.19.0.
@tungld I was able to reproduce the problem and here is what I found out.
valgrind actually reports a lot of invalid writes like the following:
==16186== Invalid write of size 4
==16186== at 0x486006A: main_graph (in /workdir/inception/inception-v1-12.so)
==16186== by 0x4870045: run_main_graph (in /workdir/inception/inception-v1-12.so)
==16186== by 0x1093AB: main (inception.cpp:38)
==16186== Address 0x7ba0370 is 64 bytes inside an unallocated block of size 7,344 in arena "client"
==16186==
==16186== Invalid write of size 4
==16186== at 0x4860076: main_graph (in /workdir/inception/inception-v1-12.so)
==16186== by 0x4870045: run_main_graph (in /workdir/inception/inception-v1-12.so)
==16186== by 0x1093AB: main (inception.cpp:38)
==16186== Address 0x7ba0400 is 208 bytes inside an unallocated block of size 7,344 in arena "client"
==16186==
==16186== Invalid write of size 4
==16186== at 0x486007C: main_graph (in /workdir/inception/inception-v1-12.so)
==16186== by 0x4870045: run_main_graph (in /workdir/inception/inception-v1-12.so)
==16186== by 0x1093AB: main (inception.cpp:38)
==16186== Address 0x7ba0490 is 352 bytes inside an unallocated block of size 7,344 in arena "client"
These bad writes are from the vstef (Vector Store Element) instructions in the generated .so:
15040: e7 1e d0 00 20 05 vlrepf %v1,0(%r14,%r13)
15046: e7 14 d0 00 10 03 vlef %v1,0(%r4,%r13),1
1504c: e7 15 d0 00 20 03 vlef %v1,0(%r5,%r13),2
15052: e7 13 d0 00 30 03 vlef %v1,0(%r3,%r13),3
15058: e7 11 00 10 20 ef vfmaxsb %v1,%v1,%v0,1
1505e: e3 2c b9 00 00 36 pfd 2,2304(%r12,%r11)
15064: e7 1e b0 00 00 0b vstef %v1,0(%r14,%r11),0
1506a: e7 14 b0 00 10 0b vstef %v1,0(%r4,%r11),1 <=== bad write at 0x486006A
15070: e3 2c ba 20 00 36 pfd 2,2592(%r12,%r11)
15076: e7 15 b0 00 20 0b vstef %v1,0(%r5,%r11),2 <=== bad write at 0x4860076
1507c: e7 13 b0 00 30 0b vstef %v1,0(%r3,%r11),3 <=== bad write at 0x486007C
15082: e7 1e d0 04 20 05 vlrepf %v1,4(%r14,%r13)
15088: e7 14 d0 04 10 03 vlef %v1,4(%r4,%r13),1
1508e: b9 04 00 95 lgr %r9,%r5
15092: a5 9b 00 10 oill %r9,16
15096: e7 15 d0 04 20 03 vlef %v1,4(%r5,%r13),2
1509c: 41 20 20 04 la %r2,4(%r2)
150a0: 41 10 10 01 la %r1,1(%r1)
150a4: e7 2e d0 08 20 05 vlrepf %v2,8(%r14,%r13)
After applying your PR, the code around the same offset becomes:
1503e: ed 14 12 98 00 24 lde %f1,664(%r4,%r1)
15044: b3 02 00 21 ltebr %f2,%f1
15048: a7 54 00 79 jnhe 1513a <main_graph+0x13aaa>
1504c: 70 14 22 98 ste %f1,664(%r4,%r2)
15050: ed 14 12 9c 00 24 lde %f1,668(%r4,%r1)
15056: b3 02 00 21 ltebr %f2,%f1
1505a: a7 54 00 7a jnhe 1514e <main_graph+0x13abe>
1505e: 70 14 22 9c ste %f1,668(%r4,%r2)
15062: ed 14 12 a0 00 24 lde %f1,672(%r4,%r1)
15068: b3 02 00 21 ltebr %f2,%f1
1506c: a7 54 00 7b jnhe 15162 <main_graph+0x13ad2>
15070: 70 14 22 a0 ste %f1,672(%r4,%r2)
15074: ed 14 12 a4 00 24 lde %f1,676(%r4,%r1)
1507a: b3 02 00 21 ltebr %f2,%f1
1507e: a7 a4 ff 86 jhe 14f8a <main_graph+0x138fa>
15082: a7 f4 00 7a j 15176 <main_graph+0x13ae6>
15086: 28 10 ldr %f1,%f0
15088: 70 14 22 74 ste %f1,628(%r4,%r2)
1508c: ed 14 12 78 00 24 lde %f1,632(%r4,%r1)
15092: b3 02 00 21 ltebr %f2,%f1
15096: a7 a4 ff 93 jhe 14fbc <main_graph+0x1392c>
1509a: 28 10 ldr %f1,%f0
1509c: 70 14 22 78 ste %f1,632(%r4,%r2)
150a0: ed 14 12 7c 00 24 lde %f1,636(%r4,%r1)
150a6: b3 02 00 21 ltebr %f2,%f1
So basically, your PR worked simply because vector instructions are no longer being generated. And I verified that with --mcpu=z13 there is no error reported by valgrind. So it seems that vector code generation has a problem: it sounds like certain size/shape calculations are wrong and elements are being written to bad memory locations.
@gongsu832 thank you for the investigation! It looks reasonable that it's related to vector instructions.
In onnx-mlir, so far only Gemm and Matmul explicitly use vector instructions when -O3 is enabled. This model has only one Gemm, near the end, and the error happened even when I removed it. So it was potentially caused by some optimization at a lower level, e.g. in LLVM. It would take me a long time to investigate further, e.g. to find a simple case that reproduces the issue. I would like to merge this first if there is no objection.
@tungld Do you know why your PR prevented vector instructions from being generated? I have no objection to merging the PR as long as we remember this is just a workaround, not a real fix.
Sounds like finding a small example is key to shipping this issue to the Z/LLVM folks. If isolated, it would be interesting to see if it's also a problem on P (maybe for the LLVM team to find out).
Do you know why your PR prevented vector instructions to be generated?
@gongsu832 I don't know. I removed operations in the model one by one until I found a Relu that caused the issue, then tried to edit that Relu. It should be some optimization at a very low level of the code.
Maybe you could try tuning without the mlir bundling of memory allocs, in case that is triggering the issue.
So, no idea why this helps, but it just works, correct?
So, no idea why this helps, but it just works, correct?
Yes, we haven't found the cause.
Jenkins Linux amd64 Build #7557 [push] Use OGE instead of OLT f... started at 21:53
Jenkins Linux s390x Build #7573 [push] Use OGE instead of OLT f... started at 22:53
Jenkins Linux ppc64le Build #6622 [push] Use OGE instead of OLT f... started at 22:55
Jenkins Linux amd64 Build #7557 [push] Use OGE instead of OLT f... passed after 1 hr 7 min
Jenkins Linux s390x Build #7573 [push] Use OGE instead of OLT f... passed after 1 hr 21 min
Jenkins Linux ppc64le Build #6622 [push] Use OGE instead of OLT f... passed after 1 hr 37 min