onnx-mlir Use OGE instead of OLT for ONNXReluOp

This patch uses greater than or equal to zero instead of less than zero in the lowering of ONNXReluOp. Mathematically, they are the same, but in practice it has caused valgrind unhappy when compling a model with -O3 (There is no issue with -O{0,1,2}). I have not yet known the root cause, but at lease this patch makes valgrind happy now.

Without this patch, when compiling inception-v1-12 using -O3 and running inference by using valgrind valgrind --leak-check=yes --leak-check=full --show-leak-kinds=all --track-origins=yes ./a.out, valgrind reported this error and terminated the inference:

valgrind: m_mallocfree.c:305 (get_bszB_as_is): Assertion 'bszB_lo == bszB_hi' failed.
valgrind: Heap block lo/hi size mismatch: lo = 486241, hi = 3471210084346180918.
This is probably caused by your program erroneously writing past the
end of a heap block and corrupting heap metadata.  If you fix any
invalid writes reported by Memcheck, this assertion failure will
probably go away.  Please try that before reporting this as a bug.


host stacktrace:
==7==    at 0x80004BF4C: ??? (in /usr/libexec/valgrind/memcheck-s390x-linux)
==7==    by 0x80004C135: ??? (in /usr/libexec/valgrind/memcheck-s390x-linux)
==7==    by 0x80004C275: ??? (in /usr/libexec/valgrind/memcheck-s390x-linux)
==7==    by 0x8000578C5: ??? (in /usr/libexec/valgrind/memcheck-s390x-linux)
==7==    by 0x800040507: ??? (in /usr/libexec/valgrind/memcheck-s390x-linux)
==7==    by 0x80003F3D3: ??? (in /usr/libexec/valgrind/memcheck-s390x-linux)
==7==    by 0x8000447F5: ??? (in /usr/libexec/valgrind/memcheck-s390x-linux)
==7==    by 0x80003E701: ??? (in /usr/libexec/valgrind/memcheck-s390x-linux)
==7==    by 0x1009431B3B: ???

sched status:
  running_tid=1

Thread 1: status = VgTs_Runnable (lwpid 7)
==7==    at 0x50A1506: main_graph (in /model/CPU/inception-v1-12.so)
==7==    by 0x50B1075: run_main_graph (in /model/CPU/inception-v1-12.so)
==7==    by 0x10053CF: ModelLib::runMainGraph(OMTensorList*) (ModelLib.h:34)
==7==    by 0x10041BD: main (modelzoo.cpp:253)
client stack range: [0x1FFEFFE000 0x1FFF000FFF] client SP: 0x1FFF000498
valgrind stack range: [0x1008C9E000 0x1008D9DFFF] top usage: 13080 of 1048576

With this patch, valgrind is happy now:

==3658939== Memcheck, a memory error detector
==3658939== Copyright (C) 2002-2022, and GNU GPL'd, by Julian Seward et al.
==3658939== Using Valgrind-3.19.0 and LibVEX; rerun with -h for copyright info
==3658939== Command: ./a.out
==3658939==
==3658939== HEAP SUMMARY:
==3658939==     in use at exit: 0 bytes in 0 blocks
==3658939==   total heap usage: 176 allocs, 176 frees, 37,445,480 bytes allocated
==3658939==
==3658939== All heap blocks were freed -- no leaks are possible
==3658939==
==3658939== For lists of detected and suppressed errors, rerun with: -s
==3658939== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)

Signed-off-by: Tung D. Le [email protected]

Jul 15 '22 06:07 tungld

This patch uses greater than or equal to zero instead of less than zero in the lowering of ONNXReluOp. Mathematically, they are the same, but in practice it has caused valgrind unhappy when compling a model with -O3 (There is no issue with -O{0,1,2}). I have not yet known the root cause, but at lease this patch makes valgrind happy now.

Remember that in practice if you have a variable x that's declared unsigned, then x < 0 will never return true even if you do x = -5 (which will just give x a big positive number).

valgrind: m_mallocfree.c:305 (get_bszB_as_is): Assertion 'bszB_lo == bszB_hi' failed. valgrind: Heap block lo/hi size mismatch: lo = 486241, hi = 3471210084346180918.

3471210084346180918 is 0x302C372E32353536, which doesn't look like a number but rather an ASCII string. In fact, it's 0,7.2556. So as suggested by valgrind, somewhere the code wrote 0,7.2556 past the end of the malloc-ed memory block and corrupted the malloc metadata of the next memory block.

Jul 15 '22 07:07 gongsu832

Remember that in practice if you have a variable x that's declared unsigned, then x < 0 will never return true even if you do x = -5 (which will just give x a big positive number).

Yes, it's true. In this model, data are floats, so I guess this was not the case.

Another thing is if I use ULT (unordered less than) instead of OLT(ordered less than), it works also. @gongsu832 do you think it is something related to NaN values, or something else?

Jul 15 '22 08:07 tungld

@tungld It's difficult to guess what causes a memory overrun. Valgrind should have caught the bad write and put out some diag info. If you just run valgrind ./a.out, what output do you see? Also, how did you produce the a.out? If you share the driver code (or build instructions), I can try to reproduce the problem.

Jul 15 '22 10:07 gongsu832

@gongsu832 thanks!

This is the driver code, say, c-driver.cpp:

#include "OnnxMlirRuntime.h"
#include <iostream>

extern "C" OMTensorList *run_main_graph(OMTensorList *);
extern "C" OMTensor *omTensorCreateWithOwnership(void *, int64_t *, int64_t, OM_DATA_TYPE, int64_t);
extern "C" OMTensorList *omTensorListCreateWithOwnership(OMTensor **, int64_t, int64_t);
extern "C" void omTensorListDestroy(OMTensorList *list);

int main(int argc, char **argv) {
  // Init inputs.
  int64_t shape[4] = {1, 3, 224, 224};
  int64_t rank = 4;
  int64_t size = 1;
  for (int64_t i = 0; i < rank; ++i)
    size *= shape[i];
  float *data = (float *)malloc(size * sizeof(float));
  for (int64_t i = 0; i < size; ++i)
    data[i] = 1.0;
  OMTensor *tensor =
      omTensorCreateWithOwnership(data, shape, rank, ONNX_TYPE_FLOAT, true);

  int64_t inputNum = 1;
  OMTensor **inputTensors = (OMTensor **)malloc(inputNum * sizeof(OMTensor *));
  inputTensors[0] = tensor;
  OMTensorList *tensorListIn =
      omTensorListCreateWithOwnership(inputTensors, inputNum, true);

  // Call the compiled onnx model function.
  std::cout << "Start computing " << std::endl;
  OMTensorList *tensorListOut = nullptr;
  tensorListOut = run_main_graph(tensorListIn);
  std::cout << "Finish computing " << std::endl;

  // Cleanup.
  omTensorListDestroy(tensorListOut);
  omTensorListDestroy(tensorListIn);
  return 0;
}

This is a script, run.sh, to build the model and run inference with valgrind:

#!/bin/sh

MODEL=inception-v1-12

rm ${MODEL}.so a.out

echo "Compile the model ${MODEL} ..."
/home/tungld/dl/onnx-mlir/build/Debug/bin/onnx-mlir -O3 --mcpu=z14 ${MODEL}.onnx

echo "Compile the driver ..."
g++ c-driver.cpp ${MODEL}.so -I/home/tungld/dl/onnx-mlir/include -lcruntime -L/home/tungld/dl/onnx-mlir/build/Debug/lib

echo "run inference ..."
/home/tungld/dl/valgrind/bin/valgrind --leak-check=yes --leak-check=full --show-leak-kinds=all --track-origins=yes ./a.out

You download the inception model from here: https://github.com/onnx/models/blob/main/vision/classification/inception_and_googlenet/inception_v1/model/inception-v1-12.onnx.

Then, run run.sh. Please make sure to use valgrind version 3.19.0.

Jul 15 '22 11:07 tungld

@tungld I was able to reproduce the problem and here is what I found out.

valgrind actually reports a lot of invalid writes like the following:

==16186== Invalid write of size 4
==16186==    at 0x486006A: main_graph (in /workdir/inception/inception-v1-12.so)
==16186==    by 0x4870045: run_main_graph (in /workdir/inception/inception-v1-12.so)
==16186==    by 0x1093AB: main (inception.cpp:38)
==16186==  Address 0x7ba0370 is 64 bytes inside an unallocated block of size 7,344 in arena "client"
==16186==
==16186== Invalid write of size 4
==16186==    at 0x4860076: main_graph (in /workdir/inception/inception-v1-12.so)
==16186==    by 0x4870045: run_main_graph (in /workdir/inception/inception-v1-12.so)
==16186==    by 0x1093AB: main (inception.cpp:38)
==16186==  Address 0x7ba0400 is 208 bytes inside an unallocated block of size 7,344 in arena "client"
==16186==
==16186== Invalid write of size 4
==16186==    at 0x486007C: main_graph (in /workdir/inception/inception-v1-12.so)
==16186==    by 0x4870045: run_main_graph (in /workdir/inception/inception-v1-12.so)
==16186==    by 0x1093AB: main (inception.cpp:38)
==16186==  Address 0x7ba0490 is 352 bytes inside an unallocated block of size 7,344 in arena "client"

These bad writes are from the vstef (Vector Store Element) instructions in the generated .so:

   15040:       e7 1e d0 00 20 05       vlrepf  %v1,0(%r14,%r13)
   15046:       e7 14 d0 00 10 03       vlef    %v1,0(%r4,%r13),1
   1504c:       e7 15 d0 00 20 03       vlef    %v1,0(%r5,%r13),2
   15052:       e7 13 d0 00 30 03       vlef    %v1,0(%r3,%r13),3
   15058:       e7 11 00 10 20 ef       vfmaxsb %v1,%v1,%v0,1
   1505e:       e3 2c b9 00 00 36       pfd     2,2304(%r12,%r11)
   15064:       e7 1e b0 00 00 0b       vstef   %v1,0(%r14,%r11),0
   1506a:       e7 14 b0 00 10 0b       vstef   %v1,0(%r4,%r11),1    <=== bad write at 0x486006A
   15070:       e3 2c ba 20 00 36       pfd     2,2592(%r12,%r11)
   15076:       e7 15 b0 00 20 0b       vstef   %v1,0(%r5,%r11),2    <=== bad write at 0x4860076
   1507c:       e7 13 b0 00 30 0b       vstef   %v1,0(%r3,%r11),3    <=== bad write at 0x486007C
   15082:       e7 1e d0 04 20 05       vlrepf  %v1,4(%r14,%r13)
   15088:       e7 14 d0 04 10 03       vlef    %v1,4(%r4,%r13),1
   1508e:       b9 04 00 95             lgr     %r9,%r5
   15092:       a5 9b 00 10             oill    %r9,16
   15096:       e7 15 d0 04 20 03       vlef    %v1,4(%r5,%r13),2
   1509c:       41 20 20 04             la      %r2,4(%r2)
   150a0:       41 10 10 01             la      %r1,1(%r1)
   150a4:       e7 2e d0 08 20 05       vlrepf  %v2,8(%r14,%r13)

After applying your PR, the code around the same offset becomes:

   1503e:       ed 14 12 98 00 24       lde     %f1,664(%r4,%r1)
   15044:       b3 02 00 21             ltebr   %f2,%f1
   15048:       a7 54 00 79             jnhe    1513a <main_graph+0x13aaa>
   1504c:       70 14 22 98             ste     %f1,664(%r4,%r2)
   15050:       ed 14 12 9c 00 24       lde     %f1,668(%r4,%r1)
   15056:       b3 02 00 21             ltebr   %f2,%f1
   1505a:       a7 54 00 7a             jnhe    1514e <main_graph+0x13abe>
   1505e:       70 14 22 9c             ste     %f1,668(%r4,%r2)
   15062:       ed 14 12 a0 00 24       lde     %f1,672(%r4,%r1)
   15068:       b3 02 00 21             ltebr   %f2,%f1
   1506c:       a7 54 00 7b             jnhe    15162 <main_graph+0x13ad2>
   15070:       70 14 22 a0             ste     %f1,672(%r4,%r2)
   15074:       ed 14 12 a4 00 24       lde     %f1,676(%r4,%r1)
   1507a:       b3 02 00 21             ltebr   %f2,%f1
   1507e:       a7 a4 ff 86             jhe     14f8a <main_graph+0x138fa>
   15082:       a7 f4 00 7a             j       15176 <main_graph+0x13ae6>
   15086:       28 10                   ldr     %f1,%f0
   15088:       70 14 22 74             ste     %f1,628(%r4,%r2)
   1508c:       ed 14 12 78 00 24       lde     %f1,632(%r4,%r1)
   15092:       b3 02 00 21             ltebr   %f2,%f1
   15096:       a7 a4 ff 93             jhe     14fbc <main_graph+0x1392c>
   1509a:       28 10                   ldr     %f1,%f0
   1509c:       70 14 22 78             ste     %f1,632(%r4,%r2)
   150a0:       ed 14 12 7c 00 24       lde     %f1,636(%r4,%r1)
   150a6:       b3 02 00 21             ltebr   %f2,%f1

So basically, your PR worked simply because vector instructions are no longer being generated. And I verified that with --mcpu=z13 there is no error reported by valgrind. So it seems that vector code generation has some problem. It sounds like certain size/shape calculations are wrong and elements are being written to bad memory locations.

Jul 16 '22 07:07 gongsu832

@gongsu832 thank you for the investigation! It looks reasonable that its related to vector instructions.

In onnx-mlir, so far only Gemm and Matmul are explicitly using vector if -O3 is enabled. This model has only one Gemm near the end and this error happened even when I removed Gemm. So it potentially was caused by some optimization at the lower level, e.g. LLVM. This would take me a long time to investigate it further, e.g. finding a simple case that can reproduce the issue. Perhaps, I would like to merge this first if there is no objection.

Jul 19 '22 09:07 tungld

@tungld Do you know why your PR prevented vector instructions to be generated? I have no objection to merge the PR as long as we remember this is just a workaround not a real fix.

Jul 19 '22 10:07 gongsu832

Sounds like finding a small example is key to potentially shipping this issue to the Z/LLVM folks. If isolated, would be interesting to see if its also a problem on P (maybe for the LLVM team to find out).

Jul 19 '22 17:07 AlexandreEichenberger

Do you know why your PR prevented vector instructions to be generated?

@gongsu832 I don't know. I removed operations in the model one by one until a Relu that caused the issue, then tried to edit Relu. Should be some optimization at a very low level of code.

Jul 20 '22 02:07 tungld

Maybe you could try tuning without the mlir bundling of memory allocs, in case that is triggering the issue.

Jul 20 '22 15:07 AlexandreEichenberger

So no idea of why this help, but it just works, correct?

Sep 06 '22 14:09 AlexandreEichenberger

So no idea of why this help, but it just works, correct?

Yes, we haven't found the cause.

Sep 07 '22 05:09 tungld

Jenkins Linux amd64 Build #7557 [push] Use OGE instead of OLT f... started at 21:53

Sep 14 '22 02:09 jenkins-droid

Jenkins Linux s390x Build #7573 [push] Use OGE instead of OLT f... started at 22:53

Sep 14 '22 02:09 jenkins-droid

Jenkins Linux ppc64le Build #6622 [push] Use OGE instead of OLT f... started at 22:55

Sep 14 '22 02:09 jenkins-droid

Jenkins Linux amd64 Build #7557 [push] Use OGE instead of OLT f... passed after 1 hr 7 min

Sep 14 '22 04:09 jenkins-droid

Jenkins Linux s390x Build #7573 [push] Use OGE instead of OLT f... passed after 1 hr 21 min

Sep 14 '22 04:09 jenkins-droid

Jenkins Linux ppc64le Build #6622 [push] Use OGE instead of OLT f... passed after 1 hr 37 min

Sep 14 '22 04:09 jenkins-droid

onnx-mlir onnx-mlir copied to clipboard

Use OGE instead of OLT for ONNXReluOp

onnx-mlir
onnx-mlir copied to clipboard