Valgrind fails on googlenet, inception, shufflenet, and squeezenet

Open weonyuan opened this issue 2 years ago • 8 comments

Setup: C++ client, CPU model, Valgrind 3.21.0 (latest). Valgrind reports errors for all versions of googlenet, inception, shufflenet, and squeezenet. The example below is from squeezenet1.0-12:

valgrind --leak-check=yes --leak-check=full --show-leak-kinds=all --track-origins=yes /code/client/bin/modelzoo --iterations 1 --validate --msg-level INFO --file /model/squeezenet1.0-12.tests --lib /model/squeezenet1.0-12.so --fc-parms 0.01,0.0,1,10 --data-set-indices 0

  ==90== Memcheck, a memory error detector
  ==90== Copyright (C) 2002-2022, and GNU GPL'd, by Julian Seward et al.
  ==90== Using Valgrind-3.21.0 and LibVEX; rerun with -h for copyright info
  ==90== Command: /code/client/bin/modelzoo --iterations 1 --validate --msg-level INFO --file /model/squeezenet1.0-12.tests --lib /model/squeezenet1.0-12.so --fc-parms 0.01,0.0,1,10 --data-set-indices 0
  ==90== 
  Iteration 0 dataset 0: Running
  ==90== Invalid read of size 16
  ==90==    at 0x509596C: main_graph (in /model/squeezenet1.0-12.so)
  ==90==    by 0x5095945: main_graph (in /model/squeezenet1.0-12.so)
  ==90==  Address 0x5d52c40 is 0 bytes after a block of size 193,616 alloc'd
  ==90==    at 0x48382F0: malloc (vg_replace_malloc.c:431)
  ==90==    by 0x5093CEF: main_graph (in /model/squeezenet1.0-12.so)
  ==90== 
  ==90== Invalid read of size 16
  ==90==    at 0x5095978: main_graph (in /model/squeezenet1.0-12.so)
  ==90==    by 0x5095945: main_graph (in /model/squeezenet1.0-12.so)
  ==90==  Address 0x5d52c60 is 32 bytes before a block of size 32 in arena "client"
  ==90== 
  ==90== Invalid read of size 16
  ==90==    at 0x509597E: main_graph (in /model/squeezenet1.0-12.so)
  ==90==    by 0x5095945: main_graph (in /model/squeezenet1.0-12.so)
  ==90==  Address 0x5d52c50 is 16 bytes after a block of size 193,616 alloc'd
  ==90==    at 0x48382F0: malloc (vg_replace_malloc.c:431)
  ==90==    by 0x5093CEF: main_graph (in /model/squeezenet1.0-12.so)
  ==90== 
  ==90== Invalid read of size 16
  ==90==    at 0x50996A0: main_graph (in /model/squeezenet1.0-12.so)
  ==90==    by 0x5099679: main_graph (in /model/squeezenet1.0-12.so)
  ==90==  Address 0x64879d0 is 0 bytes after a block of size 193,616 alloc'd
  ==90==    at 0x48382F0: malloc (vg_replace_malloc.c:431)
  ==90==    by 0x5097A23: main_graph (in /model/squeezenet1.0-12.so)
  ==90== 
  ==90== Invalid read of size 16
  ==90==    at 0x50996AC: main_graph (in /model/squeezenet1.0-12.so)
  ==90==    by 0x5099679: main_graph (in /model/squeezenet1.0-12.so)
  ==90==  Address 0x64879f0 is 32 bytes before an unallocated block of size 902,608 in arena "client"
...
  ==90== 
  ==90== HEAP SUMMARY:
  ==90==     in use at exit: 0 bytes in 0 blocks
  ==90==   total heap usage: 23,411 allocs, 23,411 frees, 41,541,222 bytes allocated
  ==90== 
  ==90== All heap blocks were freed -- no leaks are possible
  ==90== 
  ==90== For lists of detected and suppressed errors, rerun with: -s
  ==90== ERROR SUMMARY: 17 errors from 17 contexts (suppressed: 0 from 0)

weonyuan avatar Aug 03 '23 07:08 weonyuan
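
The addresses in the report above can be checked with a little arithmetic. The following is a minimal sketch that uses only values copied from the first three "Invalid read" entries; the function names are hypothetical helpers introduced for illustration and are not part of onnx-mlir or Valgrind.

```python
# Values copied from the Valgrind report for squeezenet1.0-12.
BLOCK_SIZE = 193_616  # "block of size 193,616 alloc'd"
READ_SIZE = 16        # "Invalid read of size 16"

# Valgrind describes 0x5d52c40 as "0 bytes after" the block, so this
# address is the first byte past the end of the allocation.
BLOCK_END = 0x5D52C40

# Addresses of the first three invalid reads, in report order.
READ_ADDRESSES = [0x5D52C40, 0x5D52C60, 0x5D52C50]


def block_start(block_end: int, block_size: int) -> int:
    """Return the start address of a block given its end and size."""
    return block_end - block_size


def overrun_offsets(block_end: int, addresses: list[int]) -> list[int]:
    """Return how far past the end of the block each read begins."""
    return [addr - block_end for addr in addresses]


def bytes_past_end(block_end: int, addresses: list[int], read_size: int) -> int:
    """Return the number of bytes past the block end touched by the reads."""
    return max(addr + read_size for addr in addresses) - block_end


if __name__ == "__main__":
    print("block start:", hex(block_start(BLOCK_END, BLOCK_SIZE)))
    print("read offsets past end:", overrun_offsets(BLOCK_END, READ_ADDRESSES))
    print("bytes touched past end:",
          bytes_past_end(BLOCK_END, READ_ADDRESSES, READ_SIZE))
```

Under these numbers, the three reads start at 0, 32, and 16 bytes past the end of the 193,616-byte block and together cover the 48 bytes immediately after it. That pattern looks like reads running past the end of a buffer, but the log alone does not identify the cause.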

On which machine, under which options?

AlexandreEichenberger avatar Aug 03 '23 14:08 AlexandreEichenberger

The models were compiled on s390x for CPU with options --EmitLib --O3 --onnx-op-stats=TXT --mtriple=s390x-ibm-loz --mcpu=z14

cjvolzka avatar Aug 07 '23 16:08 cjvolzka

Did this start happening after a certain onnx-mlir commit?

gongsu832 avatar Aug 09 '23 01:08 gongsu832

AFAIK, this valgrind result was captured with the onnx-mlir commit SHA https://github.com/onnx/onnx-mlir/commit/e7dcf975f030183084a3771e6626ec19aaab7987

weonyuan avatar Aug 09 '23 03:08 weonyuan

@gongsu832 We don't run the valgrind tests frequently because they take an exceptionally long time. The last time we ran them was probably near the 0.4.0 release, so we can't easily limit the commit range beyond that.

cjvolzka avatar Aug 10 '23 02:08 cjvolzka

Hi @gongsu832, just want to follow up. Have there been any updates on this?

weonyuan avatar Aug 15 '23 18:08 weonyuan

No. I haven't had a chance. I will try to look at it in the next couple of days.

gongsu832 avatar Aug 16 '23 01:08 gongsu832

I narrowed down the commit that introduced the issue to https://github.com/onnx/onnx-mlir/commit/8e20096e0ffdbeb44adf5b9ea61c7a34e1842eaa. The commit right before it doesn't have valgrind issues with squeezenet1.0-12.

cjvolzka avatar Aug 16 '23 15:08 cjvolzka
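
The thread does not say how the offending commit was located. One common way to narrow a regression to a single commit is bisection, sketched below under stated assumptions: the commit list is ordered oldest to newest, the newest commit is known to be bad, and every commit after the first bad one is also bad. The function name `first_bad_commit`, the commit labels, and the `is_bad` predicate are hypothetical; in practice `is_bad` would rebuild the model at that commit and rerun Valgrind, for example with its `--error-exitcode` option so that errors show up in the exit status.

```python
from typing import Callable, Sequence


def first_bad_commit(commits: Sequence[str],
                     is_bad: Callable[[str], bool]) -> str:
    """Binary-search for the first bad commit.

    Assumes `commits` is ordered oldest to newest, the last commit is bad,
    and badness is monotonic (once bad, all later commits are bad).
    """
    if not commits:
        raise ValueError("commits must not be empty")
    lo, hi = 0, len(commits) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid        # first bad commit is at mid or earlier
        else:
            lo = mid + 1    # first bad commit is after mid
    return commits[lo]


if __name__ == "__main__":
    # Hypothetical history: placeholder labels, not real onnx-mlir commits.
    history = [f"c{i}" for i in range(10)]
    # Stand-in predicate: pretend the regression landed in "c6".
    print(first_bad_commit(history, lambda c: int(c[1:]) >= 6))
```

Each step halves the remaining range, which matters when a single Valgrind run is slow. `git bisect` automates the same search over a real repository.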