pytorch_dlprim
Crash when trying to use pytorch/glow built on pytorch with the OpenCL backend
Hi,
I am trying to use pytorch/glow with the OpenCL backend enabled. I want to compare inference time on the GPU for PyTorch with glow enabled and disabled, so I built pytorch with OpenCL support as instructed in this repo.
The crash is not observed when the model and data are not copied to the GPU in infer_glow() via something.to('opencl:0'), i.e. when the lines below are commented out:
lowered_model = lowered_model.to(device=device)
inputs = inputs.to(device=device)
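For context, these lines sit in infer_glow() roughly as follows (a trimmed, hypothetical sketch; the actual script is attached further below):

def infer_glow(lowered_model, inputs, device='opencl:0'):
    # The crash only appears when the next two lines are active;
    # keeping everything on the CPU (commenting them out) avoids it.
    lowered_model = lowered_model.to(device=device)
    inputs = inputs.to(device=device)
    with torch.no_grad():
        return lowered_model(inputs)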
Could you please help me understand the issue? The gdb backtrace follows:
[New Thread 0x7fff74ffd700 (LWP 6990)]
**malloc_consolidate(): invalid chunk size**
Thread 1 "python3.7" received signal SIGABRT, Aborted.
__GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
51 ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) bt
#0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#1 0x00007ffff7a227f1 in __GI_abort () at abort.c:79
#2 0x00007ffff7a6b837 in __libc_message (action=action@entry=do_abort, fmt=fmt@entry=0x7ffff7b98a7b "%s\n") at ../sysdeps/posix/libc_fatal.c:181
#3 0x00007ffff7a728ba in malloc_printerr (str=str@entry=0x7ffff7b9a2d8 "malloc_consolidate(): invalid chunk size") at malloc.c:5342
#4 0x00007ffff7a72b5e in malloc_consolidate (av=av@entry=0x7ffff7dcdc40 <main_arena>) at malloc.c:4471
#5 0x00007ffff7a76848 in _int_malloc (av=av@entry=0x7ffff7dcdc40 <main_arena>, bytes=bytes@entry=4096) at malloc.c:3713
#6 0x00007ffff7a792ad in __GI___libc_malloc (bytes=4096) at malloc.c:3075
#7 0x00007fffe09fc150 in traced_realloc () from /usr/local/lib/python3.7/dist-packages/pandas/_libs/hashtable.cpython-37m-x86_64-linux-gnu.so
#8 0x00007fffe09fc44b in ?? () from /usr/local/lib/python3.7/dist-packages/pandas/_libs/hashtable.cpython-37m-x86_64-linux-gnu.so
#9 0x00007fffe09fefbc in ?? () from /usr/local/lib/python3.7/dist-packages/pandas/_libs/hashtable.cpython-37m-x86_64-linux-gnu.so
#10 0x0000000000588f15 in ?? ()
uname -a:
Linux ip-192-168-1-210 5.4.0-1071-aws #76~18.04.1-Ubuntu SMP Mon Mar 28 17:49:57 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
Commit ids of the setup:
dlprimitives: 6eb5794aec7b48fe2e2b8d1fa7b1eab712d72d87
pytorch-dlprim: 7ec2e47cd56fdad86e08d3aff65f7c35fc89b575
pytorch: eb74af18af6e90ae47f24997af8468bf7b9deb72
glow: cda5383b1609ebad1a3631ca77b41b8a863443d4
Glow was built with a few adaptations, since the pytorch commit above is a bit older: git_diff.txt
clinfo: clinfo.txt
Python code: opencl_pytorch_glow.txt (I was not able to upload a .py file here, so I converted it to .txt).
I also tried to use:
traced_m = torch.jit.trace(resnet.to('opencl:0'), (x.to('opencl:0')))
I am facing the below error:
torch.jit._trace.TracingCheckError: Tracing failed sanity checks!
encountered an exception while running the trace with test inputs.
Exception:
Unknown device for graph fuser
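For reference, the failing call in context is essentially the following (a trimmed sketch; it assumes the opencl backend extension is already loaded):

import torch
import torchvision

resnet = torchvision.models.resnet18(pretrained=True).eval()
x = torch.randn(1, 3, 224, 224)

# Fails during the post-trace sanity check with "Unknown device for graph fuser"
traced_m = torch.jit.trace(resnet.to('opencl:0'), (x.to('opencl:0'),))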
Please let me know if you need any more information.
A few things:
- I have never used glow and don't really know how it works or what its role is, so it is quite hard for me to understand the example.
- Can you create the simplest possible example, probably with the simplest ops (like 1-2 fully connected layers), so I can reproduce it? Something along the lines of the sketch below.
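Just as a sketch of what I have in mind (assuming the opencl extension is already loaded; names are only illustrative):

import torch

model = torch.nn.Sequential(
    torch.nn.Linear(16, 32),
    torch.nn.ReLU(),
    torch.nn.Linear(32, 4),
)
x = torch.randn(8, 16)

# Move the model and input to the opencl device and run one forward pass
model = model.to('opencl:0')
x = x.to('opencl:0')
with torch.no_grad():
    y = model(x)
print(y.cpu())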
traced_m = torch.jit.trace(resnet.to('opencl:0'), (x.to('opencl:0'))) ... Unknown device for graph fuser
Probably there is some other code path that needs to know the device, or something else. I must say the OpenCL backend is really at an early stage, so there are many things that likely won't work and will need to be fixed.
Also, I get a different error when resnet is resnet50 or resnet18:
NotImplementedError: Could not run 'aten::isnan' with arguments from the 'PrivateUse1' backend
That operator is not implemented yet.
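For reference, any code that dispatches aten::isnan on an opencl tensor will hit it, e.g. (a minimal sketch, assuming the backend is loaded):

import torch

x = torch.randn(4, device='opencl:0')
# Raises NotImplementedError until an isnan kernel is added for the
# PrivateUse1 (opencl) backend
torch.isnan(x)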