MTL GPU driver not shown and GPU demo crashed on Linux
HW: MTL with ARC iGPU
OS: Ubuntu 22.04
Kernel: 6.5.0-41-generic
Ref: https://github.com/intel-analytics/ipex-llm/blob/main/docs/mddocs/Quickstart/install_linux_gpu.md
Problem 1: sycl-ls cannot find the GPU driver.
Problem 2: demo.py crashed. Log is attached.
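For context, demo.py follows the ipex-llm GPU quickstart pattern. Below is a rough sketch reconstructed from the log output further down; the Falcon checkpoint path and the prompt are placeholders, not values taken from this report:

```python
# Rough sketch of demo.py, assumed from the quickstart and the log below.
import torch
from transformers import AutoTokenizer
from ipex_llm.transformers import AutoModelForCausalLM

model_path = "tiiuae/falcon-7b-instruct"  # placeholder; the actual checkpoint is not shown in the log

# Load the checkpoint and convert it to sym_int4 (the "Converting the current model
# to sym_int4 format" step seen in the log).
model = AutoModelForCausalLM.from_pretrained(
    model_path, load_in_4bit=True, trust_remote_code=True
)
model = model.to("xpu")  # move to the iGPU; the backtrace below goes through the tensor .to() dispatch

tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
with torch.inference_mode():
    input_ids = tokenizer.encode("What is AI?", return_tensors="pt").to("xpu")
    output = model.generate(input_ids, max_new_tokens=32)
    print(tokenizer.decode(output[0], skip_special_tokens=True))
```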
Installed driver packages and sycl-ls output:

```
intel-fw-gpu is already the newest version (2024.17.5-329~22.04).
intel-i915-dkms is already the newest version (1.24.2.17.240301.20+i29-1).
(llm) sdp@9049fa09fdbc:~$ source /opt/intel/oneapi/setvars.sh --force
:: initializing oneAPI environment ...
   -bash: BASH_VERSION = 5.1.16(1)-release
   args: Using "$@" for setvars.sh arguments: --force
:: advisor -- latest
:: ccl -- latest
:: compiler -- latest
:: dal -- latest
:: debugger -- latest
:: dev-utilities -- latest
:: dnnl -- latest
:: dpcpp-ct -- latest
:: dpl -- latest
:: ipp -- latest
:: ippcp -- latest
:: mkl -- latest
:: mpi -- latest
:: tbb -- latest
:: vtune -- latest
:: oneAPI environment initialized ::
(llm) sdp@9049fa09fdbc:~$ sycl-ls
[opencl:acc:0] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device OpenCL 1.2 [2023.16.10.0.17_160000]
[opencl:cpu:1] Intel(R) OpenCL, Intel(R) Core(TM) Ultra 7 1003H OpenCL 3.0 (Build 0) [2023.16.10.0.17_160000]
```
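Only the OpenCL CPU and FPGA-emulation entries are listed; the expected Level Zero GPU entry for the Arc iGPU (something like `[ext_oneapi_level_zero:gpu:0]`) is missing, which is Problem 1. The same visibility can be cross-checked from the Python side; a sketch, with the caveat that on a broken driver stack these calls may themselves fail or crash:

```python
# Cross-check GPU visibility from IPEX, mirroring what sycl-ls should report.
import torch
import intel_extension_for_pytorch as ipex  # noqa: F401  (registers the torch.xpu backend)

print("xpu available:", torch.xpu.is_available())
print("xpu device count:", torch.xpu.device_count())
for i in range(torch.xpu.device_count()):
    print(f"  device {i}: {torch.xpu.get_device_name(i)}")
```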
Demo crash:

```
(llm) sdp@9049fa09fdbc:~$ python demo.py
/home/sdp/miniforge3/envs/llm/lib/python3.11/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: '' If you don't plan on using image functionality from torchvision.io, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have libjpeg or libpng installed before building torchvision from source?
  warn(
2024-06-27 22:51:53,784 - INFO - intel_extension_for_pytorch auto imported
/home/sdp/miniforge3/envs/llm/lib/python3.11/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: resume_download is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use force_download=True.
  warnings.warn(
2024-06-27 22:51:54,304 - WARNING -
WARNING: You are currently loading Falcon using legacy code contained in the model repository. Falcon has now been fully ported into the Hugging Face transformers library. For the most up-to-date and high-performance version of the Falcon model code, please update to the latest version of transformers and then load the model without the trust_remote_code=True argument.
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:09<00:00, 4.95s/it]
2024-06-27 22:52:04,476 - INFO - Converting the current model to sym_int4 format......
LIBXSMM_VERSION: main_stable-1.17-3651 (25693763)
LIBXSMM_TARGET: adl [Intel(R) Core(TM) Ultra 7 1003H]
Registry and code: 13 MB
Command: python demo.py
Uptime: 17.979020 s
Segmentation fault (core dumped)
```

Running demo.py again under gdb produces the following backtrace:

```
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:08<00:00, 4.25s/it]
2024-06-27 22:54:34,472 - INFO - Converting the current model to sym_int4 format......
[Detaching after vfork from child process 23148]
[New Thread 0x7fffb6fee640 (LWP 23152)]
[New Thread 0x7fffd57fb640 (LWP 23153)]
[New Thread 0x7fffd2ffa640 (LWP 23154)]
[New Thread 0x7fffd07f9640 (LWP 23155)]
[New Thread 0x7fffcdff8640 (LWP 23156)]
[New Thread 0x7fffcb7f7640 (LWP 23157)]
[New Thread 0x7fffc8ff6640 (LWP 23158)]
[New Thread 0x7fffc67f5640 (LWP 23159)]
[New Thread 0x7fffc3ff4640 (LWP 23160)]
[New Thread 0x7fffc17f3640 (LWP 23161)]
[New Thread 0x7fffbeff2640 (LWP 23162)]
[New Thread 0x7fffbe7f1640 (LWP 23163)]
[New Thread 0x7fffb9ff0640 (LWP 23164)]
[New Thread 0x7fffb77ef640 (LWP 23165)]
[New Thread 0x7ffecdf53640 (LWP 23166)]
Thread 1 "python" received signal SIGSEGV, Segmentation fault.
0x00007fff005f16ab in xpu::dpcpp::initGlobalDevicePoolState() () from /home/sdp/miniforge3/envs/llm/lib/python3.11/site-packages/intel_extension_for_pytorch/lib/libintel-ext-pt-gpu.so
(gdb) bt
#0 0x00007fff005f16ab in xpu::dpcpp::initGlobalDevicePoolState() () from /home/sdp/miniforge3/envs/llm/lib/python3.11/site-packages/intel_extension_for_pytorch/lib/libintel-ext-pt-gpu.so
#1 0x00007ffff7c99ee8 in __pthread_once_slow (once_control=0x7fff13cbddd8 <xpu::dpcpp::init_device_flag>, init_routine=0x7fffe0cdad50 <__once_proxy>) at ./nptl/pthread_once.c:116
#2 0x00007fff005ee491 in xpu::dpcpp::dpcppGetDeviceCount(int*) () from /home/sdp/miniforge3/envs/llm/lib/python3.11/site-packages/intel_extension_for_pytorch/lib/libintel-ext-pt-gpu.so
#3 0x00007fff005a8c52 in xpu::dpcpp::device_count()::{lambda()#1}::operator()() const ()
from /home/sdp/miniforge3/envs/llm/lib/python3.11/site-packages/intel_extension_for_pytorch/lib/libintel-ext-pt-gpu.so
#4 0x00007fff005a8c18 in xpu::dpcpp::device_count() () from /home/sdp/miniforge3/envs/llm/lib/python3.11/site-packages/intel_extension_for_pytorch/lib/libintel-ext-pt-gpu.so
#5 0x00007fffa23be0c8 in xpu::THPModule_initExtension(_object*, _object*) ()
from /home/sdp/miniforge3/envs/llm/lib/python3.11/site-packages/intel_extension_for_pytorch/lib/libintel-ext-pt-python.so
#6 0x000055555573950e in cfunction_vectorcall_NOARGS (func=0x7fffa2410c20, args=
at /usr/local/src/conda/python-3.11.9/Include/cpython/methodobject.h:52
#7 0x000055555574eeac in _PyObject_VectorcallTstate (kwnames=
tstate=0x555555ad0998 <_PyRuntime+166328>) at /usr/local/src/conda/python-3.11.9/Include/internal/pycore_call.h:92
#8 PyObject_Vectorcall (callable=0x7fffa2410c20, args=
#9 0x00005555557423b6 in _PyEval_EvalFrameDefault (tstate=
#10 0x0000555555765981 in _PyEval_EvalFrame (throwflag=0, frame=0x7ffff7fb07d0, tstate=0x555555ad0998 <_PyRuntime+166328>)
at /usr/local/src/conda/python-3.11.9/Include/internal/pycore_ceval.h:73
#11 _PyEval_Vector (kwnames=
at /usr/local/src/conda/python-3.11.9/Python/ceval.c:6434
#12 _PyFunction_Vectorcall (func=
#13 0x0000555555730244 in _PyObject_VectorcallTstate (tstate=0x555555ad0998 <_PyRuntime+166328>, callable=0x7ffee5567380, args=
kwnames=<optimized out>) at /usr/local/src/conda/python-3.11.9/Include/internal/pycore_call.h:92
#14 0x00005555557fef1c in PyObject_CallMethod (obj=
#15 0x00007fffa23bb48d in xpu::lazy_init() () from /home/sdp/miniforge3/envs/llm/lib/python3.11/site-packages/intel_extension_for_pytorch/lib/libintel-ext-pt-python.so
#16 0x00007fff005a8d86 in xpu::dpcpp::current_device() () from /home/sdp/miniforge3/envs/llm/lib/python3.11/site-packages/intel_extension_for_pytorch/lib/libintel-ext-pt-gpu.so
#17 0x00007fff005ad5b6 in xpu::dpcpp::impl::DPCPPGuardImpl::getDevice() const ()
from /home/sdp/miniforge3/envs/llm/lib/python3.11/site-packages/intel_extension_for_pytorch/lib/libintel-ext-pt-gpu.so
#18 0x00007fffe29b274f in at::native::to(at::Tensor const&, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional
#19 0x00007fffe37c3743 in c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (at::Tensor const&, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional
from /home/sdp/miniforge3/envs/llm/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so
#20 0x00007fffe3049eea in at::_ops::to_dtype_layout::call(at::Tensor const&, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional
#21 0x00007fffef1dfa19 in torch::autograd::dispatch_to(at::Tensor const&, c10::Device, bool, bool, c10::optional<c10::MemoryFormat>) ()
from /home/sdp/miniforge3/envs/llm/lib/python3.11/site-packages/torch/lib/libtorch_python.so
#22 0x00007fffef24a8ec in torch::autograd::THPVariable_to(_object*, _object*, _object*) () from /home/sdp/miniforge3/envs/llm/lib/python3.11/site-packages/torch/lib/libtorch_python.so
#23 0x000055555575f1c8 in method_vectorcall_VARARGS_KEYWORDS (func=0x7ffff7104360, args=0x7ffff7fb07a8, nargsf=
at /usr/local/src/conda/python-3.11.9/Objects/descrobject.c:364
#24 0x000055555574eeac in _PyObject_VectorcallTstate (kwnames=
tstate=0x555555ad0998 <_PyRuntime+166328>) at /usr/local/src/conda/python-3.11.9/Include/internal/pycore_call.h:92
#25 PyObject_Vectorcall (callable=0x7ffff7104360, args=
#26 0x00005555557423b6 in _PyEval_EvalFrameDefault (tstate=
#27 0x0000555555783fc2 in _PyEval_EvalFrame (throwflag=0, frame=0x7ffff7fb0140, tstate=0x555555ad0998 <_PyRuntime+166328>)
at /usr/local/src/conda/python-3.11.9/Include/internal/pycore_ceval.h:73
#28 _PyEval_Vector (kwnames=
at /usr/local/src/conda/python-3.11.9/Python/ceval.c:6434
#29 _PyFunction_Vectorcall (kwnames=
#30 _PyObject_VectorcallTstate (kwnames=
at /usr/local/src/conda/python-3.11.9/Include/internal/pycore_call.h:92
#31 method_vectorcall (method=
--Type <RET> for more, q to quit, c to continue without paging--
```
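The top frames of the backtrace are device enumeration inside libintel-ext-pt-gpu.so (`initGlobalDevicePoolState` / `dpcppGetDeviceCount`), reached from the tensor `.to()` dispatch, i.e. while the converted model is being moved to the XPU device. If that is the failing step, it should reproduce without loading any model at all; a minimal sketch under that assumption:

```python
# Minimal reproducer sketch: trigger only the .to("xpu") dispatch seen in the backtrace,
# without loading Falcon or any other checkpoint.
import torch
import intel_extension_for_pytorch as ipex  # noqa: F401

x = torch.empty(8)
# dispatch_to -> DPCPPGuardImpl::getDevice -> dpcppGetDeviceCount -> initGlobalDevicePoolState
y = x.to("xpu")
print(y.device)
```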
After reinstalling level-zero, the crash changed to "Killed".