intel-extension-for-pytorch
Arrays larger than 4 GB crash
Describe the bug
Intel compute runtime doesn't allow allocating a buffer bigger than 4 GB.
https://github.com/intel/compute-runtime/issues/627
When you allocate an array bigger than 4 GB with intel-extension-for-pytorch on an A770 16GB, it crashes.
x = torch.rand(46000, 46000, dtype=torch.float32, device='xpu')
Is it possible to allocate multiple buffers for an array instead of allocating one buffer for one array?
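For example, something along these lines would keep each device buffer under 4 GB (a sketch only of the idea in the question; the chunk size and helper name are illustrative, not IPEX API):

import torch
import intel_extension_for_pytorch as ipex  # noqa: F401  (registers the 'xpu' device)

def rand_mean_in_chunks(rows, cols, chunk_rows=10000, dtype=torch.float32):
    # Allocate and reduce the matrix chunk by chunk so that no single device
    # buffer exceeds 4 GB, then combine the per-chunk means weighted by rows.
    total, done = 0.0, 0
    for start in range(0, rows, chunk_rows):
        n = min(chunk_rows, rows - start)
        chunk = torch.rand(n, cols, dtype=dtype, device='xpu')  # 10000 x 46000 fp32 is ~1.84 GB
        total += torch.mean(chunk).item() * n
        done += n
        del chunk  # release the buffer before allocating the next chunk
    return total / done

print(rand_mean_in_chunks(46000, 46000))  # stays under 4 GB per allocation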
Versions
Collecting environment information...
PyTorch version: 1.13.0a0+gitb1dde16
PyTorch CXX11 ABI: Yes
IPEX version: 1.13.10+xpu
IPEX commit: 7d85b0e92
Build type: Release
OS: Ubuntu 22.04.1 LTS (x86_64)
GCC version: (Ubuntu 11.3.0-1ubuntu1~22.04) 11.3.0
Clang version: N/A
IGC version: N/A
CMake version: N/A
Libc version: glibc-2.35
Python version: 3.10.6 (main, Nov 14 2022, 16:10:14) [GCC 11.3.0] (64-bit runtime)
Python platform: Linux-6.3.0-1-x86_64-with-glibc2.35
Is XPU available: True
DPCPP runtime version: N/A
MKL version: N/A
GPU models and configuration:
[0] _DeviceProperties(name='Intel(R) Graphics [0x56a0]', platform_name='Intel(R) Level-Zero', dev_type='gpu, support_fp64=0, total_memory=15473MB, max_compute_units=512)
Intel OpenCL ICD version: 22.43.24595.35+i538~22.04
Level Zero version: 1.3.24595.35+i538~22.04
CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 46 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 20
On-line CPU(s) list: 0-19
Vendor ID: GenuineIntel
BIOS Vendor ID: Intel(R) Corporation
Model name: 13th Gen Intel(R) Core(TM) i5-13600K
BIOS Model name: 13th Gen Intel(R) Core(TM) i5-13600K
CPU family: 6
Model: 183
Thread(s) per core: 2
Core(s) per socket: 14
Socket(s): 1
Stepping: 1
CPU max MHz: 5100.0000
CPU min MHz: 800.0000
BogoMIPS: 6991.00
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb intel_pt sha_ni xsaveopt xsavec xgetbv1 xsaves split_lock_detect avx_vnni dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp hwp_pkg_req hfi umip pku ospke waitpkg gfni vaes vpclmulqdq tme rdpid movdiri movdir64b fsrm md_clear serialize pconfig arch_lbr ibt flush_l1d arch_capabilities
Virtualization: VT-x
L1d cache: 544 KiB (14 instances)
L1i cache: 704 KiB (14 instances)
L2 cache: 20 MiB (8 instances)
L3 cache: 24 MiB (1 instance)
NUMA node(s): 1
NUMA node0 CPU(s): 0-19
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Retbleed: Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Enhanced / Automatic IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS SW sequence
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected
Versions of relevant libraries:
[pip3] intel-extension-for-pytorch==1.13.10+xpu
[pip3] numpy==1.24.1
[pip3] torch==1.13.0a0+gitb1dde16
[pip3] torchvision==0.14.1a0+0504df5
[conda] N/A
@tye1
I did some further tests and it seems that allocating more than 4 GB returns garbage or randomly crashes.
Example of allocating less than 4 GB on an A770 16GB. The mean is around 0.5, which is expected.
import torch
import torchvision.models as models
import numpy as np
import intel_extension_for_pytorch as ipex
torch.manual_seed(0)
x = torch.rand(30000, 30000, dtype=torch.float32, device='xpu')
print("Mean")
print(torch.mean(x).detach().cpu().numpy())
python3 ./test.py
Failed to load image Python extension:
warn(f"Failed to load image Python extension: {e}")
Mean
0.50001085
Example of allocating more than 4 GB on CPU. The mean is around 0.5, which is expected.
import torch
import torchvision.models as models
import numpy as np
import intel_extension_for_pytorch as ipex
torch.manual_seed(0)
x = torch.rand(47000, 47000, dtype=torch.float32, device='cpu')
print("Mean")
print(torch.mean(x).detach().cpu().numpy())
python3 ./test.py
/usr/local/lib/python3.10/dist-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension:
warn(f"Failed to load image Python extension: {e}")
Mean
0.4999941
Example of allocating more than 4 GB on an A770 16GB. The mean is around 0.014, which is completely wrong.
import torch
import torchvision.models as models
import numpy as np
import intel_extension_for_pytorch as ipex
torch.manual_seed(0)
x = torch.rand(47000, 47000, dtype=torch.float32, device='xpu')
print("Mean")
print(torch.mean(x).detach().cpu().numpy())
python3 ./test.py
/usr/local/lib/python3.10/dist-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension:
warn(f"Failed to load image Python extension: {e}")
Mean
0.014004011
In conclusion, allocating more than 4GB crashes or returns complete garbage.
@jingxu10 Is memory allocation done by OpenCL, Level Zero, or oneDNN?
It should be allocated by Level Zero. @gujinghui
@jingxu10
Will passing -ze-opt-greater-than-4GB-buffer-required into the build options fix it?
https://spec.oneapi.io/level-zero/latest/core/PROG.html#module-build-options
Hi, @BA8F0D39
What's the driver version? I cannot reproduce the random crash with agama-ci-devel-602. From what I've tried, the maximum workable input shape for your test is about 59500*59500, which corresponds to a memory size of about 13.2 GB. That is a reasonable result.
As for the accuracy issue, we will check it.
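For reference, the allocation sizes being discussed can be checked with a quick back-of-the-envelope calculation (plain Python, just restating the numbers above; the 13.2 figure is in GiB):

def tensor_gib(rows, cols, bytes_per_elem=4):  # 4 bytes per float32 element
    return rows * cols * bytes_per_elem / 2**30

print(tensor_gib(47000, 47000))  # ~8.23 GiB, well above the 4 GB single-buffer limit
print(tensor_gib(59500, 59500))  # ~13.19 GiB, close to the A770's 16 GB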
Hi @BA8F0D39
Thank you for using Intel products and IPEX.
On our side, we can successfully allocate large memory (up to the total physical memory size) and compute correctly.
Can you provide the driver version you are using, via the command below?
sudo dpkg -l | grep intel
And could you set the following flags and attach the log here when you hit the error?
export SYCL_PI_TRACE=-1
export ZE_DEBUG=-1
Thank you.
@cchheennhhaaoo
On Windows 11 WSL:
ii intel-level-zero-gpu 1.3.24595.35+i538~22.04 amd64 Intel(R) Graphics Compute Runtime for oneAPI Level Zero.
ii intel-oneapi-runtime-ccl 2021.8.0-25371 amd64 Intel® oneAPI Collective Communications Library runtime
ii intel-oneapi-runtime-compilers 2023.0.0-25370 amd64 Intel® oneAPI DPC++/C++ Compiler & Intel® C++ Compiler Classic runtime common files
ii intel-oneapi-runtime-compilers-common 2023.0.0-25370 all Intel® oneAPI DPC++/C++ Compiler & Intel® C++ Compiler Classic runtime common files
ii intel-oneapi-runtime-dpcpp-cpp 2023.0.0-25370 amd64 Intel® oneAPI DPC++/C++ Compiler & Intel® C++ Compiler Classic runtime
ii intel-oneapi-runtime-dpcpp-cpp-common 2023.0.0-25370 all Intel® oneAPI DPC++/C++ Compiler & Intel® C++ Compiler Classic runtime
ii intel-oneapi-runtime-mkl 2023.0.0-25398 amd64 Intel® oneAPI Math Kernel Library runtime
ii intel-oneapi-runtime-mkl-common 2023.0.0-25398 all Intel® oneAPI Math Kernel Library runtime common
ii intel-oneapi-runtime-mpi 2021.8.0-25329 amd64 Intel® MPI Library runtime
ii intel-oneapi-runtime-opencl 2023.0.0-25370 amd64 Intel® CPU Runtime for OpenCL(TM) Applications runtime
ii intel-oneapi-runtime-openmp 2023.0.0-25370 amd64 Intel® OpenMP* Runtime Library runtime
ii intel-oneapi-runtime-openmp-common 2023.0.0-25370 all l_openmp.runtime.description>
ii intel-oneapi-runtime-tbb 2021.8.0-25334 amd64 Intel® oneAPI Threading Building Blocks runtime
ii intel-oneapi-runtime-tbb-common 2021.8.0-25334 all Intel® oneAPI Threading Building Blocks runtime common
ii intel-opencl-icd 22.43.24595.35+i538~22.04 amd64 Intel graphics compute runtime for OpenCL
Code
import torch
import torchvision.models as models
import numpy as np
import intel_extension_for_pytorch as ipex
torch.manual_seed(0)
x = torch.rand(47000, 47000, dtype=torch.float32, device='xpu')
print("Mean")
print(torch.mean(x).detach().cpu().numpy())
ZE ---> zeContextDestroy(DestoryZeContext)
ZE ---> zeKernelDestroy(Kernel->ZeKernel)
PI ---> piProgramRelease(KernelProgram)
ZE ---> zeModuleBuildLogDestroy(ZeBuildLog)
ZE ---> zeModuleDestroy(ZeModule)
ZE_DEBUG=4: check balance of create/destroy calls
----------------------------------------------------------
zeContextCreate = 1 \---> zeContextDestroy = 1
zeCommandQueueCreate = 1 \---> zeCommandQueueDestroy = 1
zeModuleCreate = 1 \---> zeModuleDestroy = 1
zeKernelCreate = 1 \---> zeKernelDestroy = 1
zeEventPoolCreate = 1 \---> zeEventPoolDestroy = 1
zeCommandListCreateImmediate = 1 |
zeCommandListCreate = 2 \---> zeCommandListDestroy = 3
zeEventCreate = 8 \---> zeEventDestroy = 8
zeFenceCreate = 2 \---> zeFenceDestroy = 2
zeImageCreate = 0 \---> zeImageDestroy = 0
zeSamplerCreate = 0 \---> zeSamplerDestroy = 0
zeMemAllocDevice = 1 |
zeMemAllocHost = 0 |
zeMemAllocShared = 0 \---> zeMemFree = 0 ---> LEAK = 1
terminate called after throwing an instance of 'sycl::_V1::runtime_error'
what(): Native API failed. Native API returns: -38 (PI_ERROR_INVALID_MEM_OBJECT) -38 (PI_ERROR_INVALID_MEM_OBJECT)
Aborted
@cchheennhhaaoo @zejun-chen
On Ubuntu 22.04 with Linux 6.3 it also crashes, but only after I close Python.
ii intel-level-zero-gpu 1.3.25593.18-601~22.04 amd64 Intel(R) Graphics Compute Runtime for oneAPI Level Zero.
ii intel-oneapi-runtime-ccl 2021.9.0-43543 amd64 Intel® oneAPI Collective Communications Library runtime
ii intel-oneapi-runtime-compilers 2023.1.0-46305 amd64 Intel® oneAPI DPC++/C++ Compiler & Intel® C++ Compiler Classic runtime common files
ii intel-oneapi-runtime-compilers-common 2023.1.0-46305 all Intel® oneAPI DPC++/C++ Compiler & Intel® C++ Compiler Classic runtime common files
ii intel-oneapi-runtime-dpcpp-cpp 2023.1.0-46305 amd64 Intel® oneAPI DPC++/C++ Compiler & Intel® C++ Compiler Classic runtime
ii intel-oneapi-runtime-dpcpp-cpp-common 2023.1.0-46305 all Intel® oneAPI DPC++/C++ Compiler & Intel® C++ Compiler Classic runtime
ii intel-oneapi-runtime-mkl 2023.1.0-46342 amd64 Intel® oneAPI Math Kernel Library runtime
ii intel-oneapi-runtime-mkl-common 2023.1.0-46342 all Intel® oneAPI Math Kernel Library runtime common
ii intel-oneapi-runtime-mpi 2021.9.0-43482 amd64 Intel® MPI Library runtime
ii intel-oneapi-runtime-opencl 2023.1.0-46305 amd64 Intel® CPU Runtime for OpenCL(TM) Applications runtime
ii intel-oneapi-runtime-openmp 2023.1.0-46305 amd64 Intel® OpenMP* Runtime Library runtime
ii intel-oneapi-runtime-openmp-common 2023.1.0-46305 all l_openmp.runtime.description>
ii intel-oneapi-runtime-tbb 2021.9.0-43484 amd64 Intel® oneAPI Threading Building Blocks runtime
ii intel-oneapi-runtime-tbb-common 2021.9.0-43484 all Intel® oneAPI Threading Building Blocks runtime common
ii intel-opencl-icd 23.05.25593.18-601~22.04 amd64 Intel graphics compute runtime for OpenCL
ii libdrm-intel1:amd64 2.4.115+git2303241447.28d9a3c4~j~mesarc0 amd64 Userspace interface to intel-specific kernel DRM services -- runtime
Code
import torch
import torchvision.models as models
import numpy as np
import intel_extension_for_pytorch as ipex
torch.manual_seed(0)
x = torch.rand(47000, 47000, dtype=torch.float32, device='xpu')
print("Mean")
print(torch.mean(x).detach().cpu().numpy())
Crash
ZE ---> zeEventDestroy(Event->ZeEvent)
ZE ---> zeEventDestroy(Event->ZeEvent)
ZE ---> zeEventDestroy(Event->ZeEvent)
ZE ---> zeEventDestroy(Event->ZeEvent)
ZE ---> zeEventDestroy(Event->ZeEvent)
ZE ---> zeEventPoolDestroy(ZePool)
ZE ---> zeCommandListDestroy(ZeCommandListInit)
ZE ---> zeCommandListDestroy(ZeCommandList)
ZE ---> zeCommandListDestroy(ZeCommandList)
ZE ---> zeCommandListDestroy(ZeCommandList)
ZE ---> zeCommandListDestroy(ZeCommandList)
ZE ---> zeCommandListDestroy(ZeCommandList)
ZE ---> zeMemFree(Context->ZeContext, Ptr)
ZE ---> zeContextDestroy(DestoryZeContext)
ZE ---> zeKernelDestroy(Kernel->ZeKernel)
PI ---> piProgramRelease(KernelProgram)
ZE ---> zeModuleBuildLogDestroy(ZeBuildLog)
ZE ---> zeModuleDestroy(ZeModule)
ZE ---> zeKernelDestroy(Kernel->ZeKernel)
PI ---> piProgramRelease(KernelProgram)
ZE ---> zeKernelDestroy(Kernel->ZeKernel)
PI ---> piProgramRelease(KernelProgram)
ZE ---> zeModuleBuildLogDestroy(ZeBuildLog)
ZE ---> zeModuleDestroy(ZeModule)
ZE_DEBUG=4: check balance of create/destroy calls
----------------------------------------------------------
zeContextCreate = 1 \---> zeContextDestroy = 1
zeCommandQueueCreate = 2 \---> zeCommandQueueDestroy = 2
zeModuleCreate = 2 \---> zeModuleDestroy = 2
zeKernelCreate = 3 \---> zeKernelDestroy = 3
zeEventPoolCreate = 1 \---> zeEventPoolDestroy = 1
zeCommandListCreateImmediate = 1 |
zeCommandListCreate = 5 \---> zeCommandListDestroy = 6
zeEventCreate = 18 \---> zeEventDestroy = 18
zeFenceCreate = 5 \---> zeFenceDestroy = 5
zeImageCreate = 0 \---> zeImageDestroy = 0
zeSamplerCreate = 0 \---> zeSamplerDestroy = 0
zeMemAllocDevice = 2 |
zeMemAllocHost = 0 |
zeMemAllocShared = 0 \---> zeMemFree = 1 ---> LEAK = 1
terminate called after throwing an instance of 'sycl::_V1::runtime_error'
what(): Native API failed. Native API returns: -38 (PI_ERROR_INVALID_MEM_OBJECT) -38 (PI_ERROR_INVALID_MEM_OBJECT)
Aborted (core dumped)
I believe this issue is caused by an incorrect environment setup. You can follow this blog to set up the IPEX environment on WSL2 with Docker: https://medium.com/intel-analytics-software/stable-diffusion-with-intel-arc-gpus-f2986bba8365
@cchheennhhaaoo @zejun-chen I have the same problem on Ubuntu Linux too (not using Windows); see my Ubuntu 22.04 report above.
I am able to replicate the same issue on Fedora 37 with 6.2 and Ubuntu 22.04 with 5.19. Both instances involve a build from the latest xpu-master branch.
It is weird that the crash error is only reported when you enable the debug flags; otherwise the code crashes silently.
export SYCL_PI_TRACE=-1
export ZE_DEBUG=-1
Here are some quick findings I had: it's not exactly at 4 GB, so I don't think the gibberish is related...
# All good
import torch
import intel_extension_for_pytorch as ipex
array = torch.rand(40000, 40000, dtype=torch.bfloat16, device='xpu')
print(f"The memory of the array is {(array.element_size() * array.nelement()) / 1e9}GB.") #3.2GB
print("Mean:", torch.mean(array).item()) #0.5
print("Standard Deviation:", torch.std(array).item()) #0.287109375
# All good
import torch
import intel_extension_for_pytorch as ipex
array = torch.rand(46000, 46000, dtype=torch.bfloat16, device='xpu')
print(f"The memory of the array is {(array.element_size() * array.nelement()) / 1e9}GB.") #4.232GB
print("Mean:", torch.mean(array).item()) #0.5
print("Standard Deviation:", torch.std(array).item()) #0.2890625
# At 46001x46001 it goes gibberish
import torch
import intel_extension_for_pytorch as ipex
array = torch.rand(46001, 46001, dtype=torch.bfloat16, device='xpu')
print(f"The memory of the array is {(array.element_size() * array.nelement()) / 1e9}GB.") #4.423218400GB
print("Mean:", torch.mean(array).item()) #0.00372314453125
print("Standard Deviation:", torch.std(array).item()) #0.049072265625
For FP16, I have some other weird bugs where it sometimes works and sometimes doesn't, even for small arrays (less than 10000x10000). Even over multiple consecutive runs, it might work 50 times in a row, then go bonkers for 10.
For FP32, the gibberish starts appearing at around 30800x30800, which is 3.79456 GB. Before that, starting around 30400x30400, the output alternates between gibberish and good values over multiple successive runs.
With such numerical instability, I might write a script and test every possible combination at this point (a rough sketch of such a sweep is below); it might be worth taking a look at other random sampling methods too.
Just did another quick run for FP32 at 30800x30800 and this time it works just fine (even 32000x32000 works this time around); there is some weird instability going on...
Quick thought: since I am not using a fixed seed in those tests, might it be that some "bad seeds" are causing the instability?
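A minimal sketch of such a sweep (the size range, step, and the 0.5 sanity threshold are illustrative choices, not a definitive test):

import torch
import intel_extension_for_pytorch as ipex  # noqa: F401

def sweep_means(start=30000, stop=32001, step=200, dtype=torch.float32):
    # Allocate square random matrices of increasing size and flag any run
    # whose mean is far from the 0.5 expected for uniform [0, 1) values.
    for n in range(start, stop, step):
        x = torch.rand(n, n, dtype=dtype, device='xpu')
        mean = torch.mean(x).item()
        gib = x.element_size() * x.nelement() / 2**30
        status = "OK" if abs(mean - 0.5) < 0.01 else "SUSPECT"
        print(f"{n}x{n} ({gib:.2f} GiB): mean={mean:.6f} {status}")
        del x

sweep_means()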
@fredlarochelle I think some pointers in the oneDNN GPU kernels use 32-bit unsigned integers and some use 64-bit unsigned integers. Reading past 4 GB creates a buffer over-read (reading adjacent memory locations and other arrays).
If the adjacent memory locations just so happen to contain zeros, then the mean is around 0.
If the adjacent memory locations just so happen to contain uniformly distributed values from 0 to 1, then the mean is around 0.5.
It could also allow you to read another program's data on the GPU.
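A quick illustration of why 4 GB is the suspicious boundary for a 32-bit byte offset (plain arithmetic, not an inspection of the actual oneDNN kernels):

n_elems = 47000 * 47000        # 2,209,000,000 float32 elements
n_bytes = n_elems * 4          # 8,836,000,000 bytes, ~8.2 GiB
print(n_bytes > 2**32)         # True: the byte offset does not fit in 32 bits
print(n_bytes % 2**32)         # 246,065,408: where a wrapped 32-bit offset would land
print(n_elems > 2**31 - 1)     # True: even the element count overflows a signed 32-bit int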
@BA8F0D39 That would make sense, but FP16 and FP32 still start acting weird before they would actually overfill a 32-bit buffer, so there is probably more than one problem going on at the same time.
@fredlarochelle @BA8F0D39 Thanks for the feedback.
The issue mentioned here (the so-called numerical instability) looks like one we met recently in internal testing. It might be caused by cache consistency after a global memory fence. We are following up on it.
BTW, as for the crashes when allocating memory larger than 4 GB, we cannot reproduce them on the recommended driver.
@arthuryuan1987 On Windows 11 with WSL, it crashes 100% of the time.
On Ubuntu Linux 22.04 with the 5.19 out-of-tree driver (intel-i915-dkms intel-platform-vsec-dkms intel-platform-cse-dkms intel-fw-gpu), it crashes randomly and non-deterministically. https://dgpu-docs.intel.com/driver/client/overview.html
On Ubuntu Linux 22.04 with the 6.3 mainline kernel, it also crashes randomly.
I can force it to crash 100% of the time by enabling the debug flags.
export SYCL_PI_TRACE=-1
export ZE_DEBUG=-1
@arthuryuan1987 I am on Ubuntu 22.04.2 with 5.19.0-41-generic, on the latest driver, following the installation instructions in the documentation, with a build from the latest commit on the xpu-master branch.
@arthuryuan1987
I used a Vulkan GPU memory tester: https://github.com/GpuZelenograd/memtest_vulkan
It seems that all memory regions above 4 GB are corrupt and the read transfer speed is 1.9 GB/s.
./memtest_vulkan 1 9140000000
Error found. Mode NEXT_RE_READ, total errors 0x20000000 out of 0x2C000000 (72.72727273%)
Errors address range: 0x30000000..=0xAFFFFFFF iteration:1
values range: 0x00000000..=0x00000000 FFFFFFFF-like count:0 bit-level stats table:
0x0 0x1 0x2 0x3| 0x4 0x5 0x6 0x7| 0x8 0x9 0xA 0xB| 0xC 0xD 0xE 0xF
SinglIdx | 1 | | 1
TogglCnt 2 56 761|6673 42k 205k793k| 2m 6m 14m 27m| 45m 63m 76m 81m
0x1? 74m 58m 40m 24m| 12m 5m 1m589k|145k 28k 4277 457| 31 1
1sInValu536m | | |
Error found. Mode INITIAL_READ, total errors 0x20000000 out of 0x2C000000 (72.72727273%)
Errors address range: 0xE0000000..=0x15FFFFFFF iteration:1
values range: 0x00000000..=0x00000000 FFFFFFFF-like count:0 bit-level stats table:
0x0 0x1 0x2 0x3| 0x4 0x5 0x6 0x7| 0x8 0x9 0xA 0xB| 0xC 0xD 0xE 0xF
SinglIdx | 1 | | 1
TogglCnt 2 56 761|6673 42k 205k793k| 2m 6m 14m 27m| 45m 63m 76m 81m
0x1? 74m 58m 40m 24m| 12m 5m 1m589k|145k 28k 4277 457| 31 1
1sInValu536m | | |
Error found. Mode INITIAL_READ, total errors 0x20000000 out of 0x2C000000 (72.72727273%)
Errors address range: 0x190000000..=0x20FFFFFFF iteration:1
values range: 0x00000000..=0x00000000 FFFFFFFF-like count:0 bit-level stats table:
0x0 0x1 0x2 0x3| 0x4 0x5 0x6 0x7| 0x8 0x9 0xA 0xB| 0xC 0xD 0xE 0xF
SinglIdx | 1 | | 1
TogglCnt 2 56 761|6672 42k 205k793k| 2m 6m 14m 27m| 45m 63m 76m 81m
0x1? 74m 58m 40m 24m| 12m 5m 1m589k|145k 28k 4277 457| 31 1
1sInValu536m | | |
Standard 5-minute test of 1: Bus=0x03:00 DevId=0x56A0 16GB Intel(R) Arc(tm) A770 Graphics (DG2)
1 iteration. Passed 5.6310 seconds written: 5.5GB 956.2GB/sec checked: 8.2GB 1.5GB/sec
Error found. Mode NEXT_RE_READ, total errors 0x20000000 out of 0x2C000000 (72.72727273%)
Errors address range: 0x30000000..=0xAFFFFFFF iteration:1
values range: 0x00000000..=0x00000000 FFFFFFFF-like count:0 bit-level stats table:
0x0 0x1 0x2 0x3| 0x4 0x5 0x6 0x7| 0x8 0x9 0xA 0xB| 0xC 0xD 0xE 0xF
SinglIdx | 1 | | 1
TogglCnt 2 56 761|6673 42k 205k793k| 2m 6m 14m 27m| 45m 63m 76m 81m
0x1? 74m 58m 40m 24m| 12m 5m 1m589k|145k 28k 4277 457| 31 1
1sInValu536m | | |
Error found. Mode INITIAL_READ, total errors 0x20000000 out of 0x2C000000 (72.72727273%)
Errors address range: 0xE0000000..=0x15FFFFFFF iteration:2
values range: 0x00000000..=0x00000000 FFFFFFFF-like count:0 bit-level stats table:
0x0 0x1 0x2 0x3| 0x4 0x5 0x6 0x7| 0x8 0x9 0xA 0xB| 0xC 0xD 0xE 0xF
SinglIdx | 1 | | 1
TogglCnt 2 56 760|6653 42k 204k789k| 2m 6m 14m 27m| 45m 63m 76m 81m
0x1? 74m 58m 40m 24m| 12m 5m 1m589k|145k 28k 4277 457| 31 1
1sInValu536m | | |
@BA8F0D39 I checked the repo, https://github.com/GpuZelenograd/memtest_vulkan. It should be an OpenCL-based application (tool). As far as I know, A64 stateless addressing has a big performance penalty on Arc, so I guess the OpenCL driver disables >4GB allocation. Regarding the IPEX stack, not all underlying components guarantee A64 stateless addressing. So after the next code synchronization, IPEX will raise an explicit error to users as well.
Could you please provide an update on the status of this issue? On the latest xpu-master branch, I have observed that it currently exhibits intermittent behavior. At times, when allocating a batch size larger than 4 GB, it crashes with the -5 error, while at other times it functions correctly without any issues. Or might the -5 error I am getting be related to another issue? Interestingly, from my observations, the error does not seem to occur when the batch size remains under 4 GB.
I am using Xorg on Ubuntu 22.04. On the xpu-master branch, allocating more than 4 GB in PyTorch crashes Xorg for some reason. It seems to be overwriting Xorg's memory regions on the GPU.
@fredlarochelle @BA8F0D39 After the next code synchronization, memory allocation greater than 4 GB will be disabled on Arc, and an error message will be raised when a user requests it.
@cchheennhhaaoo That is not a fix at all tho...
@cchheennhhaaoo
I can still allocate more than 4 GB on Intel Arc with IPEX 2.0.110+xpu. However, feeding large images into resnet50 produces invalid results even though only 8 GB of the 16 GB on the Intel Arc A770 is used.
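A small user-side guard that flags tensors above the 4 GiB single-allocation limit before moving them to the device (a workaround sketch, not an IPEX API; it cannot catch oversized intermediate buffers created inside kernels):

import torch
import intel_extension_for_pytorch as ipex  # noqa: F401

FOUR_GIB = 4 * 2**30

def to_xpu_checked(t: torch.Tensor) -> torch.Tensor:
    # Warn when a single tensor would need a device buffer above 4 GiB.
    size = t.element_size() * t.nelement()
    if size > FOUR_GIB:
        print(f"warning: tensor is {size / 2**30:.2f} GiB, above the 4 GiB single-buffer limit")
    return t.to('xpu')

x = to_xpu_checked(torch.rand(8, 3, 8192, 8192))  # 6 GiB input, triggers the warning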
I have exactly the same bug: during torch fine-tuning, my script crashes with PI_ERROR_INVALID_MEM_OBJECT. Full stack:
---> piContextRelease(
<unknown> : 0x5691520
ZE ---> zeEventDestroy(Event->ZeEvent)
ZE ---> zeEventDestroy(Event->ZeEvent)
ZE ---> zeEventDestroy(Event->ZeEvent)
ZE ---> zeEventDestroy(Event->ZeEvent)
ZE ---> zeEventDestroy(Event->ZeEvent)
ZE ---> zeEventDestroy(Event->ZeEvent)
ZE ---> zeEventDestroy(Event->ZeEvent)
ZE ---> zeEventPoolDestroy(ZePool)
ZE ---> zeCommandListDestroy(ZeCommandListInit)
ZE ---> zeCommandListDestroy(ZeCommandList)
ZE ---> zeCommandListDestroy(ZeCommandList)
ZE ---> zeCommandListDestroy(ZeCommandList)
ZE ---> zeMemFree(Context->ZeContext, Ptr)
ZE ---> zeContextDestroy(DestoryZeContext)
) ---> pi_result : PI_SUCCESS
---> piKernelRelease(
<unknown> : 0xf4d7d10
ZE ---> zeKernelDestroy(Kernel->ZeKernel)
PI ---> piProgramRelease(KernelProgram)
) ---> pi_result : PI_SUCCESS
---> piProgramRelease(
<unknown> : 0xf4c3e20
ZE ---> zeModuleBuildLogDestroy(ZeBuildLog)
ZE ---> zeModuleDestroy(ZeModule)
) ---> pi_result : PI_SUCCESS
---> piKernelRelease(
<unknown> : 0x1053fe50
ZE ---> zeKernelDestroy(Kernel->ZeKernel)
PI ---> piProgramRelease(KernelProgram)
) ---> pi_result : PI_SUCCESS
---> piKernelRelease(
<unknown> : 0x1053f3e0
ZE ---> zeKernelDestroy(Kernel->ZeKernel)
PI ---> piProgramRelease(KernelProgram)
) ---> pi_result : PI_SUCCESS
---> piProgramRelease(
<unknown> : 0xf4e63a0
ZE ---> zeModuleBuildLogDestroy(ZeBuildLog)
ZE ---> zeModuleDestroy(ZeModule)
) ---> pi_result : PI_SUCCESS
---> piKernelRelease(
<unknown> : 0xf4e73c0
ZE ---> zeKernelDestroy(Kernel->ZeKernel)
PI ---> piProgramRelease(KernelProgram)
) ---> pi_result : PI_SUCCESS
---> piKernelRelease(
<unknown> : 0xf4e56d0
ZE ---> zeKernelDestroy(Kernel->ZeKernel)
PI ---> piProgramRelease(KernelProgram)
) ---> pi_result : PI_SUCCESS
---> piProgramRelease(
<unknown> : 0xf4c3f50
ZE ---> zeModuleBuildLogDestroy(ZeBuildLog)
ZE ---> zeModuleDestroy(ZeModule)
) ---> pi_result : PI_SUCCESS
---> piDeviceRelease(
<unknown> : 0xf474a70
) ---> pi_result : PI_SUCCESS
---> piDeviceRelease(
<unknown> : 0xf475160
) ---> pi_result : PI_SUCCESS
---> piTearDown(
<unknown> : 0
) ---> pi_result : PI_SUCCESS
[out]void * : 0
---> piTearDown(
<unknown> : 0
ZE_DEBUG=4: check balance of create/destroy calls
----------------------------------------------------------
zeContextCreate = 1 \---> zeContextDestroy = 1
zeCommandQueueCreate = 2 \---> zeCommandQueueDestroy = 2
zeModuleCreate = 3 \---> zeModuleDestroy = 3
zeKernelCreate = 5 \---> zeKernelDestroy = 5
zeEventPoolCreate = 1 \---> zeEventPoolDestroy = 1
zeCommandListCreateImmediate = 1 |
zeCommandListCreate = 3 \---> zeCommandListDestroy = 4
zeEventCreate = 7 \---> zeEventDestroy = 7
zeFenceCreate = 3 \---> zeFenceDestroy = 3
zeImageCreate = 0 \---> zeImageDestroy = 0
zeSamplerCreate = 0 \---> zeSamplerDestroy = 0
zeMemAllocDevice = 2 |
zeMemAllocHost = 0 |
zeMemAllocShared = 0 \---> zeMemFree = 1 ---> LEAK = 1
) ---> pi_result : -38
[out]void * : 0
terminate called after throwing an instance of 'sycl::_V1::runtime_error'
what(): Native API failed. Native API returns: -38 (PI_ERROR_INVALID_MEM_OBJECT) -38 (PI_ERROR_INVALID_MEM_OBJECT)
Abandon (core dumped)
I have an Arc A770 with 16 GB of memory. For this I use the latest transformers version, which integrates XPU compute. Is a fix to use all available memory planned?
@cchheennhhaaoo I can still allocate more than 4 GB on Intel Arc with IPEX 2.0.110+xpu. However, feeding large images into resnet50 produces invalid results even though only 8 GB of the 16 GB on the Intel Arc A770 is used.
Please check this line in your repo. https://github.com/intel/intel-extension-for-pytorch/blob/xpu-master/csrc/gpu/runtime/CachingDeviceAllocator.cpp#L190
For the invalid result issue, please refer to arthuryuan1987's comment above.
@BA8F0D39 @fredlarochelle We don't plan to support this. You can still allocate >4 GB with 2.0.110+xpu because we disabled the allocation in master, not in the previously released drop. Could you please provide the justification for why >4 GB allocation is required?