intel-extension-for-pytorch

Arrays larger than 4 GB crash

Open BA8F0D39 opened this issue 2 years ago • 53 comments

Describe the bug

Intel compute runtime doesn't allow allocating a buffer bigger than 4 GB.

https://github.com/intel/compute-runtime/issues/627

When you allocate an array bigger than 4 GB with intel-extension-for-pytorch on an A770 16GB, it crashes.

x = torch.rand(46000, 46000, dtype=torch.float32, device='xpu')

Is it possible to allocate multiple buffers for an array instead of allocating one buffer for one array?
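
For illustration, here is a rough sketch of the kind of user-level workaround I have in mind: computing the same statistic from sub-4 GB row blocks instead of one huge buffer. It assumes only the public torch API and an available xpu device; it is not a proposal for how the extension should allocate internally.

import torch
import intel_extension_for_pytorch as ipex  # noqa: F401  # registers the 'xpu' device

def mean_of_large_rand(rows, cols, chunk_rows=10000, device='xpu'):
    # Weighted average of per-block means; each block stays well under 4 GB.
    total, count = 0.0, 0
    for start in range(0, rows, chunk_rows):
        n = min(chunk_rows, rows - start)
        block = torch.rand(n, cols, dtype=torch.float32, device=device)
        total += block.mean().item() * n * cols  # .item() copies the scalar to the host
        count += n * cols
        del block  # release the block before allocating the next one
    return total / count

print(mean_of_large_rand(46000, 46000))  # ~0.5 expected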

Versions

Collecting environment information...
PyTorch version: 1.13.0a0+gitb1dde16
PyTorch CXX11 ABI: Yes
IPEX version: 1.13.10+xpu
IPEX commit: 7d85b0e92
Build type: Release

OS: Ubuntu 22.04.1 LTS (x86_64)
GCC version: (Ubuntu 11.3.0-1ubuntu1~22.04) 11.3.0
Clang version: N/A
IGC version: N/A
CMake version: N/A
Libc version: glibc-2.35

Python version: 3.10.6 (main, Nov 14 2022, 16:10:14) [GCC 11.3.0] (64-bit runtime)
Python platform: Linux-6.3.0-1-x86_64-with-glibc2.35
Is XPU available: True
DPCPP runtime version: N/A
MKL version: N/A
GPU models and configuration: 
[0] _DeviceProperties(name='Intel(R) Graphics [0x56a0]', platform_name='Intel(R) Level-Zero', dev_type='gpu, support_fp64=0, total_memory=15473MB, max_compute_units=512)
Intel OpenCL ICD version: 22.43.24595.35+i538~22.04
Level Zero version: 1.3.24595.35+i538~22.04

CPU:
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Address sizes:                   46 bits physical, 48 bits virtual
Byte Order:                      Little Endian
CPU(s):                          20
On-line CPU(s) list:             0-19
Vendor ID:                       GenuineIntel
BIOS Vendor ID:                  Intel(R) Corporation
Model name:                      13th Gen Intel(R) Core(TM) i5-13600K
BIOS Model name:                 13th Gen Intel(R) Core(TM) i5-13600K
CPU family:                      6
Model:                           183
Thread(s) per core:              2
Core(s) per socket:              14
Socket(s):                       1
Stepping:                        1
CPU max MHz:                     5100.0000
CPU min MHz:                     800.0000
BogoMIPS:                        6991.00
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb intel_pt sha_ni xsaveopt xsavec xgetbv1 xsaves split_lock_detect avx_vnni dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp hwp_pkg_req hfi umip pku ospke waitpkg gfni vaes vpclmulqdq tme rdpid movdiri movdir64b fsrm md_clear serialize pconfig arch_lbr ibt flush_l1d arch_capabilities
Virtualization:                  VT-x
L1d cache:                       544 KiB (14 instances)
L1i cache:                       704 KiB (14 instances)
L2 cache:                        20 MiB (8 instances)
L3 cache:                        24 MiB (1 instance)
NUMA node(s):                    1
NUMA node0 CPU(s):               0-19
Vulnerability Itlb multihit:     Not affected
Vulnerability L1tf:              Not affected
Vulnerability Mds:               Not affected
Vulnerability Meltdown:          Not affected
Vulnerability Mmio stale data:   Not affected
Vulnerability Retbleed:          Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:        Mitigation; Enhanced / Automatic IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS SW sequence
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Not affected

Versions of relevant libraries:
[pip3] intel-extension-for-pytorch==1.13.10+xpu
[pip3] numpy==1.24.1
[pip3] torch==1.13.0a0+gitb1dde16
[pip3] torchvision==0.14.1a0+0504df5
[conda] N/A

BA8F0D39 avatar Apr 08 '23 21:04 BA8F0D39

@tye1

jingxu10 avatar Apr 16 '23 21:04 jingxu10

I did some further tests and it seems like allocating more than 4GB returns garbage or randomly crashes.

Example of allocating less than 4GB on an A770 16GB. The mean is around 0.5, which is expected.

import torch
import torchvision.models as models

import numpy as np
import intel_extension_for_pytorch as ipex

torch.manual_seed(0)

x = torch.rand(30000, 30000, dtype=torch.float32, device='xpu')

print("Mean")
print(torch.mean(x).detach().cpu().numpy())


python3 ./test.py 
 Failed to load image Python extension: 
  warn(f"Failed to load image Python extension: {e}")
Mean
0.50001085

Example of allocating more than 4GB on the CPU. The mean is around 0.5, which is expected.

import torch
import torchvision.models as models

import numpy as np
import intel_extension_for_pytorch as ipex

torch.manual_seed(0)

x = torch.rand(47000, 47000, dtype=torch.float32, device='cpu')

print("Mean")
print(torch.mean(x).detach().cpu().numpy())



python3 ./test.py 
/usr/local/lib/python3.10/dist-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: 
  warn(f"Failed to load image Python extension: {e}")
Mean
0.4999941

Example of allocating more than 4GB on an A770 16GB. The mean is around 0.014, which is completely wrong.

import torch
import torchvision.models as models

import numpy as np
import intel_extension_for_pytorch as ipex

torch.manual_seed(0)

x = torch.rand(47000, 47000, dtype=torch.float32, device='xpu')

print("Mean")
print(torch.mean(x).detach().cpu().numpy())


python3 ./test.py 
/usr/local/lib/python3.10/dist-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: 
  warn(f"Failed to load image Python extension: {e}")
Mean
0.014004011

In conclusion, allocating more than 4GB crashes or returns complete garbage.

BA8F0D39 avatar Apr 17 '23 06:04 BA8F0D39

@jingxu10 Is memory allocation done by OpenCL, Level Zero, or OneDNN?

BA8F0D39 avatar Apr 24 '23 22:04 BA8F0D39

It should be allocated by Level Zero. @gujinghui

jingxu10 avatar Apr 24 '23 22:04 jingxu10

@jingxu10

Will passing -ze-opt-greater-than-4GB-buffer-required into the build options fix it?

https://spec.oneapi.io/level-zero/latest/core/PROG.html#module-build-options

BA8F0D39 avatar Apr 25 '23 02:04 BA8F0D39

Hi, @BA8F0D39
What's the driver version? I cannot reproduce the random crash with agama-ci-devel-602. From what I've tried, the maximum workable input shape for your unit test is about 59500*59500, which corresponds to a memory size of 13.2 GB; that is a reasonable result. As for the accuracy issue, we will check it.

cchheennhhaaoo avatar Apr 27 '23 06:04 cchheennhhaaoo

Hi @BA8F0D39

Thank you for using Intel products and IPEX. On our side we can successfully allocate large memory (not larger than the total physical memory size) and compute correctly. Can you provide the driver version you are using via the command below? sudo dpkg -l | grep intel

And is it possible to add the following flags and attach the log here when you hit the error?

export SYCL_PI_TRACE=-1
export ZE_DEBUG=-1

Thank you.

zejun-chen avatar Apr 27 '23 06:04 zejun-chen

@cchheennhhaaoo

On Windows 11 WSL

ii  intel-level-zero-gpu                  1.3.24595.35+i538~22.04                 amd64        Intel(R) Graphics Compute Runtime for oneAPI Level Zero.
ii  intel-oneapi-runtime-ccl              2021.8.0-25371                          amd64        Intel® oneAPI Collective Communications Library runtime
ii  intel-oneapi-runtime-compilers        2023.0.0-25370                          amd64        Intel® oneAPI DPC++/C++ Compiler & Intel® C++ Compiler Classic runtime common files
ii  intel-oneapi-runtime-compilers-common 2023.0.0-25370                          all          Intel® oneAPI DPC++/C++ Compiler & Intel® C++ Compiler Classic runtime common files
ii  intel-oneapi-runtime-dpcpp-cpp        2023.0.0-25370                          amd64        Intel® oneAPI DPC++/C++ Compiler & Intel® C++ Compiler Classic runtime
ii  intel-oneapi-runtime-dpcpp-cpp-common 2023.0.0-25370                          all          Intel® oneAPI DPC++/C++ Compiler & Intel® C++ Compiler Classic runtime
ii  intel-oneapi-runtime-mkl              2023.0.0-25398                          amd64        Intel® oneAPI Math Kernel Library runtime
ii  intel-oneapi-runtime-mkl-common       2023.0.0-25398                          all          Intel® oneAPI Math Kernel Library runtime common
ii  intel-oneapi-runtime-mpi              2021.8.0-25329                          amd64        Intel® MPI Library runtime
ii  intel-oneapi-runtime-opencl           2023.0.0-25370                          amd64        Intel® CPU Runtime for OpenCL(TM) Applications runtime
ii  intel-oneapi-runtime-openmp           2023.0.0-25370                          amd64        Intel® OpenMP* Runtime Library runtime
ii  intel-oneapi-runtime-openmp-common    2023.0.0-25370                          all          l_openmp.runtime.description>
ii  intel-oneapi-runtime-tbb              2021.8.0-25334                          amd64        Intel® oneAPI Threading Building Blocks runtime
ii  intel-oneapi-runtime-tbb-common       2021.8.0-25334                          all          Intel® oneAPI Threading Building Blocks runtime common
ii  intel-opencl-icd                      22.43.24595.35+i538~22.04               amd64        Intel graphics compute runtime for OpenCL

Code

import torch
import torchvision.models as models

import numpy as np
import intel_extension_for_pytorch as ipex

torch.manual_seed(0)

x = torch.rand(47000, 47000, dtype=torch.float32, device='xpu')

print("Mean")
print(torch.mean(x).detach().cpu().numpy())
ZE ---> zeContextDestroy(DestoryZeContext)
ZE ---> zeKernelDestroy(Kernel->ZeKernel)
PI ---> piProgramRelease(KernelProgram)
ZE ---> zeModuleBuildLogDestroy(ZeBuildLog)
ZE ---> zeModuleDestroy(ZeModule)
ZE_DEBUG=4: check balance of create/destroy calls
----------------------------------------------------------
               zeContextCreate = 1     \--->              zeContextDestroy = 1
          zeCommandQueueCreate = 1     \--->         zeCommandQueueDestroy = 1
                zeModuleCreate = 1     \--->               zeModuleDestroy = 1
                zeKernelCreate = 1     \--->               zeKernelDestroy = 1
             zeEventPoolCreate = 1     \--->            zeEventPoolDestroy = 1
  zeCommandListCreateImmediate = 1     |
           zeCommandListCreate = 2     \--->          zeCommandListDestroy = 3
                 zeEventCreate = 8     \--->                zeEventDestroy = 8
                 zeFenceCreate = 2     \--->                zeFenceDestroy = 2
                 zeImageCreate = 0     \--->                zeImageDestroy = 0
               zeSamplerCreate = 0     \--->              zeSamplerDestroy = 0
              zeMemAllocDevice = 1     |
                zeMemAllocHost = 0     |
              zeMemAllocShared = 0     \--->                     zeMemFree = 0     ---> LEAK = 1
terminate called after throwing an instance of 'sycl::_V1::runtime_error'
  what():  Native API failed. Native API returns: -38 (PI_ERROR_INVALID_MEM_OBJECT) -38 (PI_ERROR_INVALID_MEM_OBJECT)
Aborted

crashlog.txt

BA8F0D39 avatar Apr 27 '23 07:04 BA8F0D39

@cchheennhhaaoo @zejun-chen

On Ubuntu 22.04, Linux 6.3, it also crashes, but only after I close Python.

ii  intel-level-zero-gpu                  1.3.25593.18-601~22.04                   amd64        Intel(R) Graphics Compute Runtime for oneAPI Level Zero.
ii  intel-oneapi-runtime-ccl              2021.9.0-43543                           amd64        Intel® oneAPI Collective Communications Library runtime
ii  intel-oneapi-runtime-compilers        2023.1.0-46305                           amd64        Intel® oneAPI DPC++/C++ Compiler & Intel® C++ Compiler Classic runtime common files
ii  intel-oneapi-runtime-compilers-common 2023.1.0-46305                           all          Intel® oneAPI DPC++/C++ Compiler & Intel® C++ Compiler Classic runtime common files
ii  intel-oneapi-runtime-dpcpp-cpp        2023.1.0-46305                           amd64        Intel® oneAPI DPC++/C++ Compiler & Intel® C++ Compiler Classic runtime
ii  intel-oneapi-runtime-dpcpp-cpp-common 2023.1.0-46305                           all          Intel® oneAPI DPC++/C++ Compiler & Intel® C++ Compiler Classic runtime
ii  intel-oneapi-runtime-mkl              2023.1.0-46342                           amd64        Intel® oneAPI Math Kernel Library runtime
ii  intel-oneapi-runtime-mkl-common       2023.1.0-46342                           all          Intel® oneAPI Math Kernel Library runtime common
ii  intel-oneapi-runtime-mpi              2021.9.0-43482                           amd64        Intel® MPI Library runtime
ii  intel-oneapi-runtime-opencl           2023.1.0-46305                           amd64        Intel® CPU Runtime for OpenCL(TM) Applications runtime
ii  intel-oneapi-runtime-openmp           2023.1.0-46305                           amd64        Intel® OpenMP* Runtime Library runtime
ii  intel-oneapi-runtime-openmp-common    2023.1.0-46305                           all          l_openmp.runtime.description>
ii  intel-oneapi-runtime-tbb              2021.9.0-43484                           amd64        Intel® oneAPI Threading Building Blocks runtime
ii  intel-oneapi-runtime-tbb-common       2021.9.0-43484                           all          Intel® oneAPI Threading Building Blocks runtime common
ii  intel-opencl-icd                      23.05.25593.18-601~22.04                 amd64        Intel graphics compute runtime for OpenCL
ii  libdrm-intel1:amd64                   2.4.115+git2303241447.28d9a3c4~j~mesarc0 amd64        Userspace interface to intel-specific kernel DRM services -- runtime

Code

import torch
import torchvision.models as models

import numpy as np
import intel_extension_for_pytorch as ipex

torch.manual_seed(0)

x = torch.rand(47000, 47000, dtype=torch.float32, device='xpu')

print("Mean")
print(torch.mean(x).detach().cpu().numpy())

Crash

ZE ---> zeEventDestroy(Event->ZeEvent)
ZE ---> zeEventDestroy(Event->ZeEvent)
ZE ---> zeEventDestroy(Event->ZeEvent)
ZE ---> zeEventDestroy(Event->ZeEvent)
ZE ---> zeEventDestroy(Event->ZeEvent)
ZE ---> zeEventPoolDestroy(ZePool)
ZE ---> zeCommandListDestroy(ZeCommandListInit)
ZE ---> zeCommandListDestroy(ZeCommandList)
ZE ---> zeCommandListDestroy(ZeCommandList)
ZE ---> zeCommandListDestroy(ZeCommandList)
ZE ---> zeCommandListDestroy(ZeCommandList)
ZE ---> zeCommandListDestroy(ZeCommandList)
ZE ---> zeMemFree(Context->ZeContext, Ptr)
ZE ---> zeContextDestroy(DestoryZeContext)
ZE ---> zeKernelDestroy(Kernel->ZeKernel)
PI ---> piProgramRelease(KernelProgram)
ZE ---> zeModuleBuildLogDestroy(ZeBuildLog)
ZE ---> zeModuleDestroy(ZeModule)
ZE ---> zeKernelDestroy(Kernel->ZeKernel)
PI ---> piProgramRelease(KernelProgram)
ZE ---> zeKernelDestroy(Kernel->ZeKernel)
PI ---> piProgramRelease(KernelProgram)
ZE ---> zeModuleBuildLogDestroy(ZeBuildLog)
ZE ---> zeModuleDestroy(ZeModule)
ZE_DEBUG=4: check balance of create/destroy calls
----------------------------------------------------------
               zeContextCreate = 1     \--->              zeContextDestroy = 1    
          zeCommandQueueCreate = 2     \--->         zeCommandQueueDestroy = 2    
                zeModuleCreate = 2     \--->               zeModuleDestroy = 2    
                zeKernelCreate = 3     \--->               zeKernelDestroy = 3    
             zeEventPoolCreate = 1     \--->            zeEventPoolDestroy = 1    
  zeCommandListCreateImmediate = 1     | 
           zeCommandListCreate = 5     \--->          zeCommandListDestroy = 6    
                 zeEventCreate = 18    \--->                zeEventDestroy = 18   
                 zeFenceCreate = 5     \--->                zeFenceDestroy = 5    
                 zeImageCreate = 0     \--->                zeImageDestroy = 0    
               zeSamplerCreate = 0     \--->              zeSamplerDestroy = 0    
              zeMemAllocDevice = 2     | 
                zeMemAllocHost = 0     | 
              zeMemAllocShared = 0     \--->                     zeMemFree = 1     ---> LEAK = 1
terminate called after throwing an instance of 'sycl::_V1::runtime_error'
  what():  Native API failed. Native API returns: -38 (PI_ERROR_INVALID_MEM_OBJECT) -38 (PI_ERROR_INVALID_MEM_OBJECT)
 
 
Aborted (core dumped)

crash2.txt

BA8F0D39 avatar Apr 27 '23 07:04 BA8F0D39

I believe this issue is caused by an incorrect environment setup. You can follow this blog to set up the IPEX environment on WSL2 with Docker: https://medium.com/intel-analytics-software/stable-diffusion-with-intel-arc-gpus-f2986bba8365

cchheennhhaaoo avatar Apr 27 '23 08:04 cchheennhhaaoo

@cchheennhhaaoo @zejun-chen I have the same problem on Ubuntu Linux too (not using Windows).

The driver versions, code, and crash log are identical to my previous comment above (Ubuntu 22.04, kernel 6.3; it also crashes, but only after I close Python).

crash2.txt

BA8F0D39 avatar Apr 27 '23 08:04 BA8F0D39

I am able to replicate the same issue on Fedora 37 with kernel 6.2 and Ubuntu 22.04 with kernel 5.19. Both instances involve a build from the latest xpu-master branch.

fredlarochelle avatar May 12 '23 22:05 fredlarochelle

It is weird that the crash error is only reported when you enable the debug flags; otherwise the code crashes silently.

export SYCL_PI_TRACE=-1
export ZE_DEBUG=-1

BA8F0D39 avatar May 13 '23 20:05 BA8F0D39

Here are some quick findings I had: it's not exactly at 4GB, and I don't think the gibberish is related...

# All good
import torch
import intel_extension_for_pytorch as ipex

array = torch.rand(40000, 40000, dtype=torch.bfloat16, device='xpu')

print(f"The memory of the array is {(array.element_size() * array.nelement()) / 1e9}GB.") #3.2GB
print("Mean:", torch.mean(array).item()) #0.5
print("Standard Deviation:", torch.std(array).item()) #0.287109375

# All good
import torch
import intel_extension_for_pytorch as ipex

array = torch.rand(46000, 46000, dtype=torch.bfloat16, device='xpu')

print(f"The memory of the array is {(array.element_size() * array.nelement()) / 1e9}GB.") #4.232GB
print("Mean:", torch.mean(array).item()) #0.5
print("Standard Deviation:", torch.std(array).item()) #0.2890625

# At 46001x46001 it goes gibberish
import torch
import intel_extension_for_pytorch as ipex

array = torch.rand(46001, 46001, dtype=torch.bfloat16, device='xpu')

print(f"The memory of the array is {(array.element_size() * array.nelement()) / 1e9}GB.") #4.423218400GB
print("Mean:", torch.mean(array).item()) #0.00372314453125
print("Standard Deviation:", torch.std(array).item()) #0.049072265625

For FP16, I have some other weird bugs where sometimes it works and sometimes it doesn't, even for small arrays (less than 10000x10000). Even across multiple consecutive runs, it might work 50 times in a row, then go bonkers for 10.

For FP32, the gibberish starts appearing at around 30800x30800, which is 3.79456 GB. Before that, starting around 30400x30400, the output alternates between gibberish and good results across multiple successive runs.

With such numerical instability, I might write a script and test every possible combination at this point; it might be worth taking a look at other random sampling methods too.
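
A rough sketch of such a scan (my own illustration; the step size and the 0.01 tolerance are arbitrary, and the cache call assumes the torch.xpu memory-management namespace that IPEX exposes):

import torch
import intel_extension_for_pytorch as ipex  # noqa: F401

# Walk the square size upward and flag sizes whose mean drifts far from 0.5.
for n in range(30000, 48001, 1000):
    x = torch.rand(n, n, dtype=torch.float32, device='xpu')
    nbytes = x.element_size() * x.nelement()
    m = x.mean().item()
    status = "OK" if abs(m - 0.5) < 0.01 else "SUSPECT"
    print(f"{n}x{n}  {nbytes / 1e9:7.2f} GB  mean={m:.6f}  {status}")
    del x
    torch.xpu.empty_cache()  # free the cached block before the next, larger allocation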

fredlarochelle avatar May 25 '23 05:05 fredlarochelle

Just did another quick run for FP32 at 30800x30800, and this time it works just fine (even 32000x32000 works this time around); there is some weird instability going on...

Quick thought: since I am not using a fixed seed in those tests, might it be that some "bad seeds" are causing the instability?

fredlarochelle avatar May 25 '23 05:05 fredlarochelle

@fredlarochelle I think some pointers in the oneDNN GPU kernels use 32-bit unsigned integers and some use 64-bit unsigned integers. Reading more than 4GB causes a buffer over-read (reading adjacent memory locations and other arrays).

If the adjacent memory locations just so happen to contain zeros, then the mean is around 0.

If the adjacent memory locations just so happen to contain uniformly distributed values from 0 to 1, then the mean is 0.5.

It could also allow you to read other programs' data on the GPU.
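
As a back-of-the-envelope check of that theory (my own arithmetic, not a claim about the actual kernel code), the fp32 sizes from the earlier tests land on either side of what a 32-bit byte offset can address:

# Illustrative arithmetic only: tensor sizes versus the 4 GiB range of a 32-bit byte offset.
LIMIT = 2 ** 32  # 4 GiB
for n in (30000, 46000, 47000):
    size = n * n * 4  # bytes, float32 elements
    verdict = "exceeds" if size > LIMIT else "fits within"
    print(f"{n}x{n} fp32: {size / 2**30:.2f} GiB -> {verdict} the 32-bit offset range")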

BA8F0D39 avatar May 25 '23 20:05 BA8F0D39

@BA8F0D39 That would make sense, but I still get the instability for FP16, and FP32 starts acting weird before it would actually overfill a 32-bit buffer; there is probably more than one problem going on at the same time.

fredlarochelle avatar May 25 '23 20:05 fredlarochelle


@fredlarochelle @BA8F0D39 Thanks for the feedback.

The issue mentioned here (the so-called numerical instability) looks like one we hit recently in internal testing. It might be caused by cache consistency after a global memory fence. We are following up.

BTW, as for the crashes when allocating memory larger than 4GB, we cannot reproduce them on the recommended driver.

fengyuan14 avatar May 26 '23 01:05 fengyuan14

@arthuryuan1987 On Windows 11 with WSL, it crashes 100% of the time.

On Ubuntu Linux 22.04 with the 5.19 out-of-tree driver (intel-i915-dkms intel-platform-vsec-dkms intel-platform-cse-dkms intel-fw-gpu), it crashes randomly and non-deterministically. https://dgpu-docs.intel.com/driver/client/overview.html

On Ubuntu Linux 22.04 with the 6.3 mainline kernel, it also crashes randomly.

I can force it to crash 100% of the time by enabling the debug flags.

export SYCL_PI_TRACE=-1
export ZE_DEBUG=-1

BA8F0D39 avatar May 26 '23 22:05 BA8F0D39

@arthuryuan1987 I am on Ubuntu 22.04.2, 5.19.0.41-generic, on the latest driver, following the installation instructions in the documentation, with a build from the latest commit on the xpu-master branch.

fredlarochelle avatar May 27 '23 00:05 fredlarochelle

@arthuryuan1987

I used a Vulkan GPU memory tester. https://github.com/GpuZelenograd/memtest_vulkan

It seems all memory regions above 4GB are corrupt and the read transfer speed is 1.9 GB/s.

./memtest_vulkan 1 9140000000 
Error found. Mode NEXT_RE_READ, total errors 0x20000000 out of 0x2C000000 (72.72727273%)
Errors address range: 0x30000000..=0xAFFFFFFF  iteration:1
values range: 0x00000000..=0x00000000   FFFFFFFF-like count:0    bit-level stats table:
         0x0 0x1  0x2 0x3| 0x4 0x5  0x6 0x7| 0x8 0x9  0xA 0xB| 0xC 0xD  0xE 0xF
SinglIdx                 |   1             |                 |       1         
TogglCnt       2   56 761|6673 42k 205k793k|  2m  6m  14m 27m| 45m 63m  76m 81m
   0x1?  74m 58m  40m 24m| 12m  5m   1m589k|145k 28k 4277 457|  31   1         
1sInValu536m             |                 |                 |                 

Error found. Mode INITIAL_READ, total errors 0x20000000 out of 0x2C000000 (72.72727273%)
Errors address range: 0xE0000000..=0x15FFFFFFF  iteration:1
values range: 0x00000000..=0x00000000   FFFFFFFF-like count:0    bit-level stats table:
         0x0 0x1  0x2 0x3| 0x4 0x5  0x6 0x7| 0x8 0x9  0xA 0xB| 0xC 0xD  0xE 0xF
SinglIdx                 |   1             |                 |       1         
TogglCnt       2   56 761|6673 42k 205k793k|  2m  6m  14m 27m| 45m 63m  76m 81m
   0x1?  74m 58m  40m 24m| 12m  5m   1m589k|145k 28k 4277 457|  31   1         
1sInValu536m             |                 |                 |                 

Error found. Mode INITIAL_READ, total errors 0x20000000 out of 0x2C000000 (72.72727273%)
Errors address range: 0x190000000..=0x20FFFFFFF  iteration:1
values range: 0x00000000..=0x00000000   FFFFFFFF-like count:0    bit-level stats table:
         0x0 0x1  0x2 0x3| 0x4 0x5  0x6 0x7| 0x8 0x9  0xA 0xB| 0xC 0xD  0xE 0xF
SinglIdx                 |   1             |                 |       1         
TogglCnt       2   56 761|6672 42k 205k793k|  2m  6m  14m 27m| 45m 63m  76m 81m
   0x1?  74m 58m  40m 24m| 12m  5m   1m589k|145k 28k 4277 457|  31   1         
1sInValu536m             |                 |                 |                 

Standard 5-minute test of 1: Bus=0x03:00 DevId=0x56A0   16GB Intel(R) Arc(tm) A770 Graphics (DG2)
      1 iteration. Passed  5.6310 seconds  written:    5.5GB 956.2GB/sec        checked:    8.2GB   1.5GB/sec
Error found. Mode NEXT_RE_READ, total errors 0x20000000 out of 0x2C000000 (72.72727273%)
Errors address range: 0x30000000..=0xAFFFFFFF  iteration:1
values range: 0x00000000..=0x00000000   FFFFFFFF-like count:0    bit-level stats table:
         0x0 0x1  0x2 0x3| 0x4 0x5  0x6 0x7| 0x8 0x9  0xA 0xB| 0xC 0xD  0xE 0xF
SinglIdx                 |   1             |                 |       1         
TogglCnt       2   56 761|6673 42k 205k793k|  2m  6m  14m 27m| 45m 63m  76m 81m
   0x1?  74m 58m  40m 24m| 12m  5m   1m589k|145k 28k 4277 457|  31   1         
1sInValu536m             |                 |                 |                 

Error found. Mode INITIAL_READ, total errors 0x20000000 out of 0x2C000000 (72.72727273%)
Errors address range: 0xE0000000..=0x15FFFFFFF  iteration:2
values range: 0x00000000..=0x00000000   FFFFFFFF-like count:0    bit-level stats table:
         0x0 0x1  0x2 0x3| 0x4 0x5  0x6 0x7| 0x8 0x9  0xA 0xB| 0xC 0xD  0xE 0xF
SinglIdx                 |   1             |                 |       1         
TogglCnt       2   56 760|6653 42k 204k789k|  2m  6m  14m 27m| 45m 63m  76m 81m
   0x1?  74m 58m  40m 24m| 12m  5m   1m589k|145k 28k 4277 457|  31   1         
1sInValu536m             |                 |                 |                 

BA8F0D39 avatar Jun 09 '23 04:06 BA8F0D39

@BA8F0D39 I checked the repo, https://github.com/GpuZelenograd/memtest_vulkan. It appears to be an OpenCL-based application (tool). As far as I know, A64 stateless addressing has a big performance penalty on Arc, so I guess the OpenCL driver disables >4GB allocations. As for the IPEX stack, not all of the underlying components guarantee A64 stateless addressing, so after the next code synchronization IPEX will raise an explicit error to users as well.

fengyuan14 avatar Jun 12 '23 03:06 fengyuan14

Could you please provide an update on the status of this issue? On the latest xpu-master branch, I have observed intermittent behavior. At times, when allocating a batch larger than 4 GB, it crashes with the -5 error, while at other times it functions correctly without any issues. Or might the -5 error I am getting be related to another issue? Interestingly, from my observations, the error does not seem to occur when the batch size remains under 4 GB.

fredlarochelle avatar Jun 24 '23 00:06 fredlarochelle

I am using Xorg on Ubuntu 22.04. On the xpu-master branch, allocating more than 4GB in PyTorch crashes Xorg for some reason. It seems to be overwriting Xorg's memory regions on the GPU.

BA8F0D39 avatar Jun 25 '23 01:06 BA8F0D39

@fredlarochelle @BA8F0D39 After the next code synchronization, memory allocations greater than 4GB will be disabled on Arc, and an error message will be raised when a user requests one.
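
Until that guard lands, a user-side check in the same spirit can be sketched like this (purely illustrative; it is not IPEX's actual allocator code, and the helper name is made up):

import torch

FOUR_GB = 4 * 1024 ** 3  # the per-allocation limit being discussed

def rand_xpu_checked(*shape, dtype=torch.float32):
    # Refuse oversized single buffers up front instead of risking silent garbage.
    numel = 1
    for d in shape:
        numel *= d
    nbytes = numel * torch.finfo(dtype).bits // 8
    if nbytes > FOUR_GB:
        raise RuntimeError(
            f"requested {nbytes / 1024**3:.2f} GiB in a single buffer; "
            "split it into chunks smaller than 4 GiB")
    return torch.rand(*shape, dtype=dtype, device='xpu')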

cchheennhhaaoo avatar Jun 25 '23 02:06 cchheennhhaaoo

@cchheennhhaaoo That is not a fix at all tho...

fredlarochelle avatar Jun 29 '23 20:06 fredlarochelle

@cchheennhhaaoo
I can still allocate more than 4GB on Intel Arc with IPEX 2.0.110+xpu. However, inputting large images into resnet50 produces invalid results even though only 8GB of the 16GB Intel Arc A770 is used.

BA8F0D39 avatar Aug 23 '23 00:08 BA8F0D39

I have exactly the same bug: during torch fine-tuning, my script crashes with PI_ERROR_INVALID_MEM_OBJECT. Full stack:


---> piContextRelease(
        <unknown> : 0x5691520
ZE ---> zeEventDestroy(Event->ZeEvent)
ZE ---> zeEventDestroy(Event->ZeEvent)
ZE ---> zeEventDestroy(Event->ZeEvent)
ZE ---> zeEventDestroy(Event->ZeEvent)
ZE ---> zeEventDestroy(Event->ZeEvent)
ZE ---> zeEventDestroy(Event->ZeEvent)
ZE ---> zeEventDestroy(Event->ZeEvent)
ZE ---> zeEventPoolDestroy(ZePool)
ZE ---> zeCommandListDestroy(ZeCommandListInit)
ZE ---> zeCommandListDestroy(ZeCommandList)
ZE ---> zeCommandListDestroy(ZeCommandList)
ZE ---> zeCommandListDestroy(ZeCommandList)
ZE ---> zeMemFree(Context->ZeContext, Ptr)
ZE ---> zeContextDestroy(DestoryZeContext)
) --->  pi_result : PI_SUCCESS

---> piKernelRelease(
        <unknown> : 0xf4d7d10
ZE ---> zeKernelDestroy(Kernel->ZeKernel)
PI ---> piProgramRelease(KernelProgram)
) --->  pi_result : PI_SUCCESS

---> piProgramRelease(
        <unknown> : 0xf4c3e20
ZE ---> zeModuleBuildLogDestroy(ZeBuildLog)
ZE ---> zeModuleDestroy(ZeModule)
) --->  pi_result : PI_SUCCESS

---> piKernelRelease(
        <unknown> : 0x1053fe50
ZE ---> zeKernelDestroy(Kernel->ZeKernel)
PI ---> piProgramRelease(KernelProgram)
) --->  pi_result : PI_SUCCESS

---> piKernelRelease(
        <unknown> : 0x1053f3e0
ZE ---> zeKernelDestroy(Kernel->ZeKernel)
PI ---> piProgramRelease(KernelProgram)
) --->  pi_result : PI_SUCCESS

---> piProgramRelease(
        <unknown> : 0xf4e63a0
ZE ---> zeModuleBuildLogDestroy(ZeBuildLog)
ZE ---> zeModuleDestroy(ZeModule)
) --->  pi_result : PI_SUCCESS

---> piKernelRelease(
        <unknown> : 0xf4e73c0
ZE ---> zeKernelDestroy(Kernel->ZeKernel)
PI ---> piProgramRelease(KernelProgram)
) --->  pi_result : PI_SUCCESS

---> piKernelRelease(
        <unknown> : 0xf4e56d0
ZE ---> zeKernelDestroy(Kernel->ZeKernel)
PI ---> piProgramRelease(KernelProgram)
) --->  pi_result : PI_SUCCESS

---> piProgramRelease(
        <unknown> : 0xf4c3f50
ZE ---> zeModuleBuildLogDestroy(ZeBuildLog)
ZE ---> zeModuleDestroy(ZeModule)
) --->  pi_result : PI_SUCCESS

---> piDeviceRelease(
        <unknown> : 0xf474a70
) --->  pi_result : PI_SUCCESS

---> piDeviceRelease(
        <unknown> : 0xf475160
) --->  pi_result : PI_SUCCESS

---> piTearDown(
        <unknown> : 0
) --->  pi_result : PI_SUCCESS
        [out]void * : 0

---> piTearDown(
        <unknown> : 0
ZE_DEBUG=4: check balance of create/destroy calls
----------------------------------------------------------
               zeContextCreate = 1     \--->              zeContextDestroy = 1
          zeCommandQueueCreate = 2     \--->         zeCommandQueueDestroy = 2
                zeModuleCreate = 3     \--->               zeModuleDestroy = 3
                zeKernelCreate = 5     \--->               zeKernelDestroy = 5
             zeEventPoolCreate = 1     \--->            zeEventPoolDestroy = 1
  zeCommandListCreateImmediate = 1     |
           zeCommandListCreate = 3     \--->          zeCommandListDestroy = 4
                 zeEventCreate = 7     \--->                zeEventDestroy = 7
                 zeFenceCreate = 3     \--->                zeFenceDestroy = 3
                 zeImageCreate = 0     \--->                zeImageDestroy = 0
               zeSamplerCreate = 0     \--->              zeSamplerDestroy = 0
              zeMemAllocDevice = 2     |
                zeMemAllocHost = 0     |
              zeMemAllocShared = 0     \--->                     zeMemFree = 1     ---> LEAK = 1
) --->  pi_result : -38
        [out]void * : 0

terminate called after throwing an instance of 'sycl::_V1::runtime_error'
  what():  Native API failed. Native API returns: -38 (PI_ERROR_INVALID_MEM_OBJECT) -38 (PI_ERROR_INVALID_MEM_OBJECT)
Abandon (core dumped)

I have an Arc A770 with 16GB of memory. To do this I use the latest transformers version, which integrates XPU compute. Is a fix to use all available memory planned?

Serizao avatar Sep 05 '23 22:09 Serizao

@cchheennhhaaoo I can still allocate more than 4GB on Intel Arc with IPEX 2.0.110+xpu. However, inputting large images into resnet50 produces invalid results even though only 8GB of the 16GB Intel Arc A770 is used.

Please check this line in the repo: https://github.com/intel/intel-extension-for-pytorch/blob/xpu-master/csrc/gpu/runtime/CachingDeviceAllocator.cpp#L190

For the invalid result issue, please refer to arthuryuan1987's comment above.

cchheennhhaaoo avatar Sep 07 '23 03:09 cchheennhhaaoo

@BA8F0D39 @fredlarochelle We don't plan to support this. You can still allocate >4GB with 2.0.110+xpu because we disabled the allocation in master, not in the previously released drop. Could you please provide the justification for why a >4GB allocation is required?

tye1 avatar Sep 19 '23 08:09 tye1