Visible Memory on Intel ARC GPUs for OpenCL and Level Zero

Open jjfumero opened this issue 2 years ago • 20 comments

I am using the Intel ARC 750 GPU, which has 8GB of memory. However, when running OpenCL and Level Zero, I only see 6.3 GB.

Output from clinfo using Windows WSL:

  ... 
  Global memory size                              6791413760 (6.325GiB)

I wonder why the whole GPU global memory space is not visible to OpenCL/Level Zero. Is this a bug?

The same happens on Linux Ubuntu 22.04 with the latest compute-runtime driver, 22.43.24558.

An example of just querying the device info for both Level Zero and OpenCL:

Driver: SPIRV
  Total number of SPIRV devices  : 1
        SPIRV -- SPIRV LevelZero - Intel(R) Graphics [0x56a1]
                Global Memory Size: 6.3 GB
                Local Memory Size: 64.0 KB
                Workgroup Dimensions: 3
                Total Number of Block Threads: 1024
                Max WorkGroup Configuration: [1024, 1024, 1024]
                Device OpenCL C version:  (LEVEL ZERO) 1.3

Driver: OpenCL
  Total number of OpenCL devices  : 1
        OPENCL --  [Intel(R) OpenCL HD Graphics] -- Intel(R) Graphics [0x56a1]
                Global Memory Size: 6.3 GB
                Local Memory Size: 64.0 KB
                Workgroup Dimensions: 3
                Total Number of Block Threads: 1024
                Max WorkGroup Configuration: [1024, 1024, 1024]
                Device OpenCL C version: OpenCL C 1.2
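
For reference, the underlying query here is just clGetDeviceInfo with CL_DEVICE_GLOBAL_MEM_SIZE. A minimal standalone sketch of the same query (assuming an installed OpenCL ICD loader; build with g++ query_mem.cpp -lOpenCL):

  // query_mem.cpp -- print the global memory size the driver reports.
  #include <CL/cl.h>
  #include <cstdio>

  int main() {
      cl_platform_id platform;
      cl_device_id device;
      if (clGetPlatformIDs(1, &platform, nullptr) != CL_SUCCESS) return 1;
      if (clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, nullptr) != CL_SUCCESS) return 1;

      cl_ulong globalMem = 0;
      clGetDeviceInfo(device, CL_DEVICE_GLOBAL_MEM_SIZE, sizeof(globalMem), &globalMem, nullptr);
      printf("Global Memory Size: %.1f GiB\n", globalMem / (1024.0 * 1024.0 * 1024.0));
      return 0;
  }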

jjfumero avatar Nov 29 '22 16:11 jjfumero

please see here https://github.com/intel/compute-runtime/blob/20b6c76298c6e42dc78c03805b9cb22691ad1f15/shared/source/os_interface/linux/drm_memory_manager.cpp#L1000

The driver reports 80% of the available memory to account for memory needed for internal allocations.
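
In essence (an illustrative sketch only, not the actual implementation; see the linked drm_memory_manager.cpp for the real logic):

  #include <cstdint>

  // Illustrative: the reported global memory size is the usable device
  // memory scaled by a fixed percentage, keeping headroom for the
  // driver's own internal allocations.
  uint64_t reportedGlobalMemSize(uint64_t usableMemory) {
      constexpr uint64_t percentReported = 80;  // the 0.8 factor mentioned above
      return usableMemory * percentReported / 100;
  }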

jandres742 avatar Nov 30 '22 00:11 jandres742

@jjfumero to confirm, is your device integrated or discrete?

jandres742 avatar Nov 30 '22 00:11 jandres742

Thanks @jandres742.

I am using an Intel discrete GPU. However (perhaps I should open a separate issue for this), on Linux the device name displayed by OpenCL and Level Zero is the same generic Intel(R) Graphics as for the Intel integrated GPU. It is a bit confusing, especially when I have both enabled.

Going back to the amount of visible memory on the GPU.

  • From the code snippet that you highlighted, does isLocalMemorySupported refer to shared memory (in CUDA terms)?
  • Is the same percentage for Windows and Linux?
  • Unless that remaining 20% is fully occupied for every application, does it make sense to tune these ratios in Level Zero? For instance, under some circumstances could I declare a 90%/10% split?

jjfumero avatar Nov 30 '22 06:11 jjfumero

From the code snippet that you highlighted, does isLocalMemorySupported refer to shared memory (in CUDA terms)?

device memory

Is the same percentage for Windows and Linux?

For Windows it is always 0.8. For Linux, it is 0.95 for discrete and 0.8 for integrated. Don't know if we need to change it.

Unless that remaining 20% is fully occupied for every application, does it make sense to tune these ratios in Level Zero? For instance, under some circumstances could I declare a 90%/10% split?

This 20% is occupied by the UMD, not the application. And yes, one possibility would be to provide an option to tune this, but so far we have seen this to be the best value experimentally.

jandres742 avatar Nov 30 '22 06:11 jandres742

Thanks @jandres742 . That clarifies the questions.

jjfumero avatar Nov 30 '22 06:11 jjfumero

The Linux kernel will automatically page memory between system and device memory when the amount of free device memory gets low (until things OOM), but such paging can have a fairly visible impact on performance...

eero-t avatar Dec 08 '22 19:12 eero-t

Hi @eero-t, sorry, I got lost. How is this related to the questions?

jjfumero avatar Dec 09 '22 09:12 jjfumero

@jjfumero It's just another reason why the kernel & user-space drivers want to make sure that the app's memory usage, together with the drivers' own overheads, stays below the HW total. The user-space driver uses the value provided by the kernel, from which the kernel driver has already deducted its own overheads.

Btw, you can monitor GPU memory usage across the whole system using the Sysman APIs (provided by compute-runtime when it's built with the level-zero headers present): https://spec.oneapi.io/level-zero/latest/sysman/api.html#zesmemorygetstate
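
A minimal sketch of such a query (assumes the level-zero loader and headers are installed, and ZES_ENABLE_SYSMAN=1 set in the environment so that core device handles can be used for Sysman, as older loaders required):

  // mem_state.cpp -- print free vs. total device memory via Sysman.
  // Build: g++ -O -o mem_state mem_state.cpp -lze_loader
  // Run with ZES_ENABLE_SYSMAN=1 in the environment.
  #include <level_zero/ze_api.h>
  #include <level_zero/zes_api.h>
  #include <cstdio>
  #include <vector>

  int main() {
      if (zeInit(0) != ZE_RESULT_SUCCESS) return 1;

      uint32_t driverCount = 1;
      ze_driver_handle_t driver;
      zeDriverGet(&driverCount, &driver);

      uint32_t deviceCount = 1;
      ze_device_handle_t device;
      zeDeviceGet(driver, &deviceCount, &device);

      // With ZES_ENABLE_SYSMAN=1, a core device handle doubles as a Sysman handle.
      zes_device_handle_t sysman = (zes_device_handle_t)device;

      uint32_t memCount = 0;
      zesDeviceEnumMemoryModules(sysman, &memCount, nullptr);
      std::vector<zes_mem_handle_t> mems(memCount);
      zesDeviceEnumMemoryModules(sysman, &memCount, mems.data());

      for (zes_mem_handle_t mem : mems) {
          zes_mem_state_t state = {ZES_STRUCTURE_TYPE_MEM_STATE};
          if (zesMemoryGetState(mem, &state) == ZE_RESULT_SUCCESS)
              printf("memory: %llu free / %llu total bytes\n",
                     (unsigned long long)state.free, (unsigned long long)state.size);
      }
      return 0;
  }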

Compute-runtime includes a small CLI tool to query these metrics: https://github.com/intel/compute-runtime/blob/master/level_zero/tools/test/black_box_tests/zello_sysman.cpp

After installing the level-zero-dev package, you can build it with: g++ -O -o zello_sysman zello_sysman.cpp -lze_loader -locloc

With level-zero & intel-level-zero-gpu (L0 backend provided by compute-runtime) packages present, it will show you all kinds of GPU metrics (supported by your HW + FW + kernel + user-space driver combo).

But if you want something more extensive, there's also XPU Manager: https://github.com/intel/xpumanager

eero-t avatar Dec 09 '22 09:12 eero-t

@eero-t thank you. All these tools look great. I will take a look.

jjfumero avatar Dec 09 '22 09:12 jjfumero

This looks like it's resolved. If not, please re-open or create a new issue.

AdamCetnerowski avatar Jan 12 '23 13:01 AdamCetnerowski

For Windows it is always 0.8. For Linux, it is 0.95 for discrete and 0.8 for integrated. Don't know if we need to change it.

@jandres742 the OpenCL specification says that 100% of the VRAM capacity should be reported in CL_DEVICE_GLOBAL_MEM_SIZE, no matter what. Reporting different (both wrong) values on Windows and Linux makes no sense. The reported value is the only way for software to know the VRAM capacity of the device, and software must be able to rely on it. If only 6.5GB is reported on an 8GB A750, the software thinks only 6.5GB is available and throws an error if the user wants to use all 8GB.

All other GPUs report their full VRAM capacity; only Intel Arc doesn't. The same discussion happened in 2009 for AMD GPUs.

This is a bug, please fix it.

ProjectPhysX avatar Mar 13 '23 21:03 ProjectPhysX

Thanks @ProjectPhysX. Could you point to the text you are referring to? This is what I see in https://man.opencl.org/clGetDeviceInfo.html

CL_DEVICE_GLOBAL_MEM_SIZE (cl_ulong): Size of global device memory in bytes.

but it is not explicitly stated what should be returned. There's also another interpretation here: https://stackoverflow.com/questions/4394819/cl-device-global-mem-size-returns-wrong-value

Also, by returning the value it currently does, isn't the OpenCL GPU driver doing exactly what you mentioned above?

The reported value is the only way for software to know the VRAM capacity of the device, and software must be able to rely on it.

The OpenCL GPU driver doesn't return the full memory, since the software (i.e., the OpenCL application) cannot rely on that full amount: some memory is needed for internal allocations. So it is effectively returning the memory the application can rely on.

Now, for GPUs from other vendors, I guess what you are saying is that the full memory is reported, but then an error is returned at allocation time; would that be correct? If so, then it would be a matter of deciding which is best: knowing upfront how much memory is truly available, to avoid unnecessary calls later, or waiting for calls to fail when no more memory is available.

Please let us know what you think.

jandres742 avatar Mar 13 '23 21:03 jandres742

Hi @jandres742, "global memory" for GPUs is the VRAM, and the size of global memory in bytes refers to the full physical VRAM capacity.

Returning only 80%/95% is arbitrary and not consistent with this. Usually almost the full amount of VRAM can be occupied by an application, and if that is too much, for example when another application already uses part of the VRAM, there will be an error during buffer allocation. Since there is always the possibility that other applications already occupy part of the VRAM, checking for errors during buffer allocation is not "unnecessary calls" but mandatory. All existing OpenCL applications already do it. The overhead of checking the returned error code is totally negligible compared to the runtime of the buffer allocation itself.
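
For illustration, a sketch of that check (the function and flags here are just examples; since clCreateBuffer may allocate lazily, migrating the buffer to the device is one way to force residency and surface the error at a known point):

  #include <CL/cl.h>
  #include <cstddef>

  // Try to allocate `bytes` of device memory; return the buffer or nullptr.
  cl_mem tryAlloc(cl_context context, cl_command_queue queue, size_t bytes) {
      cl_int err = CL_SUCCESS;
      cl_mem buffer = clCreateBuffer(context, CL_MEM_READ_WRITE, bytes, nullptr, &err);
      if (err != CL_SUCCESS) return nullptr;  // e.g. CL_MEM_OBJECT_ALLOCATION_FAILURE
      err = clEnqueueMigrateMemObjects(queue, 1, &buffer, 0, 0, nullptr, nullptr);
      if (err != CL_SUCCESS || clFinish(queue) != CL_SUCCESS) {
          clReleaseMemObject(buffer);
          return nullptr;  // allocation failed despite what GLOBAL_MEM_SIZE reported
      }
      return buffer;
  }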

Saying that only 13.6GB is reliably available on a 16GB GPU is false. There is some overhead from the Windows OS, on the order of 100-300MB, but not 20% of 16GB. Linux has zero overhead in VRAM. Giving up a lump sum of 20%/3.2GB of VRAM capacity is clearly the worse option. In this case, to use the full VRAM capacity that customers paid for when they bought the hardware, applications need to

  1. know that Intel's global memory size reporting is wrong and that there is actually more memory available than reported, which could be 25% more or 5.3% more, but it is not clear which, and
  2. do the "unnecessary calls" to check for valid buffer allocations beyond the reported size, because it could be that only 5.3% more is available, but maybe also 25%.

This seems like the worst possible option. Especially 1., a vendor-specific bug workaround, should not have to be present in any software; it defeats the purpose of a cross-vendor platform like OpenCL.

ProjectPhysX avatar Mar 14 '23 05:03 ProjectPhysX

Thanks @ProjectPhysX. Could you point to the text you are referring to? This is what I see in https://man.opencl.org/clGetDeviceInfo.html: CL_DEVICE_GLOBAL_MEM_SIZE (cl_ulong): Size of global device memory in bytes. ... The OpenCL GPU driver doesn't return the full memory, since the software (i.e., the OpenCL application) cannot rely on that full amount: some memory is needed for internal allocations. So it is effectively returning the memory the application can rely on.

I think the relevant part of the referred AMD discussion is that there's the CL_DEVICE_MAX_MEM_ALLOC_SIZE attribute, whose value should be constrained by the available memory (and by how large a block memory operations can address in general, e.g. in case something would still be relying on 32-bit offsets), so CL_DEVICE_GLOBAL_MEM_SIZE can report a larger value.
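
I.e. an application would consult both attributes; a small sketch:

  #include <CL/cl.h>
  #include <cstdio>

  // Print both limits: total global memory vs. the largest single
  // allocation the implementation guarantees to support.
  void printMemLimits(cl_device_id device) {
      cl_ulong globalSize = 0, maxAlloc = 0;
      clGetDeviceInfo(device, CL_DEVICE_GLOBAL_MEM_SIZE, sizeof(globalSize), &globalSize, nullptr);
      clGetDeviceInfo(device, CL_DEVICE_MAX_MEM_ALLOC_SIZE, sizeof(maxAlloc), &maxAlloc, nullptr);
      printf("global mem: %llu bytes, max single alloc: %llu bytes\n",
             (unsigned long long)globalSize, (unsigned long long)maxAlloc);
  }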

eero-t avatar Mar 14 '23 07:03 eero-t

We are investigating which values should be set. Re-opening until then.

AdamCetnerowski avatar Apr 20 '23 11:04 AdamCetnerowski

I see the ratio for Linux has been increased to 98%: https://github.com/intel/compute-runtime/commit/e8ac22c26508f1f32eba5f4057d3d9917bf352ff. Do we have a plan to increase that for Windows? The 80% ratio really limits AI workload capabilities on Arc.

Nuullll avatar Jul 07 '23 05:07 Nuullll

Just make it 100% already on all platforms...

ProjectPhysX avatar Jul 07 '23 05:07 ProjectPhysX

Good news! I was able to override the default ratio with the following environment variables:

NEOReadDebugKeys=1
ClDeviceGlobalMemSizeAvailablePercent=100

https://github.com/intel/compute-runtime/blob/8ed2cb2bfe7d749a8f5958da83e431fed1af0564/shared/source/device/device.cpp#L597-L602
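
A sketch of applying this from within an application on Linux (this assumes the keys are read when the driver library is first loaded, so they must be set before the first API call in the process; on Windows, set the variables before launching instead):

  #include <CL/cl.h>
  #include <cstdio>
  #include <cstdlib>

  int main() {
      // Must be set before the ICD loads the compute-runtime driver,
      // which reads its debug keys at initialization (POSIX setenv).
      setenv("NEOReadDebugKeys", "1", 1);
      setenv("ClDeviceGlobalMemSizeAvailablePercent", "100", 1);

      cl_platform_id platform;
      cl_device_id device;
      if (clGetPlatformIDs(1, &platform, nullptr) != CL_SUCCESS) return 1;
      if (clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, nullptr) != CL_SUCCESS) return 1;

      cl_ulong globalMem = 0;
      clGetDeviceInfo(device, CL_DEVICE_GLOBAL_MEM_SIZE, sizeof(globalMem), &globalMem, nullptr);
      printf("reported global memory: %llu bytes\n", (unsigned long long)globalMem);
      return 0;
  }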

Nuullll avatar Jul 07 '23 06:07 Nuullll

We are investigating which values should be set. Re-opening until then.

@AdamCetnerowski Any conclusions?

eero-t avatar Dec 13 '23 09:12 eero-t

To summarize where we're at:

  • We have documented the current limits
  • Reporting 100% would be misleading, as it is impossible to allocate all device-local memory to the user (some of it is reserved for internal usage)
  • We are working on aligning Linux discrete to 98%.

AdamCetnerowski avatar Dec 28 '23 12:12 AdamCetnerowski