VulkanMemoryAllocator
Please reconsider the new memory usage flags
The new API is extremely confusing. It should not pretend to know better than I do what memory type I need.
E.g. people will start using VMA_ALLOCATION_CREATE_HOST_ACCESS_SEQUENTIAL_WRITE_BIT for staging buffers because hey, they are only writing sequentially to it, right? And then they get handed GPU-local host-visible memory, which is slow to write to, potentially too small for what they want to stage, and slow for the GPU to copy from, e.g. on AMD, where transfer queue VRAM-to-VRAM copies are broken.
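Concretely, the pattern I'm worried about looks something like this (a sketch with a made-up size, not code from any real project):

```cpp
// Staging buffer created "the obvious way" with the new flags: AUTO plus
// HOST_ACCESS_SEQUENTIAL_WRITE. Nothing here says "system RAM", so on some
// machines this can land in the small CPU-visible VRAM heap.
VkBufferCreateInfo bufInfo = { VK_STRUCTURE_TYPE_BUFFER_CREATE_INFO };
bufInfo.size = 64ull * 1024 * 1024;               // hypothetical staging size
bufInfo.usage = VK_BUFFER_USAGE_TRANSFER_SRC_BIT;

VmaAllocationCreateInfo allocInfo = {};
allocInfo.usage = VMA_MEMORY_USAGE_AUTO;
allocInfo.flags = VMA_ALLOCATION_CREATE_HOST_ACCESS_SEQUENTIAL_WRITE_BIT;

VkBuffer buffer;
VmaAllocation allocation;
vmaCreateBuffer(allocator, &bufInfo, &allocInfo, &buffer, &allocation, nullptr);
```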
I don't think it is in the spirit of Vulkan to hide the machine from the user. It just makes it unclear what's actually going to happen.
More specifically, I believe we are hitting a problem that results from the combination of the following factors in VMA:
- When creating a custom pool, you have to provide a single memory type. That means if that memory type happens to be very limited on a given machine (like the 256 MiB of GPU-local host-visible memory), it will likely hit OOM, and the resulting code is fragile. Is there a reason a custom pool needs to be given an exact memory type? Ideally, you'd just say what kind of memory you want (exactly as you do for general allocations), and VMA would manage the exact memory types automatically.
- HOST_ACCESS_SEQUENTIAL_WRITE + PREFER_HOST always results in HOST_CACHED being undesirable. This sounds normal on paper, but in practice NVidia machines don't expose system memory that isn't cached, so VMA ends up looking for device-local memory, and it shouldn't do that. This combination of flags should always result in system memory (aka the D3D12 upload heap).
- When choosing the minimal cost, each bit is considered equal weight. In this case, we have one bit that wants the memory to not be HOST_CACHED and another bit that wants it to not be DEVICE_LOCAL; we end up with a tie, and VMA chooses DEVICE_LOCAL memory. This hurts performance significantly and leads to OOM in our case.
I have a suggestion on how to solve this. VMA could store a flag somewhere internally for "device_has_non_cached_host_memory". If this flag is true, then it should behave the same as today. However, if the flag is false (and this is the case on NVidia), it shouldn't try to look for non-cached memory. How does this sound to you?
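To illustrate, a rough sketch of how that check might look (a hypothetical helper; nothing like this exists in VMA today):

```cpp
#include <vulkan/vulkan.h>

// Hypothetical helper: returns true if the device exposes host-visible
// system memory (not DEVICE_LOCAL) that is uncached, i.e. write-combined.
// On NVidia this would return false, since all system memory is HOST_CACHED.
bool DeviceHasNonCachedHostMemory(VkPhysicalDevice physicalDevice)
{
    VkPhysicalDeviceMemoryProperties memProps;
    vkGetPhysicalDeviceMemoryProperties(physicalDevice, &memProps);
    for (uint32_t i = 0; i < memProps.memoryTypeCount; ++i)
    {
        const VkMemoryPropertyFlags f = memProps.memoryTypes[i].propertyFlags;
        if ((f & VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT) != 0 &&
            (f & VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT) == 0 &&
            (f & VK_MEMORY_PROPERTY_HOST_CACHED_BIT) == 0)
        {
            return true;
        }
    }
    return false;
}
```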
I think I got bit by this. I investigated a slowdown related to vertex/index buffers that are updated every frame from the CPU. Bisecting led me to the commit that integrated VMA.
VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT | VK_MEMORY_PROPERTY_HOST_COHERENT_BIT is used for these buffers. When integrating VMA, the code creating the buffers was modified to pass VMA_MEMORY_USAGE_AUTO and VMA_ALLOCATION_CREATE_HOST_ACCESS_SEQUENTIAL_WRITE_BIT. This results in memory with VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT, which in turn leads to writing to these buffers on the CPU taking a lot longer on both the Nvidia and AMD GPUs I have to test with: ~1.8 times as long for Nvidia and ~1.3 times as long for AMD. (For the integrated Intel GPU I have tested with it does not matter, since VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT is set for the memory type both before and after the VMA integration, which is not surprising.)
I have currently hacked around this by always passing VMA_MEMORY_USAGE_AUTO_PREFER_HOST if VMA_ALLOCATION_CREATE_HOST_ACCESS_SEQUENTIAL_WRITE_BIT is passed to the buffer creation function.
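The hack is essentially this (a sketch; the wrapper is a simplification of what our buffer creation function does):

```cpp
#include "vk_mem_alloc.h"

// Sketch of the workaround: force AUTO_PREFER_HOST whenever sequential host
// writes are requested, so these buffers end up in system RAM rather than
// CPU-visible VRAM.
VmaAllocationCreateInfo MakeAllocCreateInfo(VmaAllocationCreateFlags flags)
{
    VmaAllocationCreateInfo allocInfo = {};
    allocInfo.flags = flags;
    allocInfo.usage =
        (flags & VMA_ALLOCATION_CREATE_HOST_ACCESS_SEQUENTIAL_WRITE_BIT) != 0
            ? VMA_MEMORY_USAGE_AUTO_PREFER_HOST
            : VMA_MEMORY_USAGE_AUTO;
    return allocInfo;
}
```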
I am sorry for the delayed response.
@Novum: I understand your point. The purpose of introducing the VMA_MEMORY_USAGE_AUTO* flags was to make the API of the library as simple to use as possible in basic use cases and for beginners. If you understand caveats like the performance of VRAM to VRAM copies on different hardware queues, you likely want to use VMA_MEMORY_USAGE_AUTO_PREFER_DEVICE / _HOST, or narrow down the choice of selected memory types using e.g. VmaAllocationCreateInfo::requiredFlags or VmaAllocationCreateInfo::memoryTypeBits.
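For example, a staging buffer can be pinned to plain system memory like this (a sketch; not the only valid combination):

```cpp
// Prefer system RAM and additionally require HOST_VISIBLE | HOST_COHERENT,
// so CPU-visible VRAM types are deprioritized.
VmaAllocationCreateInfo allocInfo = {};
allocInfo.usage = VMA_MEMORY_USAGE_AUTO_PREFER_HOST;
allocInfo.flags = VMA_ALLOCATION_CREATE_HOST_ACCESS_SEQUENTIAL_WRITE_BIT;
allocInfo.requiredFlags = VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT |
                          VK_MEMORY_PROPERTY_HOST_COHERENT_BIT;
```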
Answering your specific concerns:
- Access patterns unfriendly to uncached and write-combined memory (other than sequential writes) are very slow regardless of whether you use system RAM or VRAM.
- Sequential writes through a pointer to VRAM going over PCIe Gen 4 are almost as fast as writes to system RAM - same order of magnitude.
- If you use the graphics or compute queue, not the copy queue, for your copies, they will be fast regardless of whether you copy from system RAM or VRAM.
- If the CPU-visible VRAM is small, Vulkan will either fail the allocation (which will make VMA fall back to system RAM) or return success and silently migrate something to system RAM, which is usually what you want in this case.
See also this article, which is the result of performance experiments we did at AMD: https://gpuopen.com/learn/get-the-most-out-of-smart-access-memory/
@kvark:
- I agree with you that if you create a custom pool that ends up in a memory heap with small capacity, it can cause problems. Because we are talking about choosing a memory type index explicitly, I would recommend inspecting heap sizes before making the decision or, even better, not using custom pools at all. Custom pools aren't needed as often as developers use them.
- I agree with you this is a flaw of the current algorithm. It shouldn't choose DEVICE_LOCAL memory in this case. I will fix it.
@martin-ejdestig: I am sorry to hear you experienced decreased performance after integrating VMA. I think your workaround of using VMA_MEMORY_USAGE_AUTO_PREFER_HOST is a good solution. However, I also recommend inspecting the code that writes into the mapped pointer: whether it really is a sequential write and not random access, and whether it involves any reads, even implicit ones like pData[i] += a. Maybe try preparing the data in a local variable and using memcpy to copy it to the mapped pointer, and see if it helps performance.
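Something along these lines (a sketch; Vertex, kVertexCount and FillVertices are placeholders for your own code):

```cpp
#include <cstring>

// Prepare the data in ordinary cached CPU memory first...
Vertex cpuVertices[kVertexCount];   // placeholder type and count
FillVertices(cpuVertices);          // placeholder for your per-frame fill

// ...then transfer it to the mapped pointer in one sequential memcpy.
void* mapped = nullptr;
vmaMapMemory(allocator, allocation, &mapped);
std::memcpy(mapped, cpuVertices, sizeof(cpuVertices));
vmaUnmapMemory(allocator, allocation);
```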
@kvark: Can you please describe your case in more detail? I tried to reproduce the issue with memory type selection in my test environment but I couldn't. What memory heaps and types are available on your GPU? What GPU and platform is this?
I was looking for an Nvidia GPU like you described in the database https://vulkan.gpuinfo.org/. On Windows as well as Linux, the CPU-visible VRAM is DEVICE_LOCAL | HOST_VISIBLE | HOST_COHERENT and only CPU (non-DEVICE_LOCAL) memory is HOST_CACHED.
https://vulkan.gpuinfo.org/displayreport.php?id=18348#memory
https://vulkan.gpuinfo.org/displayreport.php?id=18590#memory
I found a GPU that has all HOST_VISIBLE memory also HOST_CACHED on MacOS:
https://vulkan.gpuinfo.org/displayreport.php?id=18304#memory
But in my experiments, specifying VMA_MEMORY_USAGE_AUTO_PREFER_HOST + VMA_ALLOCATION_CREATE_HOST_ACCESS_SEQUENTIAL_WRITE_BIT correctly selects the non-DEVICE_LOCAL memory type.
What VkBufferUsage / VkImageUsage flags do you use?
> @martin-ejdestig: I am sorry to hear you experienced decreased performance after integrating VMA. I think your workaround of using VMA_MEMORY_USAGE_AUTO_PREFER_HOST is a good solution. However, I also recommend inspecting the code that writes into the mapped pointer: whether it really is a sequential write and not random access, and whether it involves any reads, even implicit ones like pData[i] += a. Maybe try preparing the data in a local variable and using memcpy to copy it to the mapped pointer, and see if it helps performance.
It is sequential. A simple for loop that iterates over the data and writes indices and vertices, with no modification like += etc. (It "ping pongs" between the vertex and index buffer though, if that matters. I tried writing to buffers that were then copied with memcpy() before integrating VMA. If I remember correctly it only got slower.)
Anyway, I was just surprised by this. But I also see the value of having VMA_MEMORY_USAGE_*.
So... I do not know. I feel that I am only adding noise here, will go back to lurking in the shadows. :)
Using VMA_ALLOCATION_CREATE_HOST_ACCESS_SEQUENTIAL_WRITE_BIT, I would have expected to always receive a memory type that is HOST_COHERENT. However, on MacOS and most mobile devices, unless you use VMA_MEMORY_USAGE_AUTO_PREFER_HOST, you receive the memory type that is DEVICE_LOCAL | HOST_VISIBLE | HOST_CACHED. That doesn't seem to me like what the user would want, and so I'm wondering if this is the intended behavior. I might even prefer DEVICE_LOCAL memory for e.g. my uniform buffer, but if that memory is not HOST_COHERENT, is it worth it?
What speaks against requiring HOST_COHERENT when using VMA_ALLOCATION_CREATE_HOST_ACCESS_SEQUENTIAL_WRITE_BIT? To my knowledge that wouldn't affect any PCs or laptops, only Macs, MacBooks, and mobile devices.
VMA doesn't automatically prefer HOST_COHERENT memory. The only difference between HOST_COHERENT and non-HOST_COHERENT memory is that you need to call flush/invalidate after you write / before you read the memory through a mapped pointer. I don't think this is a big burden. On Windows PCs, all GPU vendors provide all memory types that are HOST_VISIBLE also as HOST_COHERENT. If the platforms you use don't do that, you can just call vmaFlushAllocation / vmaInvalidateAllocation every time it is needed. There is no need to check whether the memory is HOST_COHERENT; VMA checks that automatically, and if it is, the function does nothing.
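For example, assuming the allocation was created with VMA_ALLOCATION_CREATE_MAPPED_BIT (a sketch):

```cpp
#include <cstring>

// Write through the persistently mapped pointer, then flush the written
// range. If the memory type happens to be HOST_COHERENT, the flush is a
// no-op inside VMA.
VmaAllocationInfo info = {};
vmaGetAllocationInfo(allocator, allocation, &info);
std::memcpy(info.pMappedData, data, dataSize);
vmaFlushAllocation(allocator, allocation, 0, dataSize);
```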
Alternatively, you can request the memory to be HOST_COHERENT by adding:
allocInfo.requiredFlags = VK_MEMORY_PROPERTY_HOST_COHERENT_BIT;
This can be freely mixed with other parameters of VmaAllocationCreateInfo, like the usage flags. However, I don't recommend this method: if your platform provides a non-HOST_COHERENT memory type that meets all your other requirements and is higher on the list of memory types, it may be more efficient.
My understanding was that HOST_COHERENT memory is additionally faster for the host to write because writes can be combined. Is this not the case if the memory is HOST_COHERENT | HOST_CACHED?
> However, I don't recommend this method: if your platform provides a non-HOST_COHERENT memory type that meets all your other requirements and is higher on the list of memory types, it may be more efficient.
This question of mine arose precisely because of this: the devices I was looking at had this DEVICE_LOCAL memory type at a lower position than the one that's HOST_COHERENT and not DEVICE_LOCAL. Yet using VMA_MEMORY_USAGE_AUTO, you would get the DEVICE_LOCAL one. Would you still recommend this memory type in this case?
Memory properties of a MacOS device

```
VkPhysicalDeviceMemoryProperties:
=================================
memoryHeaps: count = 2
    memoryHeaps[0]:
        size   = 4294967296 (0x100000000) (4.00 GiB)
        budget = 4294967296 (0x100000000) (4.00 GiB)
        usage  = 0 (0x00000000) (0.00 B)
        flags: count = 1
            MEMORY_HEAP_DEVICE_LOCAL_BIT
    memoryHeaps[1]:
        size   = 17179869184 (0x400000000) (16.00 GiB)
        budget = 4093403136 (0xf3fc6000) (3.81 GiB)
        usage  = 6578176 (0x00646000) (6.27 MiB)
        flags: None
memoryTypes: count = 3
    memoryTypes[0]:
        heapIndex     = 0
        propertyFlags = 0x0001: count = 1
            MEMORY_PROPERTY_DEVICE_LOCAL_BIT
        usable for:
            IMAGE_TILING_OPTIMAL: color images, FORMAT_D16_UNORM, FORMAT_D32_SFLOAT, FORMAT_S8_UINT, FORMAT_D24_UNORM_S8_UINT, FORMAT_D32_SFLOAT_S8_UINT (non-sparse)
            IMAGE_TILING_LINEAR: color images (non-sparse, non-transient)
    memoryTypes[1]:
        heapIndex     = 1
        propertyFlags = 0x000e: count = 3
            MEMORY_PROPERTY_HOST_VISIBLE_BIT
            MEMORY_PROPERTY_HOST_COHERENT_BIT
            MEMORY_PROPERTY_HOST_CACHED_BIT
        usable for:
            IMAGE_TILING_OPTIMAL: None
            IMAGE_TILING_LINEAR: color images (non-sparse, non-transient)
    memoryTypes[2]:
        heapIndex     = 0
        propertyFlags = 0x000b: count = 3
            MEMORY_PROPERTY_DEVICE_LOCAL_BIT
            MEMORY_PROPERTY_HOST_VISIBLE_BIT
            MEMORY_PROPERTY_HOST_CACHED_BIT
        usable for:
            IMAGE_TILING_OPTIMAL: color images (non-sparse)
            IMAGE_TILING_LINEAR: color images (non-sparse, non-transient)
```
Although not fully documented and rather platform-specific, I would expect the HOST_CACHED flag to determine what you mentioned: with this flag, CPU accesses to the memory are cached; without it, they are uncached but write-combined. I think HOST_COHERENT is less related to that. Possibly, HOST_COHERENT memory is slower in some way because of the overhead of the cache coherency between host and device that must be ensured automatically.
> This question of mine arose precisely because of this: the devices I was looking at had this DEVICE_LOCAL memory type at a lower position than the one that's HOST_COHERENT and not DEVICE_LOCAL. Yet using VMA_MEMORY_USAGE_AUTO, you would get the DEVICE_LOCAL one. Would you still recommend this memory type in this case?
From the memory heaps and types you showed, I would expect the following (a rough mapping to VMA allocation flags is sketched after this list):
- Type 0 is the video memory, not accessible to the host. Good to be used for GPU-only resources like color attachment images or sampled images that are uploaded once.
- Type 1 is the system memory, good for staging buffers used as a source or destination of a transfer.
- Type 2 is the CPU-visible video memory, good for writing data directly from the CPU and reading it on the GPU, like uniform buffers changing every render frame.
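Expressed in VMA terms, that intent could look roughly like this (a sketch of one possible mapping; which type actually gets selected is still up to the allocator):

```cpp
// Type 0 (DEVICE_LOCAL only): GPU-only resources.
VmaAllocationCreateInfo gpuOnly = {};
gpuOnly.usage = VMA_MEMORY_USAGE_AUTO_PREFER_DEVICE;

// Type 1 (system memory): staging buffers written by the CPU.
VmaAllocationCreateInfo staging = {};
staging.usage = VMA_MEMORY_USAGE_AUTO_PREFER_HOST;
staging.flags = VMA_ALLOCATION_CREATE_HOST_ACCESS_SEQUENTIAL_WRITE_BIT;

// Type 2 (CPU-visible video memory): data written by the CPU and read by
// the GPU every frame, e.g. uniform buffers.
VmaAllocationCreateInfo perFrame = {};
perFrame.usage = VMA_MEMORY_USAGE_AUTO_PREFER_DEVICE;
perFrame.flags = VMA_ALLOCATION_CREATE_HOST_ACCESS_SEQUENTIAL_WRITE_BIT;
```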
I realize now that what I said doesn't make much sense; a bit silly of me. What you're suggesting makes the most sense. Thanks a lot for taking the time to explain! I really appreciate it. :) It's a bit sad that the best way to learn about what these flags do is by word of mouth.