
Please reconsider the new memory usage flags

Open · Novum opened this issue 2 years ago

The new API is extremely confusing. It should not pretend to know better than I do which memory type I need.

E.g. people will start using VMA_ALLOCATION_CREATE_HOST_ACCESS_SEQUENTIAL_WRITE_BIT for staging buffers because, hey, they are only writing sequentially to them, right? And then they get handed GPU-local, host-visible memory, which is slow to write to, potentially too small for what they want to stage, and slow for the GPU to copy from, e.g. on AMD where transfer queue VRAM-to-VRAM copies are broken.
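
For concreteness, a minimal sketch of the pattern I mean (buffer size and usage are made up, and 'allocator' is an existing VmaAllocator):

VkBufferCreateInfo stagingInfo = { VK_STRUCTURE_TYPE_BUFFER_CREATE_INFO };
stagingInfo.size = 64ull * 1024 * 1024;
stagingInfo.usage = VK_BUFFER_USAGE_TRANSFER_SRC_BIT;

VmaAllocationCreateInfo allocInfo = {};
allocInfo.usage = VMA_MEMORY_USAGE_AUTO;
// "I'm only writing sequentially" - but this can end up in the small
// DEVICE_LOCAL | HOST_VISIBLE heap instead of plain system memory.
allocInfo.flags = VMA_ALLOCATION_CREATE_HOST_ACCESS_SEQUENTIAL_WRITE_BIT;

VkBuffer stagingBuffer = VK_NULL_HANDLE;
VmaAllocation stagingAlloc = VK_NULL_HANDLE;
vmaCreateBuffer(allocator, &stagingInfo, &allocInfo, &stagingBuffer, &stagingAlloc, nullptr);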

I don't think it is in the spirit of Vulkan to hide the machine from the user. It just makes it unclear what's actually going to happen.

Novum avatar May 18 '22 21:05 Novum

More specifically, I believe we are hitting a problem that results from the combination of the following factors in VMA:

  1. When creating a custom pool, you have to provide a single memory type. That means that if the memory type happens to be very limited on a given machine (like the 256 MB GPU-local, host-visible memory), it will likely hit OOM, and the resulting code will be fragile. Is there a reason a custom pool needs to be given an exact memory type? Ideally, you would just say what kind of memory you want (exactly as you do for general allocations), and VMA would manage the exact memory types automatically.
  2. HOST_ACCESS_SEQUENTIAL_WRITE + PREFER_HOST always results in HOST_CACHED being undesirable. This sounds reasonable on paper, but in practice NVidia machines don't expose system memory that isn't cached, so VMA ends up looking for device-local memory, and it shouldn't do that. This combination of flags should always result in system memory (aka the D3D12 upload heap).
  3. When choosing the minimal cost, each bit is given equal weight. In this case, we have one bit that wants the memory to not be HOST_CACHED and another bit that wants it to not be DEVICE_LOCAL; we end up with a tie, and VMA chooses DEVICE_LOCAL memory. This hurts performance significantly and leads to OOM in our case.

I have a suggestion on how to solve this. VMA could store a flag somewhere internally for "device_has_non_cached_host_memory". If this flag is true, then it should behave the same as today. However, if the flag is false (and this is the case on NVidia), it shouldn't try to look for non-cached memory. How does this sound to you?
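
A rough sketch of how such a flag could be computed (illustrative only, not VMA's actual code; 'memProps' would come from vkGetPhysicalDeviceMemoryProperties):

bool deviceHasNonCachedHostMemory = false;
for (uint32_t i = 0; i < memProps.memoryTypeCount; ++i)
{
    const VkMemoryPropertyFlags f = memProps.memoryTypes[i].propertyFlags;
    const bool hostVisible = (f & VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT) != 0;
    const bool deviceLocal = (f & VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT) != 0;
    const bool hostCached  = (f & VK_MEMORY_PROPERTY_HOST_CACHED_BIT) != 0;
    if (hostVisible && !deviceLocal && !hostCached)
    {
        deviceHasNonCachedHostMemory = true; // uncached system memory exists
        break;
    }
}
// If false (typical for NVidia), the "avoid HOST_CACHED" preference could simply
// be dropped instead of tying against the "avoid DEVICE_LOCAL" preference.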

kvark avatar May 18 '22 22:05 kvark

I think I got bit by this. I investigated a slowdown related to vertex/index buffers that are updated every frame from the CPU. Bisecting led me to the commit that integrated VMA.

VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT | VK_MEMORY_PROPERTY_HOST_COHERENT_BIT is used for these buffers. When integrating VMA, the code creating buffers was modified to pass VMA_MEMORY_USAGE_AUTO and VMA_ALLOCATION_CREATE_HOST_ACCESS_SEQUENTIAL_WRITE_BIT. This results in memory with VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT, which in turn makes writing to these buffers on the CPU take a lot longer on both the Nvidia and AMD GPUs I have to test with: ~1.8 times as long for Nvidia and ~1.3 times as long for AMD. (For the integrated Intel GPU I have tested with, it does not matter, since VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT is set for the memory type both before and after the VMA integration, which is not surprising.)

I have currently hacked around this by always passing VMA_MEMORY_USAGE_AUTO_PREFER_HOST if VMA_ALLOCATION_CREATE_HOST_ACCESS_SEQUENTIAL_WRITE_BIT is passed to the buffer creation function.
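
Roughly like this (a sketch; the wrapper function and its parameters are from my own code and only illustrative):

VkResult CreateBuffer(VmaAllocator allocator,
                      const VkBufferCreateInfo &buffer_info,
                      VmaAllocationCreateFlags flags,
                      VkBuffer *buffer,
                      VmaAllocation *allocation)
{
    VmaAllocationCreateInfo alloc_info = {};
    alloc_info.flags = flags;
    // Workaround: steer VMA away from DEVICE_LOCAL memory for CPU-written buffers.
    alloc_info.usage = (flags & VMA_ALLOCATION_CREATE_HOST_ACCESS_SEQUENTIAL_WRITE_BIT)
                           ? VMA_MEMORY_USAGE_AUTO_PREFER_HOST
                           : VMA_MEMORY_USAGE_AUTO;
    return vmaCreateBuffer(allocator, &buffer_info, &alloc_info, buffer, allocation, nullptr);
}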

martin-ejdestig avatar Jan 13 '23 17:01 martin-ejdestig

I am sorry for the delayed response.

@Novum: I understand your point. The purpose of introducing VMA_MEMORY_USAGE_AUTO* flags was to make the API of the library as simple to use as possible in basic use cases and for beginners. If you understand caveats like the performance of VRAM to VRAM copy on different hardware queues, you likely want to use VMA_MEMORY_USAGE_AUTO_PREFER_DEVICE / _HOST or narrow down the choice of selected memory types using e.g. VmaAllocationCreateInfo::requiredFlags or VmaAllocationCreateInfo::memoryTypeBits.
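
For example, for a staging buffer you could keep the automatic selection but steer it explicitly (a sketch, not the only correct setup):

VmaAllocationCreateInfo allocCreateInfo = {};
allocCreateInfo.usage = VMA_MEMORY_USAGE_AUTO_PREFER_HOST;
allocCreateInfo.flags = VMA_ALLOCATION_CREATE_HOST_ACCESS_SEQUENTIAL_WRITE_BIT;
// Optionally narrow the choice further:
allocCreateInfo.requiredFlags = VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT;
// allocCreateInfo.memoryTypeBits = acceptableMemoryTypeMask; // mask computed by the application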

Answering your specific concerns:

  • Access patterns unfriendly to uncached, write-combined memory (anything other than sequential writes) are very slow regardless of whether you use system RAM or VRAM.
  • Sequential writes through a pointer to VRAM over PCIe Gen 4 are almost as fast as writes to system RAM - the same order of magnitude.
  • If you use the graphics or compute queue rather than the copy queue for your copies, they will be fast regardless of whether you copy from system RAM or VRAM.
  • If the CPU-visible VRAM is small, Vulkan will either fail the allocation (which makes VMA fall back to system RAM) or return success and silently migrate something to system RAM, which is usually what you want in this case.

See also this article, which is the result of performance experiments we did at AMD: https://gpuopen.com/learn/get-the-most-out-of-smart-access-memory/

@kvark:

  1. I agree with you that if you create a custom pool that ends up in a memory heap with a small capacity, it can cause problems. Because we are talking about choosing a memory type index explicitly, I would recommend inspecting heap sizes before making the decision (see the sketch after this list) or, even better, not using custom pools at all. Custom pools aren't needed as often as developers use them.
  2. I agree with you this is a flaw of the current algorithm. It shouldn't choose DEVICE_LOCAL memory in this case. I will fix it.
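
Regarding point 1, heap sizes can be checked like this before committing to a memory type for a custom pool (a sketch; the 512 MB threshold, the example buffer parameters, and 'allocator' are illustrative):

VkBufferCreateInfo sampleBufInfo = { VK_STRUCTURE_TYPE_BUFFER_CREATE_INFO };
sampleBufInfo.size = 65536;
sampleBufInfo.usage = VK_BUFFER_USAGE_UNIFORM_BUFFER_BIT | VK_BUFFER_USAGE_TRANSFER_DST_BIT;

VmaAllocationCreateInfo allocCreateInfo = {};
allocCreateInfo.usage = VMA_MEMORY_USAGE_AUTO_PREFER_DEVICE;
allocCreateInfo.flags = VMA_ALLOCATION_CREATE_HOST_ACCESS_SEQUENTIAL_WRITE_BIT;

uint32_t memTypeIndex = 0;
vmaFindMemoryTypeIndexForBufferInfo(allocator, &sampleBufInfo, &allocCreateInfo, &memTypeIndex);

const VkPhysicalDeviceMemoryProperties *memProps = nullptr;
vmaGetMemoryProperties(allocator, &memProps);
const uint32_t heapIndex = memProps->memoryTypes[memTypeIndex].heapIndex;
if (memProps->memoryHeaps[heapIndex].size < 512ull * 1024 * 1024)
{
    // The heap is small (e.g. a 256 MB BAR region) - consider different flags,
    // or skip the custom pool and use general allocations instead.
}
else
{
    VmaPoolCreateInfo poolCreateInfo = {};
    poolCreateInfo.memoryTypeIndex = memTypeIndex;
    VmaPool pool = VK_NULL_HANDLE;
    vmaCreatePool(allocator, &poolCreateInfo, &pool);
}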

@martin-ejdestig: I am sorry to hear you experienced decreased performance after integrating VMA. I think your workaround of using VMA_MEMORY_USAGE_AUTO_PREFER_HOST is a good solution. However, I also recommend inspecting the code that writes into the mapped pointer: check whether it really is a sequential write and not random access, and whether it involves any reads, even implicit ones like pData[i] += a. Maybe try preparing the data in a local variable, memcpy it to the mapped pointer, and see if that helps performance.
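
Something along these lines (a sketch; the Vertex struct and mappedPtr are placeholders for your own types and mapping):

#include <cstring>
#include <vector>

struct Vertex { float pos[3]; float uv[2]; }; // placeholder vertex layout

// mappedPtr is assumed to be the pointer obtained from vmaMapMemory() (or
// VmaAllocationInfo::pMappedData) for the vertex buffer allocation.
void UploadVertices(void *mappedPtr, const std::vector<Vertex> &vertices)
{
    // Prepare the data in ordinary cached system memory (the vector), then write
    // the mapped pointer once, front to back, with no reads from it.
    std::memcpy(mappedPtr, vertices.data(), vertices.size() * sizeof(Vertex));
}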

adam-sawicki-a avatar Jan 23 '23 15:01 adam-sawicki-a

@kvark: Can you please describe your case in more detail? I tried to reproduce the issue with memory type selection in my test environment but I couldn't. What memory heaps and types are available on your GPU? What GPU and platform is this?

I was looking for an Nvidia GPU like the one you described in the https://vulkan.gpuinfo.org/ database. On Windows as well as Linux, the CPU-visible VRAM is DEVICE_LOCAL | HOST_VISIBLE | HOST_COHERENT and only the CPU (non-DEVICE_LOCAL) memory is HOST_CACHED.

https://vulkan.gpuinfo.org/displayreport.php?id=18348#memory https://vulkan.gpuinfo.org/displayreport.php?id=18590#memory

I found a GPU on MacOS where all HOST_VISIBLE memory is also HOST_CACHED: https://vulkan.gpuinfo.org/displayreport.php?id=18304#memory But in my experiments, specifying VMA_MEMORY_USAGE_AUTO_PREFER_HOST + VMA_ALLOCATION_CREATE_HOST_ACCESS_SEQUENTIAL_WRITE_BIT correctly selects the non-DEVICE_LOCAL memory type.

What VkBufferUsage / VkImageUsage flags do you use?

adam-sawicki-a avatar Jan 23 '23 16:01 adam-sawicki-a

@martin-ejdestig: I am sorry to hear you experienced decreased performance after integrating VMA. I think your workaround of using VMA_MEMORY_USAGE_AUTO_PREFER_HOST is a good solution. However, I also recommend inspecting the code that writes into the mapped pointer: check whether it really is a sequential write and not random access, and whether it involves any reads, even implicit ones like pData[i] += a. Maybe try preparing the data in a local variable, memcpy it to the mapped pointer, and see if that helps performance.

It is sequential. A simple for loop that iterates over the data and writes indices and vertices, with no modification like += etc. (It "ping pongs" between the vertex and index buffer though, if that matters. I tried writing to buffers that were then copied with memcpy() before integrating VMA. If I remember correctly, it only got slower.)

Anyway, I was just surprised by this. But I also see the value of having VMA_MEMORY_USAGE_*.

So... I do not know. I feel that I am only adding noise here, will go back to lurking in the shadows. :)

martin-ejdestig avatar Jan 23 '23 17:01 martin-ejdestig

Using VMA_ALLOCATION_CREATE_HOST_ACCESS_SEQUENTIAL_WRITE_BIT, I would have expected to always receive a memory type that is HOST_COHERENT. However, on MacOS and most mobile devices, unless you use VMA_MEMORY_USAGE_AUTO_PREFER_HOST, you receive the memory type that is DEVICE_LOCAL | HOST_VISIBLE | HOST_CACHED. That doesn't seem like what the user would want, so I'm wondering if this is the intended behavior. I might even prefer DEVICE_LOCAL memory for e.g. my uniform buffer, but if that memory is not HOST_COHERENT, is it worth it?

What speaks against requiring HOST_COHERENT when using VMA_ALLOCATION_CREATE_HOST_ACCESS_SEQUENTIAL_WRITE_BIT? To my knowledge that wouldn't affect any PCs or laptops, only Macs, MacBooks, and mobile devices.

marc0246 avatar Jul 16 '23 11:07 marc0246

VMA doesn't automatically prefer HOST_COHERENT memory. The only difference with non-HOST_COHERENT memory is that you need to call flush after you write / invalidate before you read the memory through a mapped pointer. I don't think this is a big burden. On Windows PCs, all GPU vendors expose every memory type that is HOST_VISIBLE as HOST_COHERENT as well. If the platforms you use don't do that, you can just call vmaFlushAllocation / vmaInvalidateAllocation every time it is needed. There is no need to check whether the memory is HOST_COHERENT: VMA checks that automatically, and if it is, the function does nothing.
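
The pattern is simply (a sketch; allocator, allocation, data, and dataSize are assumed to exist already):

void *mapped = nullptr;
vmaMapMemory(allocator, allocation, &mapped);
memcpy(mapped, data, dataSize);
// A no-op for HOST_COHERENT memory, required otherwise - safe to call unconditionally.
vmaFlushAllocation(allocator, allocation, 0, VK_WHOLE_SIZE);
vmaUnmapMemory(allocator, allocation);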

Alternatively, you can request the memory to be HOST_COHERENT by adding:

allocInfo.requiredFlags = VK_MEMORY_PROPERTY_HOST_COHERENT_BIT;

This can be freely mixed with other parameters of VmaAllocationCreateInfo, like the usage flags. However, I don't recommend this method. If your platform provides a non-HOST_COHERENT memory type that meets all your other requirements and is higher on the list of memory types, it may be more efficient.

adam-sawicki-a avatar Jul 17 '23 08:07 adam-sawicki-a

My understanding was that HOST_COHERENT memory is additionally faster for the host to write because writes can be combined. Is this not the case if the memory is HOST_COHERENT | HOST_CACHED?

However, I don't recommend this method. If your platform provides a non-HOST_COHERENT memory type that meets all your other requirements and is higher on the list of memory types, it may be more efficient.

This question of mine arose precisely because of this: the devices I was looking at had this DEVICE_LOCAL memory type at a lower position than the one that's HOST_COHERENT and not DEVICE_LOCAL. Yet using VMA_MEMORY_USAGE_AUTO, you would get the DEVICE_LOCAL one. Would you still recommend this memory type in this case?

Memory properties of a MacOS device
VkPhysicalDeviceMemoryProperties:
=================================
memoryHeaps: count = 2
	memoryHeaps[0]:
		size   = 4294967296 (0x100000000) (4.00 GiB)
		budget = 4294967296 (0x100000000) (4.00 GiB)
		usage  = 0 (0x00000000) (0.00 B)
		flags: count = 1
			MEMORY_HEAP_DEVICE_LOCAL_BIT
	memoryHeaps[1]:
		size   = 17179869184 (0x400000000) (16.00 GiB)
		budget = 4093403136 (0xf3fc6000) (3.81 GiB)
		usage  = 6578176 (0x00646000) (6.27 MiB)
		flags:
			None
memoryTypes: count = 3
	memoryTypes[0]:
		heapIndex     = 0
		propertyFlags = 0x0001: count = 1
			MEMORY_PROPERTY_DEVICE_LOCAL_BIT
		usable for:
			IMAGE_TILING_OPTIMAL:
				color images
				FORMAT_D16_UNORM
				FORMAT_D32_SFLOAT
				FORMAT_S8_UINT
				FORMAT_D24_UNORM_S8_UINT
				FORMAT_D32_SFLOAT_S8_UINT
				(non-sparse)
			IMAGE_TILING_LINEAR:
				color images
				(non-sparse, non-transient)
	memoryTypes[1]:
		heapIndex     = 1
		propertyFlags = 0x000e: count = 3
			MEMORY_PROPERTY_HOST_VISIBLE_BIT
			MEMORY_PROPERTY_HOST_COHERENT_BIT
			MEMORY_PROPERTY_HOST_CACHED_BIT
		usable for:
			IMAGE_TILING_OPTIMAL:
				None
			IMAGE_TILING_LINEAR:
				color images
				(non-sparse, non-transient)
	memoryTypes[2]:
		heapIndex     = 0
		propertyFlags = 0x000b: count = 3
			MEMORY_PROPERTY_DEVICE_LOCAL_BIT
			MEMORY_PROPERTY_HOST_VISIBLE_BIT
			MEMORY_PROPERTY_HOST_CACHED_BIT
		usable for:
			IMAGE_TILING_OPTIMAL:
				color images
				(non-sparse)
			IMAGE_TILING_LINEAR:
				color images
				(non-sparse, non-transient)

marc0246 avatar Jul 17 '23 12:07 marc0246

Although this is not fully documented and is rather platform-specific, I would expect the HOST_CACHED flag to determine what you mentioned: with this flag, CPU accesses to the memory are cached; without it, they are uncached but write-combined. I think HOST_COHERENT is less related to that. Possibly, HOST_COHERENT is slower in some way because of the overhead of cache coherency between host and device that must be maintained automatically.

This question of mine arose precisely because of this: the devices I was looking at had this DEVICE_LOCAL memory type at a lower position than the one that's HOST_COHERENT and not DEVICE_LOCAL. Yet using VMA_MEMORY_USAGE_AUTO, you would get the DEVICE_LOCAL one. Would you still recommend this memory type in this case?

From the memory heaps and types you showed, I would expect:

  • Type 0 is the video memory, not accessible to the host. Good to be used for GPU-only resources like color attachment images or sampled images that are uploaded once.
  • Type 1 is the system memory, good for staging buffers used as a source or destination of a transfer.
  • Type 2 is the CPU-visible video memory, good for writing data directly from the CPU and reading it on the GPU, like uniform buffers changing every render frame (sketched below).
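
For the type 2 case, something like this is the usage I would expect (a sketch; PerFrameConstants and 'allocator' are placeholders):

VkBufferCreateInfo bufCreateInfo = { VK_STRUCTURE_TYPE_BUFFER_CREATE_INFO };
bufCreateInfo.size = sizeof(PerFrameConstants);
bufCreateInfo.usage = VK_BUFFER_USAGE_UNIFORM_BUFFER_BIT;

VmaAllocationCreateInfo allocCreateInfo = {};
allocCreateInfo.usage = VMA_MEMORY_USAGE_AUTO;
allocCreateInfo.flags = VMA_ALLOCATION_CREATE_HOST_ACCESS_SEQUENTIAL_WRITE_BIT |
                        VMA_ALLOCATION_CREATE_MAPPED_BIT; // keep it persistently mapped

VkBuffer buffer = VK_NULL_HANDLE;
VmaAllocation allocation = VK_NULL_HANDLE;
VmaAllocationInfo allocInfo = {};
vmaCreateBuffer(allocator, &bufCreateInfo, &allocCreateInfo, &buffer, &allocation, &allocInfo);
// allocInfo.pMappedData can be rewritten every frame; on the device above I would
// expect this allocation to end up in memory type 2 (DEVICE_LOCAL | HOST_VISIBLE | HOST_CACHED).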

adam-sawicki-a avatar Jul 21 '23 08:07 adam-sawicki-a

I realized as well that what I said doesn't make much sense, a bit silly of me. What you're suggesting makes the most sense. Thanks a lot for taking the time to explain! I really appreciate it. :) It's a bit sad that the best way to learn what these flags do is by word of mouth.

marc0246 avatar Jul 21 '23 09:07 marc0246