
Possible race condition on VmaAllocator between defragmentation and regular allocations

Open · sbourasse opened this issue 2 years ago · 4 comments

We've been using VMA extensively for our memory management on projects with heavy memory load. However, upon integrating the defragmentation utility, we started getting frequent memory corruption of internal VMA data. We narrowed down the source of this corruption to concurrent accesses on our VmaAllocator and (apparently) succeeded in preventing it by blocking allocations for the duration of the defragmentation. More context follows, with a couple of questions at the end.

We use a single VmaAllocator from two separate threads. Most of our allocations are done from the rendering thread, which is also the one that runs the defragmentation. We also allocate from the main thread, although less frequently.

Our defragmentation integration looks something like this (it has the same structure as the example in the online documentation):

vmaBeginDefragmentation(mainAllocator, ...);

for (;;)
{
	// VK_SUCCESS means there is nothing (more) to defragment; VK_INCOMPLETE means the pass has moves to process.
	if (vmaBeginDefragmentationPass(mainAllocator, ...) == VK_SUCCESS)
		break;

	// Create the destination resources, record and submit the copies, then wait for them to finish.
	CreateResourcesAndCopy();
	WaitForCompletion();

	// VK_SUCCESS means defragmentation is finished after this pass.
	if (vmaEndDefragmentationPass(mainAllocator, ...) == VK_SUCCESS)
		break;
}

vmaEndDefragmentation(mainAllocator, ...);

Running this bit of code on heavy workloads from our rendering thread, while our main thread is free to use mainAllocator for other allocations, we reliably trigger memory corruption, located specifically in the first 4 bytes of random VmaAllocation entries of the VmaDefragmentationPassMoveInfo::pMoves array we are working with.

Regardless of this particular symptom, we managed to prevent the corruption by introducing a mutex that guards the region between vmaBeginDefragmentation() and vmaEndDefragmentation(), as well as the other VMA calls on mainAllocator that could otherwise run concurrently with the defragmentation.
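For illustration, a minimal sketch of that workaround might look like the code below. The mutex, the wrapper functions, and the helper calls are our own hypothetical names, not part of VMA, and error handling is mostly omitted:

#include <mutex>
#include "vk_mem_alloc.h"

// Helpers from the snippet above (application-defined).
void CreateResourcesAndCopy();
void WaitForCompletion();

// Application-owned mutex (hypothetical). Every VMA call on mainAllocator that
// can overlap defragmentation takes this same lock.
std::mutex gMainAllocatorMutex;

// Main-thread allocations go through a guarded wrapper.
VkResult GuardedCreateBuffer(VmaAllocator mainAllocator,
                             const VkBufferCreateInfo* bufInfo,
                             const VmaAllocationCreateInfo* allocCreateInfo,
                             VkBuffer* buf, VmaAllocation* alloc)
{
    std::lock_guard<std::mutex> lock(gMainAllocatorMutex);
    return vmaCreateBuffer(mainAllocator, bufInfo, allocCreateInfo, buf, alloc, nullptr);
}

// Rendering thread: the whole region from vmaBeginDefragmentation() to
// vmaEndDefragmentation() is treated as one critical section on mainAllocator.
void DefragmentGuarded(VmaAllocator mainAllocator, const VmaDefragmentationInfo* defragInfo)
{
    std::lock_guard<std::mutex> lock(gMainAllocatorMutex);

    VmaDefragmentationContext ctx = nullptr;
    if (vmaBeginDefragmentation(mainAllocator, defragInfo, &ctx) != VK_SUCCESS)
        return;

    for (;;)
    {
        VmaDefragmentationPassMoveInfo pass = {};
        if (vmaBeginDefragmentationPass(mainAllocator, ctx, &pass) == VK_SUCCESS)
            break;

        CreateResourcesAndCopy();
        WaitForCompletion();

        if (vmaEndDefragmentationPass(mainAllocator, ctx, &pass) == VK_SUCCESS)
            break;
    }

    vmaEndDefragmentation(mainAllocator, ctx, nullptr);
}

This obviously serializes defragmentation against all other allocations, so it trades throughput for safety; it is only meant to show where the lock sits.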

As a side note, we first prevented memory corruption by enabling VMA_DEBUG_GLOBAL_MUTEX, changing VmaMutex::m_Mutex type to std::recursive_mutex, and locking it from our side right after vmaBeginDefragmentation() and releasing it right before vmaEndDefragmentation().
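For context, VMA_DEBUG_GLOBAL_MUTEX is one of the library's configuration macros and is normally meant for debugging only; it is enabled by defining it before compiling the implementation, roughly like this (the switch of VmaMutex::m_Mutex to std::recursive_mutex mentioned above is a local patch inside vk_mem_alloc.h and is not shown):

// In the single .cpp file that compiles the VMA implementation.
// With VMA_DEBUG_GLOBAL_MUTEX set to 1, every entry call into the library
// on this allocator is protected by one global mutex.
#define VMA_DEBUG_GLOBAL_MUTEX 1
#define VMA_IMPLEMENTATION
#include "vk_mem_alloc.h"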

Does it make sense to treat the region from vmaBeginDefragmentation() to vmaEndDefragmentation() as a critical section with regard to the VmaAllocator in use? If so, is that something that could be enforced from within the library?

sbourasse · Jan 18 '23 16:01

Thank you for reporting this bug, and for the good description. It sounds very serious. We will investigate it. Any help in debugging it is welcome.

Blocking any usage of the allocator between Begin and EndDefragmentation may be a good workaround for the bug, but it shouldn't be needed. Defragmentation was developed with the assumption that using the allocator while defragmentation is in progress is allowed.

Is your code 32- or 64-bit?

adam-sawicki-a · Jan 19 '23 10:01

We triggered this issue on x86-64 with an AMD GPU. We can definitely lend a hand with debugging; just let me know what we can do.

sbourasse · Jan 19 '23 10:01

Forgive me if I'm wrong, but I've been looking at the defragmentation code recently, and the map-transferring code seems really sketchy to me:

  • There is a window after the allocation data swap but before the maps are transferred to the new block, during which an allocation believes it holds references to a block that it no longer does; unmapping in that window can crash or underflow the reference count.
  • Copies to the new memory can occur while the same memory is being written elsewhere; the results vary depending on the memory type.
  • New maps can be opened on memory that has already been copied to the new location but before the allocation swap, resulting in a write that will never actually be used. The block can even be fully unmapped or freed by the defragmenter while this is happening.

From a logical standpoint, it just seems like a memory block being moved needs to be unwritable from the moment the copy begins until the end of the pass. I'm not saying that VMA needs to do this; it probably makes more sense for us users to take care of it. More generally, the VMA documentation says that allocations are not considered thread-safe, and the defragmentation manipulates them.
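To make that concrete, a rough sketch of such a user-side guard could look like the code below. Everything here is hypothetical application code, not something VMA provides: writers take a shared lock before mapping and writing, and the defragmentation pass takes the exclusive lock on each allocation it is about to move, so no write can overlap the copy.

#include <cstddef>
#include <cstring>
#include <shared_mutex>
#include "vk_mem_alloc.h"

// Hypothetical per-resource guard owned by the application.
struct GuardedAllocation
{
    VmaAllocation alloc = nullptr;
    std::shared_mutex guard; // shared: normal writes, exclusive: block is being moved
};

// Writer side: any code that maps and writes the allocation holds the shared lock.
void WriteToAllocation(VmaAllocator allocator, GuardedAllocation& a, const void* src, std::size_t size)
{
    std::shared_lock<std::shared_mutex> lock(a.guard);
    void* mapped = nullptr;
    if (vmaMapMemory(allocator, a.alloc, &mapped) == VK_SUCCESS)
    {
        std::memcpy(mapped, src, size);
        vmaUnmapMemory(allocator, a.alloc);
    }
}

// Defragmentation side: before copying a move's srcAllocation, take the exclusive
// lock and hold it until vmaEndDefragmentationPass(), so the block stays unwritable.
void LockForMove(GuardedAllocation& a)     { a.guard.lock(); }
void UnlockAfterPass(GuardedAllocation& a) { a.guard.unlock(); }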

Am I wrong here?

CannibalVox · Apr 07 '23 00:04