Crash Creating Descriptor Pool on ParaVirtualized Device

Open benn-geomagical opened this issue 1 year ago • 7 comments

Vulkan SDK Versions: 1.3.290.0, 1.3.293.0

OS: (uname -a) Darwin vm-osx-sonoma-16-g2-m1.8core-dff65d70-f2cc-4478-9109-1454c98324a3 23.5.0 Darwin Kernel Version 23.5.0: Wed May 1 20:12:39 PDT 2024; root:xnu-10063.121.3~5/RELEASE_ARM64_VMAPPLE arm64

Compiler: Xcode 16, Apple clang version 16.0.0

This crash occurs on virtualized macOS (Bitrise CI node) when invoking vkCreateDescriptorPool for either graphics or compute pipelines. The same code works without issue on iOS, iOS simulator and macOS (non-virtual) without validator warnings. Issue does not occur in Vulkan SDK 1.3.283.0 using the same compiler and running in the same environment.

Call stack:

*** Terminating app due to uncaught exception 'NSInvalidArgumentException', reason: '-[AppleParavirtDevice newArgumentEncoderWithLayout:]: unrecognized selector sent to instance 0x11c813c00'
*** First throw call stack:
(
	0   CoreFoundation                      0x0000000195d0f2ec __exceptionPreprocess + 176
	1   libobjc.A.dylib                     0x00000001957f6788 objc_exception_throw + 60
	2   CoreFoundation                      0x0000000195dc156c -[NSObject(NSObject) __retain_OA] + 0
	3   AppleParavirtGPUMetalIOGPUFamily    0x0000000104968f5c doUncompressedBlit + 11180
	4   Metal                               0x000000019ffd2ec4 -[_MTLDevice newArgumentEncoderWithArguments:structType:] + 136
	5   libMoltenVK.dylib                   0x0000000105a10278 _ZN17MVKDescriptorPool48getMetalArgumentBufferEncodedResourceStorageSizeEmmm + 328
	6   libMoltenVK.dylib                   0x0000000105a0f864 _ZN17MVKDescriptorPool23initMetalArgumentBufferEPK26VkDescriptorPoolCreateInfo + 372
	7   libMoltenVK.dylib                   0x0000000105a0f440 _ZN17MVKDescriptorPoolC2EP9MVKDevicePK26VkDescriptorPoolCreateInfo + 1304
	8   libMoltenVK.dylib                   0x0000000105a8884c _ZN9MVKDevice20createDescriptorPoolEPK26VkDescriptorPoolCreateInfoPK21VkAllocationCallbacks + 48
	9   libMoltenVK.dylib                   0x0000000105a20fc0 vkCreateDescriptorPool + 84
	10  libVkLayer_khronos_validation.dylib 0x000000010bdefbf0 _Z28DispatchCreateDescriptorPoolP10VkDevice_TPK26VkDescriptorPoolCreateInfoPK21VkAllocationCallbacksPP18VkDescriptorPool_T + 136
	11  libVkLayer_khronos_validation.dylib 0x000000010bd48e44 _ZN20vulkan_layer_chassis20CreateDescriptorPoolEP10VkDevice_TPK26VkDescriptorPoolCreateInfoPK21VkAllocationCallbacksPP18VkDescriptorPool_T + 444
	12  libRenderer.dylib                   0x00000001049e8374 _ZN3vkr17DescriptorSetPool6createERKNS_11BindingDataEj + 2144
	13  libRenderer.dylib                   0x00000001049d5630 _ZN3vkr16GraphicsPipeline6createERKNS_27GraphicsPipelineDescriptionE + 564
	14  libRenderer.dylib                   0x00000001049d4968 _ZN3gmg8Renderer16setupUtilityGrmpERKNS_17UtilityShaderCallERNSt3__16vectorIN3vkr11DeviceImageENS4_9allocatorIS7_EEEERKNS_14TextureAttribsERPNS6_10RenderPassERNS6_16GraphicsPipelineERN3vku6Handle
	15  libRenderer.dylib                   0x00000001049d2710 _ZN3gmg8Renderer23dispatchUtilityCubeGrmpERKNS_17UtilityShaderCallERNSt3__16vectorIN3vkr11DeviceImageENS4_9allocatorIS7_EEEERKNS_14TextureAttribsERS7_ + 396
	16  libRenderer.dylib                   0x00000001049d10c8 _ZN3gmg8Renderer19dispatchUtilityCallERKNS_17UtilityShaderCallENS_14TextureAttribsENS_13TextureIdSpec7AssetIdERNS_12CaptureImageENS_19UtilityOutputOptionE + 800
	17  decorate-engine                     0x00000001042d8cb8 _ZN3gmg19CompleteEnvironment8BuildCtx19dispatchUtilityCallERNS_17UtilityShaderCallERNS_11TextureSpecE + 80
	18  decorate-engine                     0x00000001042dd12c _ZN3gmg19CompleteEnvironment26buildEnvBaseCubeFromSphereERNS0_8BuildCtxE + 880
	19  decorate-engine                     0x00000001042dc06c _ZN3gmg19CompleteEnvironment16buildEnvBaseCubeERNS0_8BuildCtxE + 204
	20  decorate-engine                     0x00000001042d62ec _ZZN3gmg19CompleteEnvironment5buildENS_11SceneViewIdERKNS_13SceneMetadataERKNS_19PbrSceneLightingUBOEN3glm3vecILi2EjLNS8_9qualifierE0EEEPKcRNS_15RenderInterfaceEENK3$_0clEv + 1092
	21  decorate-engine                     0x000000010426dbc8 _ZN3gmg19CompleteEnvironment5buildENS_11SceneViewIdERKNS_13SceneMetadataERKNS_19PbrSceneLightingUBOEN3glm3vecILi2EjLNS8_9qualifierE0EEEPKcRNS_15RenderInterfaceE + 876
	22  decorate-engine                     0x000000010424bc6c _ZN3gmg6Engine16buildEnvironmentENS_11SceneViewIdE + 636
	23  decorate-engine                     0x0000000104266e1c _ZN3gmg6Engine9loadSceneERKNSt3__18functionIFvRKNS_10SceneSetupEEEERKNS1_12basic_stringIcNS1_11char_traitsIcEENS1_9allocatorIcEEEESH_ + 1328
	24  decorate-engine                     0x0000000104271fac _ZN3gmg6Engine12prepareSceneEv + 216
	25  decorate-engine                     0x000000010422a008 _ZN3gmg13DEngineBridge8mainLoopEv + 88
	26  decorate-engine                     0x00000001042299b4 _ZN3gmg13DEngineBridge3runEv + 2672
	27  decorate-engine                     0x00000001042316ac main + 152
	28  dyld                                0x00000001958320e0 start + 2360
)
libc++abi: terminating due to uncaught exception of type NSException

Example of crashing invocation:

  VkDescriptorPool vkPool = VK_NULL_HANDLE;
  VkDescriptorPoolCreateInfo poolInfo{VK_STRUCTURE_TYPE_DESCRIPTOR_POOL_CREATE_INFO};
  VkDescriptorPoolSize poolSize = {VK_DESCRIPTOR_TYPE_COMBINED_IMAGE_SAMPLER, 1};
  poolInfo.maxSets = 1;
  poolInfo.poolSizeCount = 1;
  poolInfo.pPoolSizes = &poolSize;
  auto result = vkCreateDescriptorPool(device, &poolInfo, nullptr, &vkPool);  

benn-geomagical avatar Oct 09 '24 20:10 benn-geomagical

Interesting situation.

The reason you may not have been encountering this in previous versions of MoltenVK is that MoltenVK version v1.2.11 (SDK 1.3.296) defaults to using Metal Argument Buffers, whereas previous versions did not.

To revert, you can set the environment variable MVK_CONFIG_USE_METAL_ARGUMENT_BUFFERS=0 when running your app.
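If setting the variable from the shell or the CI configuration is awkward, it can also be set from inside the process. The following is only a minimal sketch: the helper names are hypothetical, and it assumes MoltenVK reads the process environment no later than Vulkan instance creation, which this thread does not confirm.

```cpp
#include <stdlib.h>  // setenv (POSIX), getenv
#include <string>

// Hypothetical helper, not part of MoltenVK: request the pre-argument-buffer
// code path for this process. Assumption: this must run before the Vulkan
// instance is created so that MoltenVK sees the value.
inline bool disableMoltenVKArgumentBuffers() {
    // The final argument (1) overwrites any value inherited from the parent.
    return setenv("MVK_CONFIG_USE_METAL_ARGUMENT_BUFFERS", "0", 1) == 0;
}

// Hypothetical helper: return the current setting, or an empty string if unset.
inline std::string moltenVKArgumentBufferSetting() {
    const char* value = getenv("MVK_CONFIG_USE_METAL_ARGUMENT_BUFFERS");
    return value ? std::string(value) : std::string();
}
```

Calling disableMoltenVKArgumentBuffers() at the very start of main, before any Vulkan call, keeps the workaround in one place. Setting the variable in the environment that launches the app, as described above, remains the simpler option.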

If the default setting (argument buffers enabled) is working on devices but not in your CI virtualization environment, there might be a mismatch between the capabilities your environment reports and what is actually supported.

With Sonoma and a Tier 2 GPU, your environment should be using the Metal3 style of argument buffers, which does not use argument encoders. But somehow, it is using argument buffer encoders, and then Metal seems to be failing on the internal calls.

What type of GPU are you running on?

billhollings avatar Oct 10 '24 18:10 billhollings

> What type of GPU are you running on?

Based on OP's description, which indicates an ARM64 architecture, I'd assume Apple Silicon.

cdavis5e avatar Oct 10 '24 23:10 cdavis5e

The GPU is the integrated device on the host Apple Silicon M1, but as you point out, the virtualization may be producing incorrect results when capabilities are checked.

Will try the environment variable workaround - thank you, @billhollings

benn-geomagical avatar Oct 29 '24 16:10 benn-geomagical

The environment variable workaround prevents the crash on the CI node. Updating to the latest MoltenVK is now unblocked for us. Thank you again, @billhollings

For anyone else hitting this issue during iOS simulator integration tests, you'll want to add the same environment variable to the xcscheme for your project.

For example, in YourSwiftProject/.swiftpm/xcode/xcshareddata/YourProject.xcscheme:

<Scheme ...>
  <!-- other scheme stuff -->
  <LaunchAction ...>
    <EnvironmentVariables>
      <EnvironmentVariable
        key = "MVK_CONFIG_USE_METAL_ARGUMENT_BUFFERS"
        value = "0"
        isEnabled = "YES">
      </EnvironmentVariable>
    </EnvironmentVariables>
  </LaunchAction>
</Scheme>

benn-geomagical avatar Oct 29 '24 17:10 benn-geomagical

I am hitting what appears to be the same issue in the vkd3d CI, whose macOS runner does run inside a virtual machine, and therefore with a paravirtualized device. An example of a failing log is https://gitlab.winehq.org/giomasce/vkd3d/-/jobs/115045/artifacts/raw/artifacts/000-c930856/tests/hlsl/abs.log.

This was marked as "Question", but it seems there is a real bug here. It is not necessarily in MoltenVK; indeed, it looks like the bug might be in Apple's driver. Has that been triaged already, and possibly submitted to Apple? If not, I might try to do that myself.

giomasce avatar Nov 07 '24 09:11 giomasce

> Has that been triaged already, and possibly submitted to Apple? If not I might try to do that myself.

Since this is a narrow, environment-specific issue (CI and virtual machines), it's very hard to triage and debug in the general case.

Any further help you could provide in triaging and debugging in your environment would be most helpful.


It's also curious to me that the error is in a call to -[_MTLDevice newArgumentEncoderWithArguments:structType:]. Metal argument encoders are only used under the following conditions:

_metalFeatures.needsArgumentBufferEncoders = (_metalFeatures.argumentBuffers &&
                                              !(mvkOSVersionIsAtLeast(13.0, 16.0, 1.0) &&
                                                supportsMTLGPUFamily(Metal3) &&
                                                _metalFeatures.argumentBuffersTier >= MTLArgumentBuffersTier2));

On an M1 using at least macOS 13 Ventura, MoltenVK should be using Metal3 argument buffers, which do not require argument encoders. I'm wondering if the virtualization environment is somehow interfering with one or more of the _metalFeatures.needsArgumentBufferEncoders tests above. Although strange, that shouldn't necessarily be problematic, since MoltenVK does handle Metal argument encoders on earlier OS versions.

In addition, AppleParavirtDevice is not a MTLDevice implementation that I have seen before. Is it an Apple class, or is it coming from a 3rd-party virtualization environment? Based on the unrecognized selector error, it seems to be an incomplete implementation of MTLDevice.
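To make it easier to see which single test would have to come back wrong for encoders to be selected, the condition quoted above can be restated as a small standalone function. This is only an illustrative sketch: the ReportedCaps struct and its field names are hypothetical stand-ins for the values MoltenVK queries, not MoltenVK's actual data structures.

```cpp
// Hypothetical stand-in for the inputs to the quoted check.
struct ReportedCaps {
    bool argumentBuffers;        // argument buffers are enabled
    bool osAtLeastMetal3;        // stands in for mvkOSVersionIsAtLeast(13.0, 16.0, 1.0)
    bool supportsMetal3Family;   // stands in for supportsMTLGPUFamily(Metal3)
    bool argumentBuffersTier2;   // stands in for argumentBuffersTier >= MTLArgumentBuffersTier2
};

// Encoders are needed only when argument buffers are on and the
// Metal3-style path is not available.
inline bool needsArgumentBufferEncoders(const ReportedCaps& caps) {
    const bool metal3Style = caps.osAtLeastMetal3 &&
                             caps.supportsMetal3Family &&
                             caps.argumentBuffersTier2;
    return caps.argumentBuffers && !metal3Style;
}
```

With all four values true, as expected for an M1 on Sonoma, the function returns false. Flipping any one of the three Metal3-related values to false makes it return true, which is the path that ends in the newArgumentEncoderWithArguments: call seen in the stack trace.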

billhollings avatar Nov 07 '24 15:11 billhollings

It turns out that this is pretty easy to reproduce in a virtualized environment. This code is enough to trigger the crash:

import Metal

for gpu in MTLCopyAllDevices() {
  let desc = MTLArgumentDescriptor()
  desc.dataType = .texture
  desc.index = 0

  let descs = [desc]

  let _ = gpu.makeArgumentEncoder(arguments: descs)
}

You can test it with Tart, using the commands provided on https://tart.run/quick-start/, and then running swift test.swift on the file above. I tested these combinations:

  • Sequoia host, Sonoma guest: bug happens
  • Sequoia host, Sequoia guest: bug does not happen
  • Sonoma host, Sequoia guest: bug does not happen

So it would seem that the bug was just fixed on Sequoia. Of course, that program doesn't test anything beyond creating an argument buffer encoder, which is very little. Anecdotally, though, I managed to run a few vkd3d tests: there were failures, but failures are also expected on bare metal devices, and I didn't investigate whether there were additional failures that I could attribute to the paravirtualized device. Most of the tests still passed, anyway.

giomasce avatar Nov 08 '24 22:11 giomasce