MoltenVK icon indicating copy to clipboard operation
MoltenVK copied to clipboard

gl_SubgroupInvocationID ranges from 0 to 31 for gl_SubgroupSize 64.

Open FunMiles opened this issue 4 years ago • 9 comments

I am running a very simple compute shader with a single group execution with the below source.

#version 460
#extension GL_KHR_shader_subgroup_basic : enable
layout (local_size_x = 64) in;

layout(std430, set=0, binding=0) buffer Bres {
    uint count[];
};

void main() {
        count[gl_GlobalInvocationID.x] = gl_SubgroupInvocationID + gl_SubgroupSize;
}

The output in the count vector is (16 values per line, with the line start index at the start):

0 : 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79
16 : 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95
32 : 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79
48 : 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95

So the variable gl_SubgroupSize is 64 but gl_SubgroupInvocationID only goes from 0 to 31.

On AMD, one expects that subgroup size is 64. The Intel device has subgroup size of 32. However I am running on the discrete GPU.

I believe this is a bug. Am I wrong?

FunMiles avatar Apr 01 '22 00:04 FunMiles

What kind of AMD device? Some AMD devices support 32-invocation subgroups.

cdavis5e avatar Apr 01 '22 00:04 cdavis5e

AMD radeon pro 5500m. Yes it does support 32-invocation subgroups as reported:

        SubgroupSizeControlProperties:
                maxComputeWorkgroupSubgroups  = 32
                maxSubgroupSize               = 64
                minSubgroupSize               = 32

However don't forget the GLSL code does report that the gl_SubgroupSize IS 64, not 32. I should be able to test if an invocation is the last one in the subgroup, but in this case, none of them have an index of 63 so my full code fails.

FunMiles avatar Apr 01 '22 00:04 FunMiles

PS: I also made sure that there's no change in the subgroup size somewhere else. The query of the device does also report the current setting as being 64:

        SubgroupProperties:
                quadOperationsInAllStages = true
                subgroupSize              = 64
                supportedOperations       = { Basic | Vote | Arithmetic | Ballot | Shuffle | ShuffleRelative | Quad }
                supportedStages           = { TessellationControl | Fragment | Compute }

FunMiles avatar Apr 01 '22 00:04 FunMiles

I've encountered the same issue on macOS Sonoma 14.2:

I've set local_size_x via local_size_x_id specialization constant to the value VkPhysicalDeviceSubgroupProperties::subgroupSize. For Intel UHD Graphics 630, I then have gl_SubgroupInvocationID varying from 0 to 16 (excl), but the gl_SubgroupSize value in the Metal shader code gets a hardcoded 32 value. So we only have 16 threads per subgroup inactive. For AMD Radeon Pro 5300M, I have gl_SubgroupInvocationID varying from 0 to 64 (excl) and the gl_SubgroupSize value in the Metal shader code gets a hardcoded 64 value. So here we have 32 threads per subgroup inactive.

The hardcoded gl_SubgroupSize values are the same as in VkPhysicalDeviceSubgroupProperties::subgroupSize. The gl_SubgroupID is also in the range [ 0, gl_NumSubgroups [.

I think this so far is all correct and Vulkan spec conforming.

However, the VkPhysicalDeviceSubgroupSizeControlFeatures::computeFullSubgroups is set, which means VK_PIPELINE_SHADER_STAGE_CREATE_REQUIRE_FULL_SUBGROUPS_BIT can be used ensuring that we don't have any inactive invocations in our subgroup. When I set this and make sure local workgroup size in the X dimension is a multiple of VkPhysicalDeviceSubgroupProperties::subgroupSize, I would expect that the gl_NumSubgroups value is two times lower and the gl_SubgroupInvocationID values to cover the full [ 0, gl_SubgroupSize [ range. That's not the case as I don't see any difference.

In the MoltenVK code, VK_PIPELINE_SHADER_STAGE_CREATE_REQUIRE_FULL_SUBGROUPS_BIT is only used to set the MTLComputePipelineDescriptor::threadGroupSizeIsMultipleOfThreadExecutionWidth property but does that enforce Vulkan's 'full subgroup' requirement ?

Possible options:

  1. Don't set VkPhysicalDeviceSubgroupSizeControlFeatures::computeFullSubgroups anymore (assuming there is no way in Metal to have 'full subgroups').
  2. Leave set VkPhysicalDeviceSubgroupSizeControlFeatures::computeFullSubgroups set but fix gl_SubgroupSize to limit to only the real number of active invocations in a subgroup and support VkPipelineExecutablePropertiesKHR so that after pipeline creation we can know the actual SubgroupSize (which is then different than VkPhysicalDeviceSubgroupProperties::subgroupSize)

Aside, VK_PIPELINE_SHADER_STAGE_CREATE_REQUIRE_FULL_SUBGROUPS_BIT is implied for SPIR-V 1.6 module versions. Is MoltenVK testing on this somewhere ? I can't seem to find this.

jtytgat avatar Mar 01 '24 18:03 jtytgat

~~This is probably a bug in Metal.~~

~~SPIRV-Cross translates BuiltInSubgroupSize (gl_SubgroupSize in GLSL) to Metal's threads_per_simdgroup. The wrong values of this parameter are thus coming from the Metal shader compiler backend, which probably hardcodes its value. This was reasonable before variable subgroup sizes, but not anymore. You should probably file a feedback with Apple.~~ EDIT: Scratch that. We are in fact emitting fixed constants for some reason. I'll have to dig through the history to find out why...

cdavis5e avatar Mar 03 '24 10:03 cdavis5e

Oh now I remember.

At the time I wrote that code, and possibly still today, several of the tests in the CTS depended on the reported SubgroupSize in the shader being the same as the maxSubgroupSize reported through Vulkan. I wonder if the CTS has been changed since then; in that case, we might be able to get rid of this kludge.

cdavis5e avatar Mar 03 '24 11:03 cdavis5e

At the time I wrote that code, and possibly still today, several of the tests in the CTS depended on the reported SubgroupSize in the shader being the same as the maxSubgroupSize reported through Vulkan. I wonder if the CTS has been changed since then; in that case, we might be able to get rid of this kludge.

Which CTS tests? I can look for them in the next CTS run.

billhollings avatar Mar 06 '24 21:03 billhollings

Which CTS tests? I can look for them in the next CTS run.

I don't remember off the top of my head but I believe it was something under the dEQP-VK.subgroups.* hierarchy, possibly the builtin variables tests. Apparently, I added this when I first implemented subgroups for Vulkan 1.1 support. I think this was before I implemented VK_EXT_subgroup_size_control, because enabling varying subgroup sizes (VK_PIPELINE_SHADER_STAGE_CREATE_ALLOW_VARYING_SUBGROUP_SIZE_BIT) is supposed to turn this off.

Hey... @FunMiles, @jtytgat, did you try setting that bit?

cdavis5e avatar Mar 06 '24 22:03 cdavis5e