OpenCL-CTS
OpenCL-CTS copied to clipboard
Monkey-patch around a crash when SG size > 128 WIs
This PR is more of a discussion opener:
cl_khr_sub_group_ballot seems to implicitly assume max 128 sized SGs due to the return value type uint4. However, the max SG size is not stated anywhere in the specs and thus should not affect the other sub group functionality to my understanding.
The 128 assumption is used for WI masking for the basic subgroup test, causing a crash when exceeding it.
This just ups the limit to 1280 to push the limit up to work around the issue. A proper fix would be to use dynamic bit vector size here or define a max SG size in the specs (which doesn't make sense).
I've just hit this crash too. From my initial investigation, most of the tests use a local work-group size of 200. It's entirely possible for the sub-group size to be equal to the work-group size - it's explicitly called "the degenerate case" in the spec. From this it's obvious that an assumed limit of 128 is incorrect.
I can't imagine that a sub-group could ever be larger than a work-group as the spec is quite clear that sub-groups are an arrangement of work-items in a work-group. So that could be a reasonable limit from the tests' point of view?
I found an old version of the non-uniform subgroups spec where we recorded this issue:
In Vulkan, the ballot function returns a uint4, which effectively caps the largest subgroup size at 128. Is this big enough?
Supporting a maximum subgroup size of 128 seems reasonable. Implementations where a work group and subgroup are synonymous may restrict the maximum work group size to 128 if a subgroup ballot function is used.
So, we at least thought about this case.
Note, restricting "the maximum work group size to 128" doesn't restrict the device's maximum work-group size, so CL_DEVICE_MAX_WORK_GROUP_SIZE would be unchanged, and only the per-kernel CL_KERNEL_WORK_GROUP_SIZE would be affected for these types of implementations. Given the uint4 return type for the subgroup ballot functions I can't think of another solution that would work.
@bashbaug That's interesting: I hadn't actually read up on the ballot function until now. I think I agree with your interpretation - if a kernel uses the ballot builtin, its max work-group size must be at most 128.
From the CTS's point of view, though, the ballot tests are using a local work-group size of 64 so this issue shouldn't even come up in those tests if we agree the sub-group size should never exceed the work-group size. It's the rest of the subgroup tests that use local work sizes of 200 that are liable to exceed the bounds of the bitset in this fashion.
So I think that either the tests should lower their work sizes to at most 128, use more dynamic data structures, or have a global maximum work-group size that matches the size of the bitset used - ideally a limit that's caught if people try fiddling with the work-group sizes locally when debugging and accidentally exceed it.
It's the rest of the subgroup tests that use local work sizes of 200 that are liable to exceed the bounds of the
bitsetin this fashion.
I think I get it, but just to be sure: the issue is unrelated to the ballot tests but rather is for the other non-uniform subgroup tests. The other non-uniform subgroup tests are using a local work size of 200 and therefore a subgroup size of up to 200, but are only using a 128-bit bitset for masks, which is insufficient.
If this is correct, the easiest fix IMHO is to choose a smaller local work size for these tests, say 100 rather than 200. We could expand the size of the bitsets also but this would be a more intrusive change I think, since we'd probably want to update our mask generation.
@svenh @StuartDBrady @gwawiork, does this sound reasonable? Do you have a better suggestion? Thanks!
Need time, reviewing.
Sreelakshmi Haridas Maruthur Qualcomm
It appears that the spec implicitly limits the ballot builtin to 128 so the bitset definition seems correct and implementations must be limiting max workgroup size to 128 when using the builtin.
@lakshmih it is correct for the ballot tests, but it affects also non-ballot tests where the SG size can be larger.
Can we make the change to limit the work-group size (and hence the maximum sub-group size) to be smaller than 128 work-items for all of non-uniform subgroup tests? I haven't heard of a better proposal and this is almost certainly "more right" than these current changes which only increase the size of the bitset.
Re-pinging @svenh @StuartDBrady @gwawiork in case they missed this discussion previously...
Yes, makes sense to me.
I'm going to take the "focused review" off of this PR for now because next steps are clear. We can add it back when the changes described above are made. Thanks!
Could mobica craft a proper fix for this?