GPU autoscheduling with Mullapudi2016: the reference implementation
Rationale:

- To compare the GPU auto-scheduling performance of Mullapudi2016 against Li2018 and Anderson2021.
- To reduce the following claims to practice, quoting the original Mullapudi2016 article (a hand-written schedule illustrating this mapping is sketched right after this list):

  > Portability to Different Architectures: GPU Performance: The inlining, tiling, and grouping processes are otherwise similar to the CPU case. Groups resulting from merging are mapped to CUDA kernels by designating the outer tile loops as GPU block grid dimensions and the inner tile loops as GPU thread block dimensions. All intermediate buffers within a group are allocated in GPU shared memory.

- To implement the so-called "single level tiling only" limitation in the Mullapudi2016 and Sioutas2020 algorithms, according to the findings in the Anderson2021 paper:

  > [Mullapudi et al.] develops an automatic scheduling technique using a heuristic cost model and a greedy stage grouping algorithm... but its search space is smaller compared to ours among other reasons because it only supports a single level of tiling, and as we discuss in Section 6.2, this excludes a number of high performance schedules.
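To make the quoted mapping concrete, here is a small hand-written Halide schedule (my own illustrative pipeline and tile sizes, not output of the autoscheduler): the outer tile loops of the group output become the CUDA block grid, the inner tile loops become the thread block, and a producer computed per block is allocated in GPU shared memory by default.

```cpp
#include "Halide.h"
using namespace Halide;

int main() {
    Var x("x"), y("y"), xo("xo"), yo("yo"), xi("xi"), yi("yi");

    Func input("input"), blur_x("blur_x"), blur_y("blur_y");
    input(x, y) = cast<float>(x + y);
    blur_x(x, y) = (input(x - 1, y) + input(x, y) + input(x + 1, y)) / 3.0f;
    blur_y(x, y) = (blur_x(x, y - 1) + blur_x(x, y) + blur_x(x, y + 1)) / 3.0f;

    // Group output: a single level of tiling; outer tile loops -> GPU blocks,
    // inner tile loops -> GPU threads.
    blur_y.compute_root()
          .tile(x, y, xo, yo, xi, yi, 16, 16)
          .gpu_blocks(xo, yo)
          .gpu_threads(xi, yi);

    // Group member: computed once per GPU block, so its intermediate buffer is
    // placed in GPU shared memory by default.
    blur_x.compute_at(blur_y, xo)
          .gpu_threads(x, y);

    // Lowering only; no GPU is needed to inspect the generated loop nest.
    blur_y.compile_to_lowered_stmt("blur_gpu.stmt", {}, Text,
                                   get_host_target().with_feature(Target::CUDA));
    return 0;
}
```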
Change summary:

Reverse engineer the GPU scheduling feature as stated in Section 5.4 of Mullapudi's article:

Mullapudi, Adams, Sharlet, Ragan-Kelley, Fatahalian. Automatically scheduling Halide image processing pipelines. ACM Transactions on Graphics, 35(4), Article 83, pp. 1–11. https://doi.org/10.1145/2897824.2925952

When `target=cuda` is detected in the code generator command-line arguments, intercept all `vectorize` and `parallel` scheduling calls requested by the auto-vectorization and auto-parallelization algorithms with the class `GPUTilingDedup` for deferred execution.
Implement the class `GPUTilingDedup` to ensure all Halide GPU schedule calls are idempotent: no matter how many times the Stage is vectorized, reordered, parallelized, and then reordered again, the `reorder` and `gpu_threads()` schedules are called exactly once. Also, intercept all `split` and `reorder` scheduling calls issued by Mullapudi's auto-splitting algorithm.
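A simplified sketch of the deferral idea, not the PR's actual `GPUTilingDedup` implementation: each scheduling request from the Mullapudi passes is recorded, duplicates collapse, and `flush()` replays them against the `Stage` exactly once.

```cpp
#include "Halide.h"

#include <map>
#include <string>
#include <vector>

using namespace Halide;

// Records the schedule requests coming from the auto-vectorization,
// auto-parallelization, and auto-splitting passes and replays them once,
// so the resulting reorder()/gpu_threads() calls stay idempotent.
class GPUTilingDedup {
    Stage stage;

    struct SplitRequest {
        VarOrRVar old_var, outer, inner;
        Expr factor;
    };
    std::map<std::string, SplitRequest> splits;    // keyed by old var name: duplicates collapse
    std::map<std::string, VarOrRVar> thread_vars;  // dims marked by auto-vectorization
    std::map<std::string, VarOrRVar> block_vars;   // dims marked by auto-parallelization
    std::vector<VarOrRVar> final_order;            // only the last reorder() request survives
    bool flushed = false;

public:
    explicit GPUTilingDedup(Stage s) : stage(std::move(s)) {}

    // Interception points: record the request instead of mutating the schedule.
    void record_split(const VarOrRVar &v, const VarOrRVar &outer,
                      const VarOrRVar &inner, const Expr &factor) {
        splits.insert_or_assign(v.name(), SplitRequest{v, outer, inner, factor});
    }
    void record_vectorize(const VarOrRVar &v) { thread_vars.emplace(v.name(), v); }
    void record_parallel(const VarOrRVar &v) { block_vars.emplace(v.name(), v); }
    void record_reorder(const std::vector<VarOrRVar> &order) { final_order = order; }

    // Replay everything exactly once; further calls are no-ops.
    void flush() {
        if (flushed) return;
        flushed = true;
        for (const auto &kv : splits) {
            const SplitRequest &s = kv.second;
            stage.split(s.old_var, s.outer, s.inner, s.factor);
        }
        if (!final_order.empty()) {
            stage.reorder(final_order);
        }
        for (const auto &kv : thread_vars) stage.gpu_threads(kv.second);
        for (const auto &kv : block_vars) stage.gpu_blocks(kv.second);
    }
};
```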
Implement the class `GPUTileHelper` to enforce an atomic transaction of the GPU schedules. If the current Stage is `compute_root`, mark all auto-split inner dimensions as `gpu_threads` and the outer dimensions as `gpu_blocks`. If the Stage is `compute_at` another Stage, mark all `vectorize` dimensions as `gpu_threads`.
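A similarly hedged sketch of that policy (the `AutoSplit` type and function name are illustrative, not the PR's `GPUTileHelper` code): the collected tile dimensions are committed in one step, and the block/thread mapping depends on whether the Stage is `compute_root` or `compute_at` another Stage.

```cpp
#include "Halide.h"
#include <vector>

using namespace Halide;

// One auto-split produced by Mullapudi's tiling pass (illustrative type).
struct AutoSplit {
    VarOrRVar outer, inner;
};

// Commit the whole GPU mapping in a single step, so a partially applied
// schedule is never left behind.
void commit_gpu_tiling(Stage stage, const std::vector<AutoSplit> &splits,
                       bool is_compute_root) {
    if (is_compute_root) {
        // compute_root stage: outer tile loops form the CUDA grid,
        // inner tile loops form the thread block.
        for (const auto &s : splits) stage.gpu_blocks(s.outer);
        for (const auto &s : splits) stage.gpu_threads(s.inner);
    } else {
        // Stage computed inside another Func's gpu_blocks loop: it must not
        // introduce its own gpu_blocks, so only the vectorized (inner)
        // dimensions become gpu_threads.
        for (const auto &s : splits) stage.gpu_threads(s.inner);
    }
}
```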
If auto-splitting of the current Stage does not produce any tile, fall back to a rudimentary tiling with tile size = vector_length x parallel_factor.
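That fallback could look roughly like this (again an illustrative sketch; the exact block/thread assignment in the PR may differ):

```cpp
#include "Halide.h"

using namespace Halide;

// Fallback when the auto-splitting pass produced no tile at all: one split of
// the innermost pure dimension, sized vector_length * parallel_factor, mapped
// to a 1-D grid of 1-D thread blocks.
void fallback_gpu_tile(Stage stage, const VarOrRVar &x,
                       int vector_length, int parallel_factor) {
    Var xo("xo"), xi("xi");
    stage.split(x, xo, xi, vector_length * parallel_factor)
         .gpu_blocks(xo)
         .gpu_threads(xi);
}
```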
If Mullapudi does not issue any `split`, `vectorize`, or `parallel` schedules, assume a scalar reduction routine and implement it on the GPU via `single_thread`.
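A minimal hand-written analogue of that case, using Halide's `gpu_single_thread()` directive (my own example, not the PR's generated schedule):

```cpp
#include "Halide.h"

using namespace Halide;

int main() {
    Func sum("sum");
    RDom r(0, 1024, "r");

    // A scalar reduction: no pure dimensions to split, vectorize, or parallelize.
    sum() = 0.0f;
    sum() += cast<float>(r.x);

    // Run both the init and the update as a single GPU thread.
    sum.gpu_single_thread();
    sum.update().gpu_single_thread();

    // Lowering only; inspect the generated loop nest without needing a GPU.
    sum.compile_to_lowered_stmt("scalar_reduction.stmt", {}, Text,
                                get_host_target().with_feature(Target::CUDA));
    return 0;
}
```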
cc'ed @aekul, @jrk, @abadams.
See also: https://github.com/halide/Halide/issues/7491
Thanks for this! IIRC the original GPU version of this autoscheduler was what we charitably describe as "research code", and was never fit for production.
As this is an attempted reconstruction of his GPU autoscheduler, I should probably tag @ravi-teja-mullapudi to see if this looks sane, because this will affect how people cite and compare to his work in future.
Several bot failures with:
/home/halidenightly/build_bot/worker/halide-testbranch-main-llvm18-x86-32-linux-make/halide-source/src/autoschedulers/mullapudi2016/AutoSchedule.cpp:2830:21: error: unused variable ‘types’ [-Werror=unused-variable]
Done removing the offending line. I also rebased the changes on top of `main`.
Update: perhaps we need a separate PR to check for unused variables in the CMake configs:
```diff
diff --git a/src/CMakeLists.txt b/src/CMakeLists.txt
index 47e90864d..83ded47a1 100644
--- a/src/CMakeLists.txt
+++ b/src/CMakeLists.txt
@@ -587,6 +587,8 @@ target_compile_options(
 $<$<CXX_COMPILER_ID:GNU,Clang,AppleClang>:-Wno-unused-function>
 $<$<CXX_COMPILER_ID:GNU,Clang,AppleClang>:-Wno-unused-macros>
 $<$<CXX_COMPILER_ID:GNU,Clang,AppleClang>:-Wno-unused-parameter>
+$<$<CXX_COMPILER_ID:GNU,Clang,AppleClang>:-Wno-unused-variable>
+$<$<CXX_COMPILER_ID:GNU,Clang,AppleClang>:-Wno-unused-const-variable>
 $<$<CXX_COMPILER_ID:Clang,AppleClang>:-Wno-c++98-compat-pedantic>
 $<$<CXX_COMPILER_ID:Clang,AppleClang>:-Wno-c++98-compat>
```
@steven-johnson and @abadams, thank you for testing the PR on the CI. Yes, the failure is triggered by the CMake build option `-DHalide_TARGET=host-[metal|gpu]`. I didn't know we could do that; I like the feature. I will reproduce it on my machine.
There are two types of generator failures.

The first: `Functions that are compute_at() a gpu_block() loop must specify the innermost gpu_block() loop for that Func.` It stems from the over-estimated "L2 cache size per thread" machine parameter; the value should have been ~70 kB instead of 16 MB. It is described in the original paper as a limitation, not a bug. But yeah, we should have a better exception-handling mechanism for this actionable error; I need help to improve the user experience.
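In the meantime, a workaround is to hand the autoscheduler a more realistic cache size. A minimal sketch, assuming the current `AutoschedulerParams` API and using the ~70 kB estimate above (the parameter values and plugin path are illustrative, not part of this PR):

```cpp
#include "Halide.h"

using namespace Halide;

// Apply Mullapudi2016 with a last-level cache size of ~70 kB instead of the
// CPU-oriented 16 MB default, so stage groups stay small enough to map onto a
// single CUDA thread block.
void autoschedule_for_gpu(Pipeline p) {
    // Path/name of the autoscheduler plugin; adjust for your build tree.
    load_plugin("libautoschedule_mullapudi2016.so");

    Target target = get_host_target().with_feature(Target::CUDA);
    AutoschedulerParams params("Mullapudi2016",
                               {{"parallelism", "128"},               // available parallelism on the target
                                {"last_level_cache_size", "71680"}}); // ~70 kB, per the estimate above
    p.apply_autoscheduler(target, params);
}
```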
Another generator failure: `Functions that are compute_at() a gpu_block() loop cannot have their own gpu_block() loops`. It happens in scalar reduction stages scheduled via `compute_at`. Resolving the bug...
Updated to the main branch to fix the OSX WebGPU failures.
Update: The GPU scheduling extension for Mullapudi2016 passes all Buildbot tests except for `autograd_grad.generator` and `local_laplacian_generator`.

- `autograd_grad` passes the Buildbot tests, but the unnamed `Var x` triggers a `basic_string::_M_construct == null` error on LLVM16 and a `!name.empty()` error on LLVM18.
  https://github.com/halide/Halide/blob/69c75b34767dff5572cf52c8b75596804804c283/test/generator/autograd_generator.cpp#L23
- `local_laplacian_generator` triggers a subtle `!name.empty()` exception in the Halide IR.
@abadams Yeah, I agree the Buildbot CI jobs ensure production-quality auto-schedulers, which was not the original goal of the Mullapudi2016 GPU extensions. I will switch this PR to a draft and work on issue 2 later next week.