Halide
Add GPU autoscheduler
This is a draft PR to merge the anderson2021 GPU autoscheduler (#5602). It includes the code (there is some overlap with adams2019, but it will probably be a substantial amount of work to deduplicate them), some tests, utility scripts for generating data/statistics/etc. (these can be removed if desired), and baseline weights (trained on a V100).
Step 1: please run run-clang-format.sh and run-clang-tidy.sh on these and fix the issues :-)
Some of the files (e.g. ASLog.h/.cpp, PerfectHashMap.h, and a few others) are identical and could be merged. Would autoschedulers/common be the right place for them? Looks like ASLog.h/.cpp are already there.
Unfortunately, most of the other files (e.g. Autoschedule.h/cpp, LoopNest.h/cpp, etc.) have diverged considerably and it would be a significant amount of work to merge them.
Some of the files (e.g. ASLog.h/.cpp, PerfectHashMap.h, and a few others) are identical and could be merged. Would autoschedulers/common be the right place for them? Looks like ASLog.h/.cpp are already there.
Yep!
Unfortunately, most of the other files (e.g. Autoschedule.h/cpp, LoopNest.h/cpp, etc.) have diverged considerably and it would be a significant amount of work to merge them.
No worries then.
I think I've now moved all the identical files to autoschedulers/common. Are there other things that you think should be done or is this ready for review?
Are there other things that you think should be done or is this ready for review?
I think it's ready for review.
Just FYI, I am getting multiple build failures from the latest commit, using the commands:
$ cmake -G Ninja -DCMAKE_BUILD_TYPE=Release -S . -B build -DTARGET_WEBASSEMBLY=OFF
$ cmake --build build -j17
Error log here: https://gist.github.com/vladl-innopeaktech/6539b811dc2c90bcb965660ea0e8b529
Using this patch https://gist.github.com/vladl-innopeaktech/37e5fc044479ca7186911dc140d71479 I am able to whittle down the errors to just some linker failure here:
[1/23] Linking CXX executable src/autoschedulers/anderson2021/anderson2021-test_apps_autoscheduler
FAILED: src/autoschedulers/anderson2021/anderson2021-test_apps_autoscheduler
: && /usr/bin/c++ -O3 -DNDEBUG src/autoschedulers/anderson2021/CMakeFiles/anderson2021-test_apps_autoscheduler.dir/test.cpp.o -o src/autoschedulers/anderson2021/anderson2021-test_apps_autoscheduler -Wl,-rpath,/home/vladl/gpu-autosched/Halide/build/src src/libHalide.so.15.0.0 -ldl -pthread && :
/usr/bin/ld: src/autoschedulers/anderson2021/CMakeFiles/anderson2021-test_apps_autoscheduler.dir/test.cpp.o: in function `main':
test.cpp:(.text.startup+0x5e1): undefined reference to `Halide::Pipeline::auto_schedule(Halide::Target const&, Halide::MachineParams const&) const'
/usr/bin/ld: test.cpp:(.text.startup+0xf49): undefined reference to `Halide::Pipeline::auto_schedule(Halide::Target const&, Halide::MachineParams const&) const'
/usr/bin/ld: test.cpp:(.text.startup+0x1af9): undefined reference to `Halide::Pipeline::auto_schedule(Halide::Target const&, Halide::MachineParams const&) const'
/usr/bin/ld: test.cpp:(.text.startup+0x25d1): undefined reference to `Halide::Pipeline::auto_schedule(Halide::Target const&, Halide::MachineParams const&) const'
/usr/bin/ld: test.cpp:(.text.startup+0x2cb3): undefined reference to `Halide::Pipeline::auto_schedule(Halide::Target const&, Halide::MachineParams const&) const'
/usr/bin/ld: src/autoschedulers/anderson2021/CMakeFiles/anderson2021-test_apps_autoscheduler.dir/test.cpp.o:test.cpp:(.text.startup+0x2f03): more undefined references to `Halide::Pipeline::auto_schedule(Halide::Target const&, Halide::MachineParams const&) const' follow
collect2: error: ld returned 1 exit status
I think I know what is going on -- I will apply your patch and then work on a fix
@aekul -- please sync this branch with Halide main, then apply the patch that @vladl-innopeaktech offered in the gist above; I will have an additional patch to apply on top of his (working on it now) but I don't have permission to push directly to your fork.
Please note: the CMake file uses add_halide_library(... TARGETS cmake) in a lot of places; this cannot possibly be right for this scheduler, since it requires a GPU feature, but TARGETS cmake is defined as "The meta-triple cmake is equal to the arch-bits-os of the current CMake target"... i.e., no GPU feature will ever be present.
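For illustration, a minimal CMake sketch of that distinction; the generator and library names below are placeholders, not from this PR:
# "my_gen" / "my_filter_*" are hypothetical names, used only to show the TARGETS semantics.
# The meta-triple "cmake" resolves to the arch-bits-os of the current CMake target,
# so no GPU feature is ever attached:
add_halide_library(my_filter_cpu FROM my_gen TARGETS cmake)
# A GPU schedule needs a target string that names a GPU feature explicitly, e.g.:
add_halide_library(my_filter_gpu FROM my_gen TARGETS host-cuda)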
OK, per my previous comment, after you sync this to main and apply the other patch, please apply https://gist.github.com/steven-johnson/263f0fafd34ebbac052873d6340ab56f on top of that -- it gets everything building and (apparently) working on my system (though some work surely remains to be done).
On top of getting it building and passing tests, I also (mostly) finished the work to remove use of env vars for settings, moving them into the AutoschedulerParams setup that is now used in mainline Halide. I think I got all this right.
(EDIT: oops, I forgot to update the .sh files to set the params explicitly rather than setting the old env vars... I'll do that as a followup, or if you like, you can do so, using the scripts from Adams2019 as a model.)
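For context on the AutoschedulerParams change mentioned above, a hedged sketch of what that can look like on the CMake side; the names are placeholders and the exact param spellings are defined by this PR rather than stated here:
# Assumes add_halide_library's AUTOSCHEDULER/PARAMS options, and that the former
# env-var settings are exposed as "autoscheduler.<setting>" generator params.
# "my_filter" / "my_gen" are hypothetical names; randomize_tilings is one of the
# anderson2021 settings mentioned later in this thread.
add_halide_library(my_filter FROM my_gen
                   AUTOSCHEDULER Halide::Anderson2021
                   PARAMS autoscheduler.randomize_tilings=1)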
Thanks for the patches. Synced with main and applied both of them.
Regarding TARGETS cmake, is there a way to specify cuda as part of the target? I've mostly tested the autoscheduler using the Makefile (i.e. using host-cuda for the target) so I'm less familiar with Halide's use of CMake.
Thanks for the patches. Synced with main and applied both of them.
...are you sure? The GitHub page says there is an unresolved conflict in src/autoschedulers/adams2019/CMakeLists.txt
Regarding TARGETS cmake, is there a way to specify cuda as part of the target?
Adding @alexreinking as I don't know the answer here. Be aware, though, that adding cuda specifically in the build file is problematic in that CUDA isn't available on all systems -- I presume this autoscheduler isn't CUDA-specific and would work with e.g. Metal or D3D12Compute as well.
I've mostly tested the autoscheduler using the Makefile (i.e. using host-cuda for the target) so I'm less familiar with Halide's use of CMake.
CMake is the build system of record; Make support is only on a best-effort basis. I highly recommend that you consider working with the CMake build system to reduce the number of build glitches you encounter.
Testing the head of this branch, I got: 'Error: Constraint violated: head2_filter.extent.1 (73) == 39 (39)'. Repro here: https://gist.github.com/skhiat/fe363b516143ee9f47b7b13822e214fe (this code uses D3D12Compute; tested with CUDA, same result). It's a trivial separated 3x3 box filter.
Be aware, though, that adding cuda specifically in the build file is problematic in that CUDA isn't available on all systems -- I presume this autoscheduler isn't CUDA-specific and would work with e.g. Metal or D3D12Compute as well.
The argument of interest to add_halide_library is FEATURES cuda cuda_capability_XX, where XX is the minimum CUDA capability number you need (e.g. 50). But as @steven-johnson says, this isn't appropriate for the main build.
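Concretely, something like the following (placeholder names; the FEATURES list is the relevant part -- cuda plus a minimum capability):
# "my_filter_cuda" / "my_gen" are hypothetical names.
# cuda_capability_50 requests compute capability 5.0 as the minimum.
add_halide_library(my_filter_cuda FROM my_gen
                   FEATURES cuda cuda_capability_50)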
The autoscheduler itself presumably does not need CUDA to build. That's the part that belongs in src/autoscheduler/.... Tests belong in tests/autoscheduler/<autoscheduler-name> and may have additional dependencies.
That's the part that belongs in src/autoscheduler/.... Tests belong in tests/autoscheduler/ and may have additional dependencies.
That's beyond the scope of this PR, unfortunately. We should really refactor the other autoscheduler tests in this way first so that there's an existing framework to work with.
That's beyond the scope of this PR, unfortunately. We should really refactor the other autoscheduler tests in this way first so that there's an existing framework to work with.
Maybe so, but maybe that just means this PR has to wait until that refactoring is done. It will be at least as much work to add an optional cuda dependency here.
Fix locally, in src/autoschedulers/adams2019/CMakeLists.txt:
- add_executable(adams2019_weightsdir_to_weightsfile ${COMMON_DIR}/weightsdir_to_weightsfile.cpp Weights.cpp)
+ add_executable(adams2019_weightsdir_to_weightsfile ${COMMON_DIR}/weightsdir_to_weightsfile.cpp ${COMMON_DIR}/Weights.cpp)
(I don't have permission to push to this branch.)
Any more updates on this? Years in the making here :-) Would love to see this land some day. Thanks!
Looks like the test reorganization mentioned above has been done. I'll try to take another look at this in the next few days.
@aekul thanks! It would help me out a ton if it landed.
@aekul It looks like the autotuning script and the retrain_cost_model binary are out of sync. I've tried to bridge the gap with the arguments, but it looks like retrain_cost_model just gets stuck on loading samples forever. Are they very far out of sync? (i.e. should I keep trying to get it working or leave it to you?)
Haven't had a chance to test the scripts since my last commits but I'll take a look soon.
Ok. My team is very interested in getting the autotuning running and I've been playing with this branch for the last couple of months so if there's anything you'd like to delegate (esp wrt the build system) in this PR, I can get time to work on it.
The one issue I know of with the autotuning scripts is that they use the Makefiles to build Halide and the autoscheduler. So they likely need to be ported to use CMake. Does that sound like the issue you are encountering or is it something else?
I've updated the autotuning/data generation scripts to use the CMake build and tidied some things up. It should now be possible to build everything with CMake and run the scripts to autotune a generator. There are 3 things to be aware of:
1. autotune_loop.sh links RunGenMain.o into each benchmark binary separately from the output of the autoscheduler. I didn't see a way to build RunGenMain.o with CMake, so I built it with the Makefiles and copied it to path/to/build/tools.
2. If you want to autotune the apps, they need to be compiled. The root CMakeLists.txt does not automatically compile the apps. I added add_subdirectory(apps) to it and had to comment out add_app(hannk) from apps/CMakeLists.txt. I haven't checked in these changes because I'm not sure if add_subdirectory(apps) was omitted intentionally.
3. There is currently a buildbot error that included_schedule_file.schedule.h does not exist. I can't currently check it in because a rule in .gitignore is preventing it. So in order to build things, it's currently necessary to add this file locally (e.g. by copying it from the corresponding adams2019 directory).
@steven-johnson Do you have thoughts/suggestions on any of the above, particularly for (3) and whether I should override that rule and check in the file or do something else?
I didn't see a way to build RunGenMain.o with CMake
Link to Halide::RunGenMain.
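In CMake terms, that is roughly the following (placeholder names); add_halide_library's REGISTRATION output gives the rungen executable its registration source, and Halide::RunGenMain supplies the main() that RunGenMain.o provided in the Make-based flow:
# "my_filter" / "my_gen" / "my_filter_rungen" are hypothetical names.
add_halide_library(my_filter FROM my_gen
                   REGISTRATION my_filter_registration)
add_executable(my_filter_rungen ${my_filter_registration})
target_link_libraries(my_filter_rungen PRIVATE my_filter Halide::RunGenMain)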
2. I haven't checked in these changes because I'm not sure if add_subdirectory(apps) was omitted intentionally.
It is omitted intentionally.
I can't currently check it in because a rule in .gitignore is preventing it.
You can override with git add -f (force). That rule exists to prevent accidentally checking in compiler outputs.
Please get the buildbots green before requesting a review :-)
@steven-johnson One of the tests for Mullapudi2016 (mullapudi2016_reorder) is failing. Looks like it's a performance test, but I don't see anything in this PR that would impact that. Has this kind of thing happened before?
Hello, I tested a simple histogram with this branch:
namespace H = Halide;     // assumed alias (not shown in the original snippet)
using UInt32 = uint32_t;  // assumed typedefs (not shown in the original snippet)
using Int32 = int32_t;
H::ImageParam src{ H::UInt(8), 3, "src" };
H::RDom imgDom(src);
H::Var z("z");
H::Func hist("hist");     // assumed declaration (not shown in the original snippet)
hist(z) = H::cast<UInt32>(0);
hist(H::clamp(H::cast<Int32>(src(imgDom[0], imgDom[1], imgDom[2])), 0, 255)) += H::cast<UInt32>(1);
Anderson2021's autoscheduler gives me:
Best cost: 6.38469
auto pipeline = get_pipeline();
Func hist = pipeline.get_func(1);
Var z(hist.get_schedule().dims()[0].var);
Var zi("zi");
RVar src_x(hist.update(0).get_schedule().dims()[0].var);
RVar src_y(hist.update(0).get_schedule().dims()[1].var);
RVar src_z(hist.update(0).get_schedule().dims()[2].var);
Var zi_serial_outer("zi_serial_outer");
hist.update(0)
.reorder(src_x, src_y, src_z)
.gpu_single_thread();
hist
.split(z, z, zi, 32, TailStrategy::ShiftInwards)
.compute_root()
.reorder(zi, z)
.gpu_blocks(z)
.split(zi, zi_serial_outer, zi, 32, TailStrategy::GuardWithIf)
.gpu_threads(zi);
produce hist:
gpu_block z.z<Default_GPU>:
gpu_thread z.zi.zi in [0, 31]<Default_GPU>:
hist(...) = ...
gpu_block __outermost.__outermost.v39 in [0, 0]<Default_GPU>:
gpu_thread __outermost.v40 in [0, 0]<Default_GPU>:
for src:
for src:
for src:
hist(...) = ...
And Li2018 gives me:
hist.compute_root()
.split(z,v3,v4,64,ShiftInwards)
.reorder(v4,v3)
.gpu_blocks(v3)
.gpu_threads(v4)
;
hist.update(0)
.split(src$y,r18,r19,48,GuardWithIf)
.split(src$z,r20,r21,40,GuardWithIf)
;
hist_intm = hist.update(0)
.rfactor({{r18,v11},{r20,v12}})
.compute_root()
.split(v11,v11,v21,32,GuardWithIf)
.fuse(v11,v12,v11)
.reorder(v21,v11)
.gpu_blocks(v11)
.gpu_threads(v21)
;
hist_intm.update()
.split(v11,v11,v22,32,GuardWithIf)
.fuse(v11,v12,v11)
.fuse(src$x,r19,src$x)
.fuse(src$x,r21,src$x)
.reorder(v22,v11,src$x)
.gpu_blocks(v11)
.atomic()
.gpu_blocks(src$x)
.gpu_threads(v22)
hist.update(0)
.split(z,v31,v32,64,GuardWithIf)
.reorder(r18,r20,v32,v31)
.gpu_blocks(v31)
.gpu_threads(v32)
;
src_im.compute_root()
.split(_1,v35,v36,64,ShiftInwards)
.fuse(_0,v35,_0)
.fuse(_0,_2,_0)
.reorder(v36,_0)
.reorder_storage(_1,_0,_2)
.gpu_blocks(_0)
.gpu_threads(v36)
;
produce src_im:
gpu_block _0._0._0<Default_GPU>:
gpu_thread _1.v36 in [0, 63]<Default_GPU>:
src_im(...) = ...
consume src_im:
produce hist_intm:
gpu_block v11.v11.v11<Default_GPU>:
gpu_thread v11.v21 in [0, 31]<Default_GPU>:
for z:
hist_intm(...) = ...
gpu_block src.src.src<Default_GPU>:
gpu_block v11.v11.v11<Default_GPU>:
gpu_thread v11.v22 in [0, 31]<Default_GPU>:
hist_intm(...) = ...
consume hist_intm:
produce hist:
gpu_block z.v3<Default_GPU>:
gpu_thread z.v4 in [0, 63]<Default_GPU>:
hist(...) = ...
gpu_block z.v31<Default_GPU>:
gpu_thread z.v32 in [0, 63]<Default_GPU>:
for src.r20:
for src.r18:
hist(...) = ...
What surprises me is that Anderson2021 generates a 1x1 dispatch in a single thread, while Li2018 generates proper parallel compute with multiple threads.
The options I used for Anderson2021:
{ "randomize_tilings", "1" },
{ "search_space_options", "1111" },
{ "num_passes", "100" },
{ "shared_memory_limit_kb", "64" },
{ "shared_memory_sm_limit_kb", "64" },