Halide
Add GPU autoscheduler
This is a draft PR to merge the anderson2021 GPU autoscheduler (#5602). It includes the code (there is some overlap with adams2019, but it will probably be a substantial amount of work to deduplicate them), some tests, utility scripts for generating data/statistics/etc. (these can be removed if desired), and baseline weights (trained on a V100).
Step 1: please run run-clang-format.sh and run-clang-tidy.sh on these and fix the issues :-)
Some of the files (e.g. ASLog.h/.cpp, PerfectHashMap.h, and a few others) are identical and could be merged. Would autoschedulers/common be the right place for them? Looks like ASLog.h/.cpp are already there.
Unfortunately, most of the other files (e.g. Autoschedule.h/cpp, LoopNest.h/cpp, etc.) have diverged considerably and it would be a significant amount of work to merge them.
Some of the files (e.g. ASLog.h/.cpp, PerfectHashMap.h, and a few others) are identical and could be merged. Would autoschedulers/common be the right place for them? Looks like ASLog.h/.cpp are already there.
Yep!
Unfortunately, most of the other files (e.g. Autoschedule.h/cpp, LoopNest.h/cpp, etc.) have diverged considerably and it would be a significant amount of work to merge them.
No worries then.
I think I've now moved all the identical files to autoschedulers/common. Are there other things that you think should be done or is this ready for review?
Are there other things that you think should be done or is this ready for review?
I think it's ready for review.
Just FYI, I am getting multiple build failures from the latest commit, using the commands:
$ cmake -G Ninja -DCMAKE_BUILD_TYPE=Release -S . -B build -DTARGET_WEBASSEMBLY=OFF
$ cmake --build build -j17
Error log here: https://gist.github.com/vladl-innopeaktech/6539b811dc2c90bcb965660ea0e8b529
Using this patch https://gist.github.com/vladl-innopeaktech/37e5fc044479ca7186911dc140d71479 I am able to whittle down the errors to just some linker failure here:
[1/23] Linking CXX executable src/autoschedulers/anderson2021/anderson2021-test_apps_autoscheduler
FAILED: src/autoschedulers/anderson2021/anderson2021-test_apps_autoscheduler
: && /usr/bin/c++ -O3 -DNDEBUG src/autoschedulers/anderson2021/CMakeFiles/anderson2021-test_apps_autoscheduler.dir/test.cpp.o -o src/autoschedulers/anderson2021/anderson2021-test_apps_autoscheduler -Wl,-rpath,/home/vladl/gpu-autosched/Halide/build/src src/libHalide.so.15.0.0 -ldl -pthread && :
/usr/bin/ld: src/autoschedulers/anderson2021/CMakeFiles/anderson2021-test_apps_autoscheduler.dir/test.cpp.o: in function `main':
test.cpp:(.text.startup+0x5e1): undefined reference to `Halide::Pipeline::auto_schedule(Halide::Target const&, Halide::MachineParams const&) const'
/usr/bin/ld: test.cpp:(.text.startup+0xf49): undefined reference to `Halide::Pipeline::auto_schedule(Halide::Target const&, Halide::MachineParams const&) const'
/usr/bin/ld: test.cpp:(.text.startup+0x1af9): undefined reference to `Halide::Pipeline::auto_schedule(Halide::Target const&, Halide::MachineParams const&) const'
/usr/bin/ld: test.cpp:(.text.startup+0x25d1): undefined reference to `Halide::Pipeline::auto_schedule(Halide::Target const&, Halide::MachineParams const&) const'
/usr/bin/ld: test.cpp:(.text.startup+0x2cb3): undefined reference to `Halide::Pipeline::auto_schedule(Halide::Target const&, Halide::MachineParams const&) const'
/usr/bin/ld: src/autoschedulers/anderson2021/CMakeFiles/anderson2021-test_apps_autoscheduler.dir/test.cpp.o:test.cpp:(.text.startup+0x2f03): more undefined references to `Halide::Pipeline::auto_schedule(Halide::Target const&, Halide::MachineParams const&) const' follow
collect2: error: ld returned 1 exit status
I think I know what is going on -- I will apply your patch and then work on a fix
@aekul -- please sync this branch with Halide main, then apply the patch that @vladl-innopeaktech offered in the gist above; I will have an additional patch to apply on top of his (working on it now) but I don't have permission to push directly to your fork.
Please note: the CMake file uses add_halide_library(... TARGETS cmake) in a lot of places; this cannot possibly be right for this scheduler, since it requires a GPU feature, but TARGETS cmake is defined as "The meta-triple cmake is equal to the arch-bits-os of the current CMake target"... i.e., no GPU feature will ever be present.
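For illustration, a minimal CMake sketch of that distinction; the generator and library names below are placeholders, not from this PR:
# "my_gen" / "my_filter_*" are hypothetical names, used only to show the TARGETS semantics.
# The meta-triple "cmake" resolves to the arch-bits-os of the current CMake target,
# so no GPU feature is ever attached:
add_halide_library(my_filter_cpu FROM my_gen TARGETS cmake)
# A GPU schedule needs a target string that names a GPU feature explicitly, e.g.:
add_halide_library(my_filter_gpu FROM my_gen TARGETS host-cuda)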
OK, per my previous comment, after you sync this to main and apply the other patch, please apply https://gist.github.com/steven-johnson/263f0fafd34ebbac052873d6340ab56f on top of that -- it gets everything building and (apparently) working on my system (though some work surely remains to be done).
On top of getting it building and passing tests, I also (mostly) finished the work to remove use of env vars for settings, moving them into the AutoschedulerParams setup that is now used in mainline Halide. I think I got all this right.
(EDIT: oops, I forgot to update the .sh files to set the params explicitly rather than setting the old env vars... I'll do that as a followup, or if you like, you can do so, using the scripts from Adams2019 as a model.)
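For context on the AutoschedulerParams change mentioned above, a hedged sketch of what that can look like on the CMake side; the names are placeholders and the exact param spellings are defined by this PR rather than stated here:
# Assumes add_halide_library's AUTOSCHEDULER/PARAMS options, and that the former
# env-var settings are exposed as "autoscheduler.<setting>" generator params.
# "my_filter" / "my_gen" are hypothetical names; randomize_tilings is one of the
# anderson2021 settings mentioned later in this thread.
add_halide_library(my_filter FROM my_gen
                   AUTOSCHEDULER Halide::Anderson2021
                   PARAMS autoscheduler.randomize_tilings=1)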
Thanks for the patches. Synced with main and applied both of them.
Regarding TARGETS cmake, is there a way to specify cuda as part of the target? I've mostly tested the autoscheduler using the Makefile (i.e. using host-cuda for the target) so I'm less familiar with Halide's use of CMake.
Thanks for the patches. Synced with main and applied both of them.
...are you sure? The GitHub page says there is an unresolved conflict in src/autoschedulers/adams2019/CMakeLists.txt
Regarding TARGETS cmake, is there a way to specify cuda as part of the target?
Adding @alexreinking as I don't know the answer here. Be aware, though, that adding cuda specifically in the build file is problematic in that CUDA isn't available on all systems -- I presume this autoscheduler isn't CUDA-specific and would work with e.g. Metal or D3D12Compute as well.
I've mostly tested the autoscheduler using the Makefile (i.e. using host-cuda for the target) so I'm less familiar with Halide's use of CMake.
CMake is the build system of record; Make support is only on a best-effort basis. I highly recommend that you consider working with the CMake build system to reduce the number of build glitches you encounter.
Testing the head of this branch, I got: 'Error: Constraint violated: head2_filter.extent.1 (73) == 39 (39)'. Repro here: https://gist.github.com/skhiat/fe363b516143ee9f47b7b13822e214fe (this code uses D3D12Compute; tested with CUDA, same result). It's a trivial separated 3x3 box filter.
Be aware, though, that adding cuda specifically in the build file is problematic in that CUDA isn't available on all systems -- I presume this autoscheduler isn't CUDA-specific and would work with e.g. Metal or D3D12Compute as well.
The argument of interest to add_halide_library is FEATURES cuda cuda_capability_XX, where XX is the minimum CUDA capability number you need (e.g. 50). But as @steven-johnson says, this isn't appropriate for the main build.
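Concretely, something like the following (placeholder names; the FEATURES list is the relevant part -- cuda plus a minimum capability):
# "my_filter_cuda" / "my_gen" are hypothetical names.
# cuda_capability_50 requests compute capability 5.0 as the minimum.
add_halide_library(my_filter_cuda FROM my_gen
                   FEATURES cuda cuda_capability_50)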
The autoscheduler itself presumably does not need CUDA to build. That's the part that belongs in src/autoscheduler/.... Tests belong in tests/autoscheduler/<autoscheduler-name> and may have additional dependencies.
That's the part that belongs in src/autoscheduler/.... Tests belong in tests/autoscheduler/ and may have additional dependencies.
That's beyond the scope of this PR, unfortunately. We should really refactor the other autoscheduler tests in this way first so that there's an existing framework to work with.
That's beyond the scope of this PR, unfortunately. We should really refactor the other autoscheduler tests in this way first so that there's an existing framework to work with.
Maybe so, but maybe that just means this PR has to wait until that refactoring is done. It will be at least as much work to add an optional cuda dependency here.
Fix locally, in src/autoschedulers/adams2019/CMakeLists.txt:
- add_executable(adams2019_weightsdir_to_weightsfile ${COMMON_DIR}/weightsdir_to_weightsfile.cpp Weights.cpp)
+ add_executable(adams2019_weightsdir_to_weightsfile ${COMMON_DIR}/weightsdir_to_weightsfile.cpp ${COMMON_DIR}/Weights.cpp)
(I don't have permission to push to this branch.)
Any more updates on this? Years in the making here :-) Would love to see this land some day. Thanks!
Looks like the test reorganization mentioned above has been done. I'll try to take another look at this in the next few days.
@aekul thanks! It would help me out a ton if it landed.
@aekul It looks like the autotuning script and the retrain_cost_model binary are out of sync. I've tried to bridge the gap with the arguments, but it looks like retrain_cost_model just gets stuck on loading samples forever. Are they very far out of sync? (i.e. should I keep trying to get it working or leave it to you?)
Haven't had a chance to test the scripts since my last commits but I'll take a look soon.
Ok. My team is very interested in getting the autotuning running and I've been playing with this branch for the last couple of months so if there's anything you'd like to delegate (esp wrt the build system) in this PR, I can get time to work on it.
The one issue I know of with the autotuning scripts is that they use the Makefiles to build Halide and the autoscheduler. So they likely need to be ported to use CMake. Does that sound like the issue you are encountering or is it something else?
I've updated the autotuning/data generation scripts to use the CMake build and tidied some things up. It should now be possible to build everything with CMake and run the scripts to autotune a generator. There are 3 things to be aware of:
1. autotune_loop.sh links RunGenMain.o into each benchmark binary separately from the output of the autoscheduler. I didn't see a way to build RunGenMain.o with CMake, so I built it with the Makefiles and copied it to path/to/build/tools.
2. If you want to autotune the apps, they need to be compiled. The root CMakeLists.txt does not automatically compile the apps. I added add_subdirectory(apps) to it and had to comment out add_app(hannk) from apps/CMakeLists.txt. I haven't checked in these changes because I'm not sure if add_subdirectory(apps) was omitted intentionally.
3. There is currently a buildbot error that included_schedule_file.schedule.h does not exist. I can't currently check it in because a rule in .gitignore is preventing it. So in order to build things, it's currently necessary to add this file locally (e.g. by copying it from the corresponding adams2019 directory).
@steven-johnson Do you have thoughts/suggestions on any of the above, particularly for (3) and whether I should override that rule and check in the file or do something else?
I didn't see a way to build RunGenMain.o with CMake
Link to Halide::RunGenMain.
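In CMake terms, that is roughly the following (placeholder names); add_halide_library's REGISTRATION output gives the rungen executable its registration source, and Halide::RunGenMain supplies the main() that RunGenMain.o provided in the Make-based flow:
# "my_filter" / "my_gen" / "my_filter_rungen" are hypothetical names.
add_halide_library(my_filter FROM my_gen
                   REGISTRATION my_filter_registration)
add_executable(my_filter_rungen ${my_filter_registration})
target_link_libraries(my_filter_rungen PRIVATE my_filter Halide::RunGenMain)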
2. I haven't checked in these changes because I'm not sure if add_subdirectory(apps) was omitted intentionally.
It is omitted intentionally.
I can't currently check it in because a rule in .gitignore is preventing it.
You can override with git add -f (force). That rule exists to prevent accidentally checking in compiler outputs.
Please get the buildbots green before requesting a review :-)
@steven-johnson One of the tests for Mullapudi2016 (mullapudi2016_reorder) is failing. Looks like it's a performance test, but I don't see anything in this PR that would impact that. Has this kind of thing happened before?
Hello, I tested a simple histogram with this branch:
namespace H = Halide;     // assumed alias (not shown in the original snippet)
using UInt32 = uint32_t;  // assumed typedefs (not shown in the original snippet)
using Int32 = int32_t;
H::ImageParam src{ H::UInt(8), 3, "src" };
H::RDom imgDom(src);
H::Var z("z");
H::Func hist("hist");     // assumed declaration (not shown in the original snippet)
hist(z) = H::cast<UInt32>(0);
hist(H::clamp(H::cast<Int32>(src(imgDom[0], imgDom[1], imgDom[2])), 0, 255)) += H::cast<UInt32>(1);
Anderson2021's autoscheduler gives me:
Best cost: 6.38469
auto pipeline = get_pipeline();
Func hist = pipeline.get_func(1);
Var z(hist.get_schedule().dims()[0].var);
Var zi("zi");
RVar src_x(hist.update(0).get_schedule().dims()[0].var);
RVar src_y(hist.update(0).get_schedule().dims()[1].var);
RVar src_z(hist.update(0).get_schedule().dims()[2].var);
Var zi_serial_outer("zi_serial_outer");
hist.update(0)
.reorder(src_x, src_y, src_z)
.gpu_single_thread();
hist
.split(z, z, zi, 32, TailStrategy::ShiftInwards)
.compute_root()
.reorder(zi, z)
.gpu_blocks(z)
.split(zi, zi_serial_outer, zi, 32, TailStrategy::GuardWithIf)
.gpu_threads(zi);
produce hist:
gpu_block z.z<Default_GPU>:
gpu_thread z.zi.zi in [0, 31]<Default_GPU>:
hist(...) = ...
gpu_block __outermost.__outermost.v39 in [0, 0]<Default_GPU>:
gpu_thread __outermost.v40 in [0, 0]<Default_GPU>:
for src:
for src:
for src:
hist(...) = ...
And Li2018 gives me:
hist.compute_root()
.split(z,v3,v4,64,ShiftInwards)
.reorder(v4,v3)
.gpu_blocks(v3)
.gpu_threads(v4)
;
hist.update(0)
.split(src$y,r18,r19,48,GuardWithIf)
.split(src$z,r20,r21,40,GuardWithIf)
;
hist_intm = hist.update(0)
.rfactor({{r18,v11},{r20,v12}})
.compute_root()
.split(v11,v11,v21,32,GuardWithIf)
.fuse(v11,v12,v11)
.reorder(v21,v11)
.gpu_blocks(v11)
.gpu_threads(v21)
;
hist_intm.update()
.split(v11,v11,v22,32,GuardWithIf)
.fuse(v11,v12,v11)
.fuse(src$x,r19,src$x)
.fuse(src$x,r21,src$x)
.reorder(v22,v11,src$x)
.gpu_blocks(v11)
.atomic()
.gpu_blocks(src$x)
.gpu_threads(v22)
hist.update(0)
.split(z,v31,v32,64,GuardWithIf)
.reorder(r18,r20,v32,v31)
.gpu_blocks(v31)
.gpu_threads(v32)
;
src_im.compute_root()
.split(_1,v35,v36,64,ShiftInwards)
.fuse(_0,v35,_0)
.fuse(_0,_2,_0)
.reorder(v36,_0)
.reorder_storage(_1,_0,_2)
.gpu_blocks(_0)
.gpu_threads(v36)
;
produce src_im:
gpu_block _0._0._0<Default_GPU>:
gpu_thread _1.v36 in [0, 63]<Default_GPU>:
src_im(...) = ...
consume src_im:
produce hist_intm:
gpu_block v11.v11.v11<Default_GPU>:
gpu_thread v11.v21 in [0, 31]<Default_GPU>:
for z:
hist_intm(...) = ...
gpu_block src.src.src<Default_GPU>:
gpu_block v11.v11.v11<Default_GPU>:
gpu_thread v11.v22 in [0, 31]<Default_GPU>:
hist_intm(...) = ...
consume hist_intm:
produce hist:
gpu_block z.v3<Default_GPU>:
gpu_thread z.v4 in [0, 63]<Default_GPU>:
hist(...) = ...
gpu_block z.v31<Default_GPU>:
gpu_thread z.v32 in [0, 63]<Default_GPU>:
for src.r20:
for src.r18:
hist(...) = ...
What surprises me is that Anderson2021 generates a 1x1 dispatch in a single thread, while Li2018 generates proper parallel compute with multiple threads.
The options I used for Anderson2021:
{ "randomize_tilings", "1" },
{ "search_space_options", "1111" },
{ "num_passes", "100" },
{ "shared_memory_limit_kb", "64" },
{ "shared_memory_sm_limit_kb", "64" },