[BUILD] Change CUDA and HIP to default off
As discussed in https://github.com/iree-org/iree/pull/17564, the default build should not include targets that require an SDK. This PR disables CUDA and ROCM by default.
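For reference, the change boils down to flipping the defaults of the relevant top-level CMake options, roughly like this (option names from the IREE build; description strings paraphrased, not the exact option text):

```cmake
# New defaults: SDK-dependent targets are opt-in.
option(IREE_TARGET_BACKEND_CUDA "Builds the CUDA compiler target backend." OFF)
option(IREE_TARGET_BACKEND_ROCM "Builds the ROCM compiler target backend." OFF)
option(IREE_HAL_DRIVER_CUDA "Builds the CUDA runtime HAL driver." OFF)
option(IREE_HAL_DRIVER_HIP "Builds the HIP runtime HAL driver." OFF)
```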
@ScottTodd You are changing a lot of these scripts at the moment, right? How much do you still have to change? Can you give me a hint about what needs to be updated afterwards? I would like to get this in soonish, because the AArch64 CUDA package is regarded as a follow-up and I want to unblock this path.
> You are changing a lot of these scripts at the moment, right? How much do you still have to change? Can you give me a hint about what needs to be updated afterwards? I would like to get this in soonish, because the AArch64 CUDA package is regarded as a follow-up and I want to unblock this path.
Yep, sorry for not giving this much (visible) attention. It's still in my queue and top of mind.
I've been mainly organizing the CI and PkgCI workflows so we have fewer builds requiring large self-hosted runner VMs and we build release packages on host platforms then test packages across targets. This is adjacent to some of that and I expect it will be easier to manage once the dust settles a bit more (~ 1 week?). We could try to push through both at the same time, but I'd prefer to focus my attention and limit the surface area of builds to debug.
Here's how I think this can work out:
1. I finish migrating jobs to pkgci
2. I clean up the old build scripts (CI workflows should either be explicit about what commands they are running or should use CMake presets - right now they go through these `build_all.sh`-style scripts, which you're editing in this PR, that pass environment variables around)
3. We flip the CMake defaults for more optional components to disabled, updating documentation, scripts, and workflows to enable the components that are needed
Steps 2-3 are still a bit unclear in my head... there are too many indirection layers to make those changes with confidence right now.
We also have some macOS py-runtime-pkg release builds failing since they are including Vulkan code and that somehow broke recently. In that case we can have Vulkan default to disabled and then decide explicitly if it should be enabled there, much like the decision you'd like to make to enable CUDA for ARM Jetson boards.
Sounds good. Let's get your changes in first. After editing them, I am not a huge fan of the build_all.sh scripts... They are also shared between macOS and Linux, which makes this PR bigger than it needs to be.
I've started deleting or limiting the use of the build_tools/cmake/build_*.sh scripts. One thing I'm not fully sure of in the context of this work is how we want to handle opting CI jobs in to build these components once they are made optional.
Consider the case of the 'runtime' job matrix: https://github.com/iree-org/iree/blob/a7e578845ba120c3ac452ddff67303c8772aa5b6/.github/workflows/ci.yml#L151-L216
Should the macOS build there opt in to `-DIREE_HAL_DRIVER_METAL=ON`? If so, should that job be split out of the matrix, or should that setting somehow get passed through to only one platform? The script was the default place to put such logic (either via automatic detection or environment variable plumbing). A replacement could be a set of presets in https://github.com/iree-org/iree/tree/main/build_tools/cmake/presets. The matrix could choose a specific preset for each platform, and those presets would list the options they want to enable.
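A hypothetical preset for that case (file layout and names made up for illustration; only `IREE_BUILD_COMPILER` and `IREE_HAL_DRIVER_METAL` are real options):

```json
{
  "version": 3,
  "configurePresets": [
    {
      "name": "runtime-macos",
      "displayName": "Runtime-only build with the Metal HAL driver",
      "cacheVariables": {
        "IREE_BUILD_COMPILER": "OFF",
        "IREE_HAL_DRIVER_METAL": "ON"
      }
    }
  ]
}
```

The macOS matrix entry would then just run `cmake --preset runtime-macos` instead of plumbing options through a script.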
Sorry, refactoring the scripts and workflows is taking longer than originally anticipated (shocker 😅). We could try this again rebased on the latest scripts, but there will still be some plumbing through env vars and confusing indirection to watch out for.
We also have a few breaking changes in flight to batch together into a stable release. This sort of change could also be included in that batch.
No worries! Disabling stuff in the default build does not change the release, imo. However, the AArch64 CUDA backend would. Maybe we can at least get that in? It would be separate from this.
I am going to take a look at CMake presets. Can you maybe explain what "Because these can interact badly with tools" from the README refers to?
> No worries! Disabling stuff in the default build does not change the release, imo. However, the AArch64 CUDA backend would. Maybe we can at least get that in? It would be separate from this.
We should end up with the same set of available compiler target backends and runtime HAL drivers in the distributed release artifacts, yes. Any developers building from source will need to opt-in and enable the features that they want though, and that's what I'd like to batch with other breaking changes. Release notes (if we finally write them this time 😅) would then call that out as something developers updating would want to be aware of.
> I am going to take a look at CMake presets. Can you maybe explain what "Because these can interact badly with tools" from the README refers to?
With VSCode in particular, see https://github.com/microsoft/vscode-cmake-tools/blob/main/docs/cmake-presets.md#enable-cmakepresetsjson-in-the-cmake-tools-extension. The default behavior when a `CMakePresets.json` file exists is to use that preset file. Developers can opt out and change the `cmake.useCMakePresets` setting from auto/always to never, but we want good default settings out of the box. It is tricky to define presets for this sort of project that are generally applicable to all developers, so we've been concerned that having them used by default would be confusing or artificially limiting. See also https://github.com/iree-org/iree/issues/5804#issuecomment-1502080408
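For example, a developer who wants to keep the old behavior can put this in their VS Code settings.json:

```json
{
  "cmake.useCMakePresets": "never"
}
```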
Hmm yeah you have a point about the release notes. When is that release planned?
Regarding the presets, I think we converged to a minimal build without most backends, didn't we? They have things like conditionals in newer versions. I will try them out to see how that works and give feedback hopefully at the end of this week.
> Hmm yeah you have a point about the release notes. When is that release planned?
We've been trying to promote a nightly release to stable around once a month, and the last release was ~1 month ago, so soon. I'm expecting that we'll try to formalize the process more over the coming weeks/months.
> Regarding the presets, I think we converged to a minimal build without most backends, didn't we? They have things like conditionals in newer versions. I will try them out to see how that works and give feedback hopefully at the end of this week.
Yep. We have a few presets like this "minimal" one already: https://github.com/iree-org/iree/blob/0247962ff4bfe203fa8c727067b8bbb56a4a2055/build_tools/cmake/presets/options.json#L11-L15
Any feedback or improvements on the setup there would be appreciated.
@ScottTodd So I played around with presets a bit and I think the ergonomics from the command line are OK. However, I feel like the behavior with the CMake extension is a bit weird. I am not using that extension for other reasons (no color output, automatic configuration, etc.), so I am probably the wrong person to ask here. Also, the conditionals depend on env variables, which wouldn't be better than what we have now...
It feels like we would end up with a Python script to create user presets, and I am not sure that would really be better than just using something like Python to begin with...
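For illustration, this is roughly the shape of the env-var-gated preset I tried (preset name and variable are hypothetical):

```json
{
  "version": 3,
  "configurePresets": [
    {
      "name": "cuda",
      "condition": {
        "type": "notEquals",
        "lhs": "$env{IREE_ENABLE_CUDA}",
        "rhs": ""
      },
      "cacheVariables": {
        "IREE_TARGET_BACKEND_CUDA": "ON",
        "IREE_HAL_DRIVER_CUDA": "ON"
      }
    }
  ]
}
```

You still have to export `IREE_ENABLE_CUDA` before running `cmake --preset cuda`, so the env variable plumbing does not really go away.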
@ScottTodd I am in the process of updating the PR. I just wanted to ping you because the CUDA tests, for example, are marked as passed even though they didn't run... Is there a way to mark them as required?
> I am in the process of updating the PR. I just wanted to ping you because the CUDA tests, for example, are marked as passed even though they didn't run... Is there a way to mark them as required?
Are you referring to these? https://github.com/iree-org/iree/actions/runs/10562806742/job/29262093188?pr=17996#step:11:75
```
    Start 133: iree/tests/e2e/linalg/check_winograd_vulkan-spirv_vulkan_conv2d.mlir
21/403 Test #133: iree/tests/e2e/linalg/check_winograd_vulkan-spirv_vulkan_conv2d.mli ............   Passed    0.58 sec
    Start 134: iree/tests/e2e/linalg/check_large_linalg_matmul_cuda_conv2d.mlir
22/403 Test #134: iree/tests/e2e/linalg/check_large_linalg_matmul_cuda_conv2d.mli ................***Not Run (Disabled)   0.00 sec
    Start 135: iree/tests/e2e/linalg/check_large_linalg_matmul_cuda_fp_to_subbyte.mlir
23/403 Test #135: iree/tests/e2e/linalg/check_large_linalg_matmul_cuda_fp_to_subbyte.mli .........***Not Run (Disabled)   0.00 sec
```
The logic for that is here: https://github.com/iree-org/iree/blob/7a7bfe1807085804863fad396ed0b2942f72467d/build_tools/cmake/iree_check_test.cmake#L113-L150
I set it up that way to show which tests could run, if the backends were enabled. One of my goals for https://github.com/iree-org/iree-test-suites is to make the tests explicit and obvious. That's hard when the tests are so tightly linked with CMake build settings (including cross-compilation). In the separate repo we can define pytest labels/marks/filters and CMake options unique to the test suites.
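Condensed, the mechanism is CTest's `DISABLED` property; a sketch of that logic (variable names hypothetical, see the linked file for the real thing):

```cmake
# Register the test even when its backend is compiled out, but mark it
# DISABLED so ctest reports "Not Run (Disabled)" instead of hiding it.
if(NOT _REQUIRED_BACKENDS_AVAILABLE)
  set_tests_properties(${_TEST_NAME} PROPERTIES DISABLED TRUE)
endif()
```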
@ScottTodd I think this PR is now ready. Enabling CUDA for AArch64 is now also just a change in an if-statement.
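Illustrative only (not the actual script code), the kind of toggle I mean:

```cmake
# Hypothetical: also enable CUDA when targeting AArch64 (e.g. Jetson boards).
if(CMAKE_SYSTEM_PROCESSOR MATCHES "(x86_64|AMD64|aarch64|arm64)")
  set(IREE_TARGET_BACKEND_CUDA ON)
endif()
```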
Thanks for the review! I am OOO until Monday and cannot update until then. Just FYI, because we also talked about the release this week.
@ScottTodd I will merge now, but it is also end of day in Germany. Please revert if this breaks something.