iree Run unit tests on AMDGPU CI

We have two main CI pipelines:

https://github.com/iree-org/iree/blob/main/.github/workflows/ci.yml
https://github.com/iree-org/iree/blob/main/.github/workflows/pkgci.yml

Pipeline overview

ci.yml

The main testing flows for hardware targets are:

build_all -> test_nvidia_a100, test_nvidia_gpu
build_all -> cross_compile_and_test [riscv_64, riscv_32, emscripten]
build_all -> build_and_test_android [moto edge, pixel 6 pro]

The test jobs cover these areas:

Building the runtime from source
Building iree-test-deps
- check test .vmfb files for tests/e2e/ like tosa_ops, stablehlo_ops
- matmul and convolution test suite .vmfb files
Running unit tests
- e.g. iree/hal/drivers/cuda/dynamic_symbols_test, iree/hal/drivers/cuda/cts/cuda_graph_driver_test
Running 'check' and matmul/conv tests
- e.g. iree/tests/e2e/linalg_ext_ops/check_vulkan-spirv_vulkan_winograd_input.mlir, iree/tests/e2e/matmul/e2e_matmul_cuda_f32_large_splitk_cuda_cuda

pkgci.yml

This runs a slightly different testing flow:

build_packages -> regression_test_amdgpu_vulkan

At the moment, all of these tests are written in Python and use only the packaged files: the iree-compile and iree-run-module binaries, not other C/C++ sources or unit test executables.

Tests cover these areas:

test_llama2.py and test_ukernel.py from https://github.com/iree-org/iree/tree/main/experimental/regression_suite/tests/pregenerated
Selections of ONNX op tests and PyTorch model tests from https://github.com/nod-ai/SHARK-TestSuite/tree/main/iree_tests
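
As a rough illustration of what "operating on just the package files" means, here is a minimal sketch of how a Python test could drive the two packaged binaries. The helper names, file names, and function name are hypothetical and are not taken from the regression suite. The sketch assumes iree-compile and iree-run-module are on PATH.

```python
import subprocess


def compile_command(source, output, target_backend):
    """Build an iree-compile invocation for one HAL target backend."""
    return [
        "iree-compile",
        source,
        f"--iree-hal-target-backends={target_backend}",
        "-o",
        output,
    ]


def run_command(module, device, function, inputs=()):
    """Build an iree-run-module invocation for a compiled .vmfb file."""
    cmd = [
        "iree-run-module",
        f"--module={module}",
        f"--device={device}",
        f"--function={function}",
    ]
    # Each input is passed as its own --input flag, e.g. "4xf32=1 2 3 4".
    cmd += [f"--input={value}" for value in inputs]
    return cmd


def compile_and_run(source, target_backend, device, function, inputs=()):
    """Compile then run, raising CalledProcessError if either tool fails."""
    # Hypothetical output naming: swap the source extension for .vmfb.
    module = source.rsplit(".", 1)[0] + ".vmfb"
    subprocess.run(compile_command(source, module, target_backend), check=True)
    subprocess.run(run_command(module, device, function, inputs), check=True)
    return module
```

A test would then call something like compile_and_run("model.mlir", "vulkan-spirv", "vulkan", "main") and fail if either tool exits with a nonzero status. Nothing here needs the C/C++ sources or the unit test executables.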

Expanding test coverage

Note that ci.yml, which runs the unit tests (like the HAL CTS) and 'check' tests (like the matmul test suite), does not run on AMD GPUs right now. The simplest fix is to add test jobs to ci.yml modeled on test_nvidia_gpu, such as test_amd_gpu_w7900, test_amd_gpu_mi250, etc. We could instead expand pkgci.yml, but the ci.yml setup makes more sense to expand.

A word of warning, though: we've been skating around machine availability and reliability issues with our self-hosted w7900 and mi250 runner(s). pkgci.yml failures do not block merging PRs, while ci.yml failures do. If we want to add tests to the core pipeline, the machines those tests run on need to be reliable at all hours of the day.

Apr 24 '24 17:04 ScottTodd

We can also include collectives (multi-GPU) tests on the AMDGPU machines. See this Discord discussion. The mi250 and w7900 runners both have 4 GPUs.

Enabling multi-GPU testing once we have the baseline jobs might just be a matter of setting the IREE_MULTI_DEVICE_TESTS_DISABLE=0 environment variable at first. Later, though, we may want per-target filters or more fine-grained labels like requires-multiple-rocm-devices.
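
To make the trade-off between one global switch and per-target labels concrete, here is a minimal sketch of label-based filtering. It is not the actual test runner logic. It assumes that tests carry string labels, that a generic requires-multiple-devices label is what the variable gates, and that multi-device tests are skipped unless the variable is set to 0. The function names are hypothetical.

```python
import os

# Assumption: each label is gated by one environment variable, and a test
# carrying the label is skipped unless that variable is set to "0".
DISABLE_VARS = {
    "requires-multiple-devices": "IREE_MULTI_DEVICE_TESTS_DISABLE",
}


def excluded_labels(environ=None, disable_vars=DISABLE_VARS):
    """Return the sorted labels whose tests should be skipped."""
    environ = os.environ if environ is None else environ
    return sorted(
        label
        for label, var in disable_vars.items()
        # Unset is treated as "1" (disabled); this default is an assumption.
        if environ.get(var, "1") != "0"
    )


def should_run(test_labels, environ=None, disable_vars=DISABLE_VARS):
    """Return True if none of the test's labels are currently excluded."""
    blocked = set(excluded_labels(environ, disable_vars))
    return blocked.isdisjoint(test_labels)
```

With this shape, a per-target filter is just another entry in the mapping, for example pairing requires-multiple-rocm-devices with a second variable (that variable would be new and does not exist today). That would let the AMDGPU runners opt in without enabling multi-device tests for every other target.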

cc @sogartar

May 02 '24 17:05 ScottTodd

This is sort of done and stable. Still needs a dedicated owner, though. I'm auditing some of our test suites now and finding areas that are only tested on CPU/Vulkan/CUDA and not ROCm (or Metal, or WebGPU, ...).

Jun 27 '24 23:06 ScottTodd