Run unit tests on AMDGPU CI
We have two main CI pipelines:
- https://github.com/iree-org/iree/blob/main/.github/workflows/ci.yml
- https://github.com/iree-org/iree/blob/main/.github/workflows/pkgci.yml
Pipeline overview
ci.yml
The main testing flows for hardware targets are:
- build_all -> test_nvidia_a100, test_nvidia_gpu
- build_all -> cross_compile_and_test[riscv_64, riscv_32, emscripten]
- build_all -> build_and_test_android[moto edge, pixel 6 pro]
The test jobs cover these areas:
- Building the runtime from source
- Building iree-test-deps
  - check test .vmfb files for tests/e2e/ like tosa_ops, stablehlo_ops
  - matmul and convolution test suite .vmfb files
- Running unit tests
  - e.g. iree/hal/drivers/cuda/dynamic_symbols_test, iree/hal/drivers/cuda/cts/cuda_graph_driver_test
- Running 'check' and matmul/conv tests
  - e.g. iree/tests/e2e/linalg_ext_ops/check_vulkan-spirv_vulkan_winograd_input.mlir, iree/tests/e2e/matmul/e2e_matmul_cuda_f32_large_splitk_cuda_cuda
pkgci.yml
This runs a slightly different testing flow:
build_packages->regression_test_amdgpu_vulkan
At the moment, all of these tests are written in Python and use only the package files (the iree-compile and iree-run-module binaries, not other C/C++ sources or unit test executables).
Tests cover these areas:
- test_llama2.py and test_ukernel.py from https://github.com/iree-org/iree/tree/main/experimental/regression_suite/tests/pregenerated
- Selections of ONNX op tests and PyTorch model tests from https://github.com/nod-ai/SHARK-TestSuite/tree/main/iree_tests
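For anyone unfamiliar with the package-only style, here is a rough sketch of what such a test boils down to: compile with iree-compile, then run with iree-run-module. This is not code from the regression suite. The helper names, the simple_abs.mlir file, and the abs function are hypothetical, and the vulkan-spirv / vulkan pairing is just one example target.

```python
import shutil
import subprocess
from pathlib import Path


def compile_command(source: Path, vmfb: Path, backend: str) -> list[str]:
    """Build an iree-compile invocation for one target backend."""
    return [
        "iree-compile",
        str(source),
        f"--iree-hal-target-backends={backend}",
        "-o",
        str(vmfb),
    ]


def run_command(vmfb: Path, device: str, function: str, inputs: list[str]) -> list[str]:
    """Build an iree-run-module invocation for a compiled module."""
    cmd = [
        "iree-run-module",
        f"--module={vmfb}",
        f"--device={device}",
        f"--function={function}",
    ]
    # One --input flag per function argument, e.g. "f32=-2".
    cmd += [f"--input={value}" for value in inputs]
    return cmd


def run_package_test(source: Path, backend: str, device: str,
                     function: str, inputs: list[str]) -> str:
    """Compile and run using only the installed package binaries.

    Returns the stdout of iree-run-module; raises if either tool is
    missing or exits with a non-zero status.
    """
    for tool in ("iree-compile", "iree-run-module"):
        if shutil.which(tool) is None:
            raise RuntimeError(f"{tool} not found on PATH")
    vmfb = source.with_suffix(".vmfb")
    subprocess.run(compile_command(source, vmfb, backend), check=True)
    result = subprocess.run(
        run_command(vmfb, device, function, inputs),
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout
```

A call might look like run_package_test(Path("simple_abs.mlir"), "vulkan-spirv", "vulkan", "abs", ["f32=-2"]). Nothing here touches the C/C++ unit test executables, which is exactly the gap described in the next section.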
Expanding test coverage
Note that ci.yml, where unit tests (like the HAL CTS) and 'check' tests (like the matmul test suite) are run, does not run on AMD GPUs right now. The simplest fix is to add test jobs modeled on test_nvidia_gpu (test_amd_gpu_w7900, test_amd_gpu_mi250, etc.) to ci.yml. We could instead expand pkgci.yml, but ci.yml is the more natural place since it already builds and runs these unit and 'check' tests.
A word of warning though: we've been skating around machine availability and reliability issues with our self-hosted w7900 and mi250 runner(s). Failures in pkgci.yml do not block merging PRs, while failures in ci.yml do. If we want to add tests to the core pipeline, the machines those tests run on need to be reliable at all hours of the day.
We can also include collectives (multi-gpu) tests on the AMDGPU machines. See this discord discussion. The mi250 and w7900 runners both have 4 GPUs.
Enabling multi-gpu testing once we have the baseline jobs might just be a matter of flipping the IREE_MULTI_DEVICE_TESTS_DISABLE=0 environment variable at first. We may instead want per-target filters or more fine-grained labels like requires-multiple-rocm-devices though.
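To make the two options concrete, here is a minimal sketch of the selection semantics: an environment variable toggle combined with a per-test label. It is not how the existing CI scripts are implemented. CiTest, select_tests, and the test names are hypothetical, and treating an unset variable as "disabled" is an assumption.

```python
import os
from dataclasses import dataclass, field
from typing import Mapping, Optional

# Label name taken from the suggestion above; no test carries it today.
MULTI_DEVICE_LABEL = "requires-multiple-rocm-devices"


@dataclass(frozen=True)
class CiTest:
    """Hypothetical stand-in for a test plus its labels."""
    name: str
    labels: frozenset[str] = field(default_factory=frozenset)


def multi_device_disabled(env: Mapping[str, str]) -> bool:
    # Assumption: multi-device tests stay off unless the variable is "0".
    return env.get("IREE_MULTI_DEVICE_TESTS_DISABLE", "1") != "0"


def select_tests(tests: list[CiTest],
                 env: Optional[Mapping[str, str]] = None) -> list[CiTest]:
    """Drop tests labeled as multi-device when the toggle disables them."""
    env = os.environ if env is None else env
    if multi_device_disabled(env):
        return [t for t in tests if MULTI_DEVICE_LABEL not in t.labels]
    return list(tests)


if __name__ == "__main__":
    tests = [
        CiTest("matmul_single_gpu"),
        CiTest("all_reduce_two_gpus", frozenset({MULTI_DEVICE_LABEL})),
    ]
    for test in select_tests(tests):
        print(test.name)
```

With a label per target, a runner that only has NVIDIA GPUs could keep skipping ROCm collectives tests even after the global toggle is flipped, which is the main argument for the finer-grained approach.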
cc @sogartar
This is sort of done and stable. Still needs a dedicated owner though. I'm auditing some of our test suites now and finding areas that are only tested on CPU/Vulkan/CUDA and not ROCm (or Metal, or WebGPU, ...)