Liger-Kernel
[ROCm]: State of Liger Kernel CI on AMD on ROCm 6.3
🐛 Describe the bug
State of Liger Kernel CI on MI300X on ROCm 6.3
Environment:
python3 -m pip list | grep triton
pytorch-triton-rocm 3.3.0+git96316ce5
triton 3.2.0
python3 -m pip list | grep torch
pytorch-triton-rocm 3.3.0+git96316ce5
torch 2.8.0.dev20250321+rocm6.3
- `make test`: the failing case is `test/transformers/test_jsd.py::test_correctness_with_beta[0.9-dtype1-1e-08-1e-06-2-1024-3200]`. It seems that if the `atol` of this test is relaxed to `1e-07` on AMD, the test passes.
- `make test-convergence`: the failing case is `test/convergence/fp32/test_mini_models_with_logits.py::test_mini_model[mini_qwen2_vl-32-0.0001-dtype3-1e-08-1e-05-0.005-1e-05-0.005-1e-05]`. It seems that if the `loss_atol` of this test is relaxed from `1e-8` to `1e-05` on AMD, the test passes.
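To illustrate why such a small change in `atol` can flip a result, here is a minimal sketch of the closeness criterion that `torch.allclose` documents, `|actual - expected| <= atol + rtol * |expected|`, written in plain Python so it runs without a GPU. The numeric values are hypothetical and are not taken from the failing tests.

```python
def within_tolerance(actual, expected, atol, rtol):
    """Closeness criterion: |actual - expected| <= atol + rtol * |expected|."""
    return abs(actual - expected) <= atol + rtol * abs(expected)


# Hypothetical values: an expected output of exactly zero with a tiny deviation.
# When expected is 0, the rtol term vanishes and atol alone decides the result.
expected_small = 0.0
actual_small = 5e-8

strict = within_tolerance(actual_small, expected_small, atol=1e-8, rtol=1e-6)
relaxed = within_tolerance(actual_small, expected_small, atol=1e-7, rtol=1e-6)

# The same absolute deviation around 1.0 is absorbed by the rtol term instead.
near_one = within_tolerance(1.0 + 5e-8, 1.0, atol=1e-8, rtol=1e-6)

print(strict, relaxed, near_one)
```

Under these assumptions, only elements whose reference value is close to zero are sensitive to `atol`, which would be consistent with a failure that disappears after a one-order-of-magnitude relaxation.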
Follow up
I would like to get some opinions on:
- Move the AMD CI unit tests to ROCm 6.3 only, since it is receiving more optimizations and bug fixes. The PyTorch "Get Started" page has also moved to ROCm 6.3 (https://pytorch.org/get-started/locally/).
- Relax the tolerances for these test cases, provided the relaxed values are still acceptable.
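One possible shape for the second option is to relax the tolerance only on ROCm and keep the stricter value elsewhere. The sketch below is an assumption about how that could look, not how the Liger-Kernel tests are written: `select_atol` is a hypothetical helper, and the HIP version is passed in as a plain argument so the snippet runs without PyTorch installed. In a real test it could come from `torch.version.hip`, which is `None` on non-ROCm builds.

```python
def select_atol(default_atol, rocm_atol, hip_version):
    """Hypothetical helper: return the relaxed tolerance only on ROCm builds.

    hip_version mirrors torch.version.hip: a version string on ROCm, else None.
    """
    return rocm_atol if hip_version is not None else default_atol


# Values from this report: loss_atol 1e-8 by default, 1e-5 proposed for AMD.
atol_on_rocm = select_atol(1e-8, 1e-5, hip_version="6.3.42131-fa1d09cbd")
atol_elsewhere = select_atol(1e-8, 1e-5, hip_version=None)

print(atol_on_rocm, atol_elsewhere)
```

Keeping the default untouched means other backends would continue to be tested against the original, stricter tolerance.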
Reproduce
No response
Versions
Environment Report:
-------------------
Operating System: Linux-5.15.0-116-generic-x86_64-with-glibc2.35
Python version: 3.10.12
Liger Kernel version: 0.5.5
PyTorch version: 2.8.0.dev20250321+rocm6.3
CUDA version: None
HIP(ROCm) version: 6.3.42131-fa1d09cbd
Triton version: 3.3.0
Transformers version: 4.49.0
XPU version: XPU Not Available