
[ROCm]: State of Liger Kernel CI on AMD on ROCm 6.3

Open tjtanaa opened this issue 8 months ago • 0 comments

🐛 Describe the bug

State of Liger Kernel CI on MI300X on ROCm 6.3

Environment:

  • python3 -m pip list | grep triton
    pytorch-triton-rocm               3.3.0+git96316ce5
    triton                            3.2.0
  • python3 -m pip list | grep torch
    pytorch-triton-rocm               3.3.0+git96316ce5
    torch                             2.8.0.dev20250321+rocm6.3

Failures:

  1. make test: the failing case is test/transformers/test_jsd.py::test_correctness_with_beta[0.9-dtype1-1e-08-1e-06-2-1024-3200]. If the atol of this test is relaxed to 1e-07 on AMD, the test passes.

  2. make test-convergence: the failing case is test/convergence/fp32/test_mini_models_with_logits.py::test_mini_model[mini_qwen2_vl-32-0.0001-dtype3-1e-08-1e-05-0.005-1e-05-0.005-1e-05]. If the loss_atol of this test is relaxed from 1e-08 to 1e-05 on AMD, the test passes.

Follow up

I would like to get some opinions on:

  1. Moving the AMD CI unit tests to ROCm 6.3 only, since it is receiving more optimizations and bug fixes. The PyTorch get-started page has also moved to ROCm 6.3 (https://pytorch.org/get-started/locally/).
  2. Relaxing the tolerances for those test cases, provided the relaxed values are still acceptable.
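
To make the second proposal concrete, here is a minimal sketch of how a test could pick a relaxed tolerance only on ROCm builds. The helper names select_atol and allowed_error are hypothetical and not part of Liger Kernel. The sketch assumes that the 1e-08 in the JSD test ID is the current atol and that the comparison follows the usual allclose rule, |actual - expected| <= atol + rtol * |expected|. In a real test the hip_version argument would typically come from torch.version.hip, which is None on non-ROCm builds of PyTorch.

```python
def allowed_error(expected: float, atol: float, rtol: float) -> float:
    """Largest |actual - expected| accepted by an allclose-style check."""
    return atol + rtol * abs(expected)


def select_atol(default_atol: float, rocm_atol: float, hip_version) -> float:
    """Return the relaxed tolerance only when a HIP (ROCm) build is detected.

    hip_version is passed in explicitly so the sketch runs without PyTorch;
    a test would typically pass torch.version.hip here (assumption).
    """
    return rocm_atol if hip_version is not None else default_atol


# Values taken from the JSD failure in this issue: atol 1e-08, relaxed to 1e-07 on AMD.
atol_default = select_atol(1e-08, 1e-07, hip_version=None)
atol_rocm = select_atol(1e-08, 1e-07, hip_version="6.3.42131-fa1d09cbd")

print("default atol:", atol_default)
print("ROCm atol:", atol_rocm)
```

Keeping the default tolerance on other backends means the relaxation only affects the AMD CI, so it would not mask regressions elsewhere.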

Reproduce

No response

Versions

Environment Report:
-------------------
Operating System: Linux-5.15.0-116-generic-x86_64-with-glibc2.35
Python version: 3.10.12
Liger Kernel version: 0.5.5
PyTorch version: 2.8.0.dev20250321+rocm6.3
CUDA version: None
HIP(ROCm) version: 6.3.42131-fa1d09cbd
Triton version: 3.3.0
Transformers version: 4.49.0
XPU version: XPU Not Available

tjtanaa, Mar 22 '25 10:03