
HIP backend general issue

Open lucbv opened this issue 5 years ago • 7 comments

This issue is meant to centralize the issues and work involved in integrating the HIP backend in Kokkos-Kernels. Ideally, separate issues would be opened for specific technical problems and then referenced here, so that users and developers know what the known issues are and who is working on them.

lucbv avatar Sep 10 '20 23:09 lucbv

Here is a list of the current issues observed while building with the HIP backend:

  • [x] Kokkos_ArithTraits long double specialization, see issue #807, PR #809 and PR #844
  • [x] KokkosBatched Algo::Level3::Blocked::mb() is not defined, see issue #808 and PR #812
  • [x] Parallel Range (Kokkos core issue), some parameters need to be cast for template deduction, see Kokkos: issue #3386 and PR #3393
  • [x] unit-tests CMakeList needs to be edited to add logic for HIP testing and ETI, see issue #819 and PR #820, PR #841
  • [x] add logic in cm_generate_makefile to support HIP builds, see PR #818
  • [x] add logic in test_all_sandia to allow spot_check on caraway (AMD/HIP platform), see PR #842
  • [x] add CMake logic to disable unit-test categories selectively, see PR #822
  • [x] clean-up logic in code in the execution_space=Kokkos::Experimental::HIP path, see PR #828 and PR #840
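The template deduction item above can be illustrated with a minimal, self-contained sketch. This is not the Kokkos code touched by PR #3393: `range_length` and `count_rows` are hypothetical names, standing in for any range-style API whose two bounds share a single deduced template type. The sketch only shows why a cast can be needed when the bounds have different integer types.

```cpp
#include <cstddef>

// Hypothetical stand-in for a range-style API: both bounds share one
// deduced type T, so the two arguments must have exactly the same type.
template <typename T>
T range_length(T begin, T end) {
  return end - begin;
}

// Hypothetical caller that would naturally mix an int literal with a
// std::size_t upper bound.
inline std::size_t count_rows(std::size_t num_rows) {
  // range_length(0, num_rows);  // does not compile: T is deduced as both
  //                             // int and std::size_t
  // Casting the literal gives both arguments the same type, so deduction
  // succeeds.
  return range_length(static_cast<std::size_t>(0), num_rows);
}
```

Calling `count_rows(10)` goes through the cast path and returns the length of the range `[0, 10)`. Whether a given compiler accepts the uncast form depends on how the real API declares its parameters, which this sketch does not reproduce.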

Now that the ETI and tests are merged (or are about to be), we can make a list of what still needs to be done to get the backend fully functional.

HIP spot-check enabled tests

  • [x] BLAS
  • [x] batchedDLA
  • [ ] Sparse
  • [ ] Graph
  • [x] Common

HIP tests currently failing

Issues in batchedDLA

  1. batched_scalar_team_trsm_l_u_nt_n_dcomplex_dcomplex fails with a bunch of values == 0 which seems to indicate a memory issue with complex?
  2. batched_scalar_team_trsm_l_u_t_n_dcomplex_dcomplex aborts on Memory access fault by GPU
  3. batched_scalar_team_trsm_l_u_nt_n_dcomplex_double same as dcomplex_dcomplex version
  4. batched_scalar_team_trsm_l_u_t_n_dcomplex_double same as dcomplex_dcomplex version
  5. batched_scalar_teamvector_qr_with_columnpivoting_double aborts on Device::callbackQueue aborting with status: 0x29
  6. batched_scalar_teamvector_solve_utv_double aborts on Memory access fault by GPU
  7. batched_scalar_teamvector_solve_utv2_double aborts on Memory access fault by GPU after failing with values == 0
  8. batched_scalar_teamvector_utv_double aborts on Memory access fault by GPU

Issues in Graph (offset==int and offset==size_t fail in the same way)

  1. graph_graph_color_double_int_int_TestExecSpace aborts on Memory access fault by GPU
  2. graph_graph_color_distance2_double_int_int_TestExecSpace aborts on Memory access fault by GPU
  3. graph_graph_color_deterministic_double_int_int_TestExecSpace aborts on Device::callbackQueue aborting with status: 0x1016

Issues in Sparse (offset==int and offset==size_t fail in the same way)

  1. sparse_gauss_seidel_asymmetric_rank1_kokkos_complex_double_int_int_TestExecSpace aborts on Memory access fault by GPU, Note: same happens with rank2 and/or symmetric tests
  2. sparse_balloon_clustering_double_int_int_TestExecSpace aborts on Memory access fault by GPU, Note: happens randomly, so possibly related to a race condition?
  3. sparse_replaceSumIntoLonger_double_int_int_TestExecSpace fails with values == 0
  4. sparse_replaceSumIntoLonger_kokkos_complex_double_int_int_TestExecSpace aborts on Device::callbackQueue aborting with status: 0x1016
  5. sparse_replaceSumInto_kokkos_complex_double_int_int_TestExecSpace aborts on Device::callbackQueue aborting with status: 0x1016
  6. sparse_spgemm_jacobi_kokkos_complex_double_int_size_t_TestExecSpace aborts on Device::callbackQueue aborting with status: 0x29
  7. sparse_spmv_kokkos_complex_double_int_int_TestExecSpace aborts on Device::callbackQueue aborting with status: 0x1016
  8. sparse_spmv_mv_kokkos_complex_double_int_int_LayoutLeft_TestExecSpace aborts on Device::callbackQueue aborting with status: 0x1016

lucbv avatar Sep 12 '20 17:09 lucbv

@lucbv I'll add amd/caraway options for the testing scripts this week

ndellingwood avatar Sep 14 '20 21:09 ndellingwood

Thanks, I have shared my current configuration on the internal repo (see the Technical tips section on the homepage). One thing I still need to do is ask what extra flags Kokkos uses for AMD builds; currently I have removed all the warning/error flags, as Kokkos would not build otherwise.

lucbv avatar Sep 14 '20 22:09 lucbv

@lucbv I have a branch now that passes unit tests for CUDA, Serial and OpenMP, and will (hopefully) also work on HIP once the unit tests are built for it. The only things still hardcoded for CUDA are things involving cusparse, cublas, graphs and streams. There are a couple of places where __CUDA_ARCH__ is used, but that is still defined for HIP so it should be OK.
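One way to check which device-compilation macro a given toolchain exposes is a small probe like the sketch below. `compilation_pass` is a hypothetical helper, not part of Kokkos-Kernels. `__HIP_DEVICE_COMPILE__` is the macro HIP documents for its device pass; whether `__CUDA_ARCH__` is also defined under a particular ROCm version is an assumption this sketch does not verify.

```cpp
#include <string>

// Hypothetical helper: reports which compilation pass this translation unit
// is being built for, based on the macros visible to the preprocessor.
// With a plain host compiler neither macro is defined.
inline std::string compilation_pass() {
#if defined(__CUDA_ARCH__)
  return "device (__CUDA_ARCH__ defined)";
#elif defined(__HIP_DEVICE_COMPILE__)
  return "device (__HIP_DEVICE_COMPILE__ defined)";
#else
  return "host";
#endif
}
```

Built with an ordinary host compiler, this returns "host". To see the device-side answer, the same branches would have to be evaluated inside device code compiled by the HIP or CUDA toolchain.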

brian-kelley avatar Oct 09 '20 19:10 brian-kelley

@brian-kelley thanks for looking at this, I am still waiting on rocm/3.8.0 tests to move with the ETI/tests PR as I feel it might fix quite a few things. Hopefully I can get that done next week but I'm not sure. If your PR is ready feel free to put me as a reviewer, I will finish my review of the coarsening PR this weekend.

lucbv avatar Oct 10 '20 21:10 lucbv

Using the latest rocm LLVM compiler, the new list of failing tests is much shorter:

Graph

[ RUN ] hip.graph_graph_color_deterministic_double_int_int_TestExecSpace
:0:rocdevice.cpp :2325: 378970770383 us: Device::callbackQueue aborting with status: 0x1016
Aborted (core dumped)

[ RUN ] hip.graph_graph_color_double_int_size_t_TestExecSpace
:0:rocdevice.cpp :2325: 379268378835 us: Device::callbackQueue aborting with status: 0x1016
Aborted (core dumped)

Sparse

Some failures are related to complex atomics; updates in Kokkos Core should resolve these issues.

lucbv avatar Apr 28 '21 15:04 lucbv

More things are working now - with rocm 4.5 and MI100 (on Caraway) all tests pass except for structured SpMV (hip.sparse_spmv_struct_double_int_size_t_TestExecSpace).

brian-kelley avatar Jan 11 '22 23:01 brian-kelley

At this point we are testing HIP in our CI and everything is building correctly :)

lucbv avatar Dec 19 '22 20:12 lucbv