KokkosBatched::SerialSVD::invoke(..) hang in Kokkos v4.6

Open • amk227 opened this issue 9 months ago • 5 comments

Hi,

I'm seeing a hang (infinite loop) in SerialSVD in Kokkos v4.6:

TEST(KokkosSerialSVD, does_not_solve2)
{
  Kokkos::View<double[3][6], Kokkos::HostSpace> A(Kokkos::ViewAllocateWithoutInitializing("A"));
  Kokkos::View<double[3][3], Kokkos::HostSpace> U(Kokkos::ViewAllocateWithoutInitializing("U"));
  Kokkos::View<double[6][6], Kokkos::HostSpace> V(Kokkos::ViewAllocateWithoutInitializing("V"));
  Kokkos::View<double[3], Kokkos::HostSpace> S(Kokkos::ViewAllocateWithoutInitializing("S"));
  Kokkos::View<double[30], Kokkos::HostSpace> work(Kokkos::ViewAllocateWithoutInitializing("work"));

  A(0, 0) = -2.3588494081694974e-03;
  A(0, 1) = -2.3602176428346553e-03;
  A(0, 2) = -3.3360574050870077e-03;
  A(0, 3) = -2.3589487578561312e-03;
  A(0, 4) = -3.3359167956075490e-03;
  A(0, 5) = -3.3378517656821728e-03;
  A(1, 0) = 3.3359168246290603e-03;
  A(1, 1) = 3.3378518006490351e-03;
  A(1, 3) = 3.3360573263032968e-03;
  A(2, 0) = -2.3588494081695022e-03;
  A(2, 1) = -2.3602176428346587e-03;
  A(2, 2) = 3.3360574050869769e-03;
  A(2, 3) = -2.3589487578561286e-03;
  A(2, 4) = 3.3359167956075399e-03;
  A(2, 5) = 3.3378517656821581e-03;

  KokkosBatched::SerialSVD::invoke(KokkosBatched::SVD_USV_Tag{}, A, U, S, V, work, 1e-12);
}

Compiler:

gcc-12.3.0

Thanks, -Alec

amk227, May 21 '25 20:05

Upon further testing, it looks like setting the tolerance to 1e-11 stops the infinite loop. I'm not sure why.

amk227, May 21 '25 20:05

This might be related to this issue filed a couple months ago:

https://github.com/kokkos/kokkos-kernels/issues/2557

if that helps.

amk227, May 21 '25 20:05

add-iteration-limit-to-SVD.patch

@lucbv This is high priority and is impacting our production cases, so I've added the attached patch to our Kokkos Kernels Spack setup for now. Can you prioritize getting a change like this into Kokkos Kernels directly? If we can provide a way to avoid an infinite loop, we can handle the error by relaxing the tolerance and/or perturbing or shuffling the input order.
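For illustration, here is a minimal sketch of the "relax the tolerance on failure" handling described above. It assumes only that the solver returns nonzero when it hits its iteration limit. The names `RetryResult` and `solve_with_relaxed_tolerance` are hypothetical, and `solve` is a stand-in for whatever call wraps the SVD. If the solver modifies its input, the callable is assumed to restore it before each attempt.

```cpp
#include <cassert>
#include <functional>

// Hypothetical summary of a retry loop (not part of Kokkos Kernels).
struct RetryResult {
  int status;        // 0 on success, nonzero if every attempt failed
  double tolerance;  // tolerance used by the last attempt
  int attempts;      // number of solver calls made
};

// Calls solve(tol). While it reports failure (nonzero), multiplies the
// tolerance by relax_factor and tries again, up to max_attempts calls.
inline RetryResult solve_with_relaxed_tolerance(
    const std::function<int(double)>& solve, double tol,
    double relax_factor = 10.0, int max_attempts = 4) {
  assert(tol > 0.0 && relax_factor > 1.0 && max_attempts > 0);
  RetryResult result{1, tol, 0};
  for (int attempt = 0; attempt < max_attempts; ++attempt) {
    result.tolerance = tol;
    result.status = solve(tol);
    ++result.attempts;
    if (result.status == 0) break;  // converged within the iteration limit
    tol *= relax_factor;            // loosen the tolerance and retry
  }
  return result;
}
```

A caller would pass a lambda that restores A from a saved copy and then runs the SVD with the given tolerance and an iteration limit. Perturbing or shuffling the input could be added as a further fallback once the attempts are exhausted.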

tvoskui, Jul 18 '25 15:07

Yes, I will review the patch and turn it into a PR, and I will add a unit test to make sure this does not creep back in.

lucbv, Jul 18 '25 19:07

Hi, we have another failing case:

TEST(RotateRows, BadSVDDefaultTol)
{
  Kokkos::View<double **, Kokkos::LayoutRight, Kokkos::HostSpace> A("A", 3, 6);
  Kokkos::View<double **, Kokkos::LayoutRight, Kokkos::HostSpace> U("U", 3, 3);
  Kokkos::View<double **, Kokkos::LayoutRight, Kokkos::HostSpace> V("V", 6, 6);
  Kokkos::View<double *, Kokkos::HostSpace> S("S", 3);
  Kokkos::View<double *, Kokkos::HostSpace> work("work", 30);
  Kokkos::View<double **, Kokkos::LayoutRight, Kokkos::HostSpace> A_scratch("A_scratch", 3, 6);

  A(0, 0) = -0.49992589104804802114;
  A(0, 1) = -0.50016956949997615212;
  A(0, 2) = 0.70697176137856687639;
  A(0, 3) = 0.70734658093545688118;
  A(0, 4) = -0.49990454757986246825;
  A(0, 5) = 0.70700198975105144061;
  A(1, 0) = 0.70700197530160280301;
  A(1, 1) = 0.70734658867317878883;
  A(1, 2) = 0.00000000000000052857;
  A(1, 3) = 0.00000000000000411362;
  A(1, 4) = 0.70697179107942709209;
  A(1, 5) = 0.00000000000000392977;
  A(2, 0) = -0.49992589104804807665;
  A(2, 1) = -0.50016956949997559700;
  A(2, 2) = -0.70697176137856798661;
  A(2, 3) = -0.70734658093545643709;
  A(2, 4) = -0.49990454757986169110;
  A(2, 5) = -0.70700198975105188470;

  const double tol = 1e-12;
  const int max_iters = 1000;
  const int svd_err = KokkosBatched::SerialSVD::invoke(
      KokkosBatched::SVD_USV_Tag{}, A, U, S, V, work, tol, max_iters);

  EXPECT_FALSE(svd_err == 0);
}

In this case it exits early after 1000 iterations (otherwise it loops forever). If we drop the tolerance by 4 orders of magnitude and increase max_iters by 4 orders of magnitude, we get the correct U matrix.
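As a generic illustration of the early-exit behavior (not the actual patch and not the SerialSVD internals), the sketch below caps a convergence loop and returns a nonzero code when the cap is hit. `iterate_with_limit` and `StagnatingStep` are hypothetical names. The residual floor of 5e-12 is an assumption chosen to mimic a loop that can never reach 1e-12, not a diagnosis of this case.

```cpp
#include <cassert>
#include <functional>

// Runs step() until the residual it returns is at or below tol, or until
// max_iters steps have been taken. Returns 0 on convergence and 1 if the
// limit is reached, so the caller can react instead of hanging.
inline int iterate_with_limit(const std::function<double()>& step, double tol,
                              int max_iters, int* iters_used = nullptr) {
  assert(max_iters > 0);
  int status = 1;
  int k = 0;
  while (k < max_iters) {
    const double residual = step();
    ++k;
    if (residual <= tol) {
      status = 0;
      break;
    }
  }
  if (iters_used != nullptr) *iters_used = k;
  return status;
}

// Stand-in for one sweep of an iterative method: the residual halves on each
// call but never drops below an assumed floor.
struct StagnatingStep {
  double residual = 1.0;
  double floor = 5e-12;
  double operator()() {
    residual *= 0.5;
    if (residual < floor) residual = floor;
    return residual;
  }
};
```

With this stand-in, a tolerance of 1e-12 runs to the cap and returns 1, while 1e-11 converges well before the cap and returns 0.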

amk227, Sep 04 '25 15:09