llvm icon indicating copy to clipboard operation
llvm copied to clipboard

[SYCL] Avoid infinite loop when kernel fails to compile with memory error

Open VerenaBeckham opened this issue 7 months ago • 4 comments

The runtime tries to build a kernel again if compilation fails. But if UR returns a memory error the attempt counter was not compared against the maximum number of attempts, so the compiler was continuously called and eventually the loop counter would have overflowed.

VerenaBeckham avatar Jun 10 '25 12:06 VerenaBeckham

@VerenaBeckham is it possible to add a test that would fail without your patch and work after your patch?

maarquitos14 avatar Jun 10 '25 14:06 maarquitos14

@VerenaBeckham is it possible to add a test that would fail without your patch and work after your patch?

I'm not sure. Even if you could write a test that allocates so much memory that it runs out, this test might still pass on a different architecture with more memory. Or do you know of a way to mock up this error?

VerenaBeckham avatar Jun 11 '25 10:06 VerenaBeckham

@VerenaBeckham is it possible to add a test that would fail without your patch and work after your patch?

I'm not sure. Even if you could write a test that allocates so much memory that it runs out, this test might still pass on a different architecture with more memory. Or do you know of a way to mock up this error?

I think a unit test with UrMock should do the trick for that: https://github.com/intel/llvm/blob/b437083ea7b827ea6798e3fcfae1fe0f926d1d6d/sycl/doc/developer/ContributeToDPCPP.md#dpc-headers-and-runtime-tests

You can find examples if you grep UrMock within sycl/unittests. Basically, UrMock allows you to redefine the behavior of any UR function for a given test, so you could redefine the function that is triggering the memory error to always return a memory error without even trying to allocate.

maarquitos14 avatar Jun 11 '25 13:06 maarquitos14

@VerenaBeckham is it possible to add a test that would fail without your patch and work after your patch?

I'm not sure. Even if you could write a test that allocates so much memory that it runs out, this test might still pass on a different architecture with more memory. Or do you know of a way to mock up this error?

I think a unit test with UrMock should do the trick for that: https://github.com/intel/llvm/blob/b437083ea7b827ea6798e3fcfae1fe0f926d1d6d/sycl/doc/developer/ContributeToDPCPP.md#dpc-headers-and-runtime-tests

You can find examples if you grep UrMock within sycl/unittests. Basically, UrMock allows you to redefine the behavior of any UR function for a given test, so you could redefine the function that is triggering the memory error to always return a memory error without even trying to allocate.

Thanks for the hint! I have added a test.

VerenaBeckham avatar Jun 11 '25 17:06 VerenaBeckham

If you're happy with this @maarquitos14 do you want to "approve the workflows" and kick off the testing?

VerenaBeckham avatar Jun 19 '25 09:06 VerenaBeckham

I'm sorry, I forgot to push my clang format fixes. 🤦 Fixed now.

VerenaBeckham avatar Jun 19 '25 09:06 VerenaBeckham