llvm icon indicating copy to clipboard operation
llvm copied to clipboard

[SYCL] Add invalid kernel handling

Open KrystianChmielewski opened this issue 3 years ago • 12 comments

This commit adds support for handling invalid kernels.

Sometimes kernel cannot be generated for specific platform, when trying to append launch of such faulty kernel exception is now thrown - informing why used kernel is invalid.

Signed-off-by: Krystian Chmielewski [email protected]

KrystianChmielewski avatar Aug 29 '22 16:08 KrystianChmielewski

BTW. It would be nice to have a test.

romanovvlad avatar Aug 30 '22 07:08 romanovvlad

This commit adds support for handling invalid kernels.

Sometimes kernel cannot be generated for specific platform, when trying to append launch of such faulty kernel exception is now thrown - informing why used kernel is invalid.

Signed-off-by: Krystian Chmielewski [email protected]

What is the motivating issue? Are you going to add an E2E test too?

smaslov-intel avatar Aug 30 '22 10:08 smaslov-intel

@smaslov-intel "What is the motivating issue? Are you going to add an E2E test too?"

Kernel's compiled from spirv using fp64 on platforms without fp64 support are poisoned (code is replaced) with printf. This is done to enable building module with fp64 kernels on platforms without fp64 support. This change aims to catch this and return appropriate error to user, when encountering poisoned kernel.

KrystianChmielewski avatar Aug 30 '22 11:08 KrystianChmielewski

Kernel's compiled from spirv using fp64 on platforms without fp64 support are poisoned (code is replaced) with printf. This is done to enable building module with fp64 kernels on platforms without fp64 support.

Why does a platform need the support fp64 to just build kernel, not to run it?

This change aims to catch this and return appropriate error to user, when encountering poisoned kernel.

Are you saying that the regexp you are trying to match is something that is printed with that printf? If so, it looks like a hack to me. Isn't there a more robust solution to the issue?

smaslov-intel avatar Aug 30 '22 11:08 smaslov-intel

Why does a platform need the support fp64 to just build kernel, not to run it?

That's how intel-graphics-compiler works. The problem is if user tries to run this kernel. It would be much better to get an error instead of executing said kernel and getting information from result of kernel's printf.

Are you saying that the regexp you are trying to match is something that is printed with that printf? If so, it looks like a hack to me. Isn't there a more robust solution to the issue?

No. Compiler appends attribute to the kernel "invalid_kernel(reason)". Regexp that I am matching is one of kernel's attributes

KrystianChmielewski avatar Aug 30 '22 11:08 KrystianChmielewski

That's how intel-graphics-compiler works. The problem is if user tries to run this kernel. It would be much better to get an error instead of executing said kernel and getting information from result of kernel's printf.

This is for AOT compilation, right? Wouldn't it be much more friendly to give an error at AOT compilation-time (and request JIT compilation) rather than deferring to run-time? Mis-compiling the kernel (which printf is essentially doing) and then trying to understand/report this at run-time looks weird at least.

Compiler appends attribute to the kernel "invalid_kernel(reason)". Regexp that I am matching is one of kernel's attributes

Is it possible for user to insert "attribute" in source code and then have a false positive diagnostic in case of a coincidental regexp match?

smaslov-intel avatar Aug 31 '22 13:08 smaslov-intel

Wouldn't it be much more friendly to give an error at AOT compilation-time (and request JIT compilation) rather than deferring to run-time?

This is a must to allow same build DL frameworks for PVC/ATS-M. The requirements for it to work are:

  1. Build time: no failures should be reported. - do not give error during AOT compilation
  2. Run time: execution on pvc/ats-m should report error in case application uses fp64. In AOT case 'assert' should be injected. This is done by passing "invalid_kernel" attribute.

Is it possible for user to insert "attribute" in source code and then have a false positive diagnostic in case of a coincidental regexp match?

It's impossible to have false positive. Regexp only matches attribute "invalid_kernel(reason)"

KrystianChmielewski avatar Aug 31 '22 14:08 KrystianChmielewski

Wouldn't it be much more friendly to give an error at AOT compilation-time (and request JIT compilation) rather than deferring to run-time?

This is a must to allow same build DL frameworks for PVC/ATS-M. The requirements for it to work are:

  1. Build time: no failures should be reported. - do not give error during AOT compilation
  2. Run time: execution on pvc/ats-m should report error in case application uses fp64. In AOT case 'assert' should be injected. This is done by passing "invalid_kernel" attribute.

What is the benefit of pretending we have the right kernel at build-time (where we already know it is not going to work) and report an error at run-time?

Is it possible for user to insert "attribute" in source code and then have a false positive diagnostic in case of a coincidental regexp match?

It's impossible to have false positive. Regexp only matches attribute "invalid_kernel(reason)"

What if something else (user) injects the matching attribute?

smaslov-intel avatar Aug 31 '22 15:08 smaslov-intel

What is the benefit of pretending we have the right kernel at build-time (where we already know it is not going to work) and report an error at run-time?

We can use single spirv with multiple kernels (including those using fp64) and compile it to native binary for every device.

What if something else (user) injects the matching attribute?

If user injects "invalid_kernel(reason)" attribute then PI_ERROR_PLUGIN_SPECIFIC_ERROR error with message description "invalid_kernel(reason)" will be returned as expected.

KrystianChmielewski avatar Sep 01 '22 08:09 KrystianChmielewski

We can use single spirv with multiple kernels (including those using fp64) and compile it to native binary for every device.

But you said just earlier that you can't compile fp64 code correctly if host devices doesn't support fp64. So, what is the purpose of not giving an error in compile time if the fp64 kernel is miscompiled? Maybe, you have a design doc explaining the approach in details?

If user injects "invalid_kernel(reason)" attribute then PI_ERROR_PLUGIN_SPECIFIC_ERROR error with message description "invalid_kernel(reason)" will be returned as expected.

Why is this "as expected"? I think users can expect completely something different, and may not be happy with RT reporting error in this case. There is no compiler reserved attributes that app can't use, are there?

smaslov-intel avatar Sep 01 '22 13:09 smaslov-intel

Maybe, you have a design doc explaining the approach in details?

Please look into jira XDEPS-4268.

Why is this "as expected"? I think users can expect completely something different, and may not be happy with RT reporting error in this case. There is no compiler reserved attributes that app can't use, are there?

If user adds attribute "reqd_work_group_size(X,Y,Z)" then it's expected for work group size to be (X, Y, Z) (any other work group size on setGroupSize call will result in an error). If user adds attribute "invalid_kernel(reason)" then it's expected to return an error on appendLaunchKernel

KrystianChmielewski avatar Sep 02 '22 08:09 KrystianChmielewski

If user adds attribute "reqd_work_group_size(X,Y,Z)" then it's expected for work group size to be (X, Y, Z) (any other work group size on setGroupSize call will result in an error). If user adds attribute "invalid_kernel(reason)" then it's expected to return an error on appendLaunchKernel

Where is this behavior documented so that users know that?

smaslov-intel avatar Sep 14 '22 16:09 smaslov-intel

If user adds attribute "reqd_work_group_size(X,Y,Z)" then it's expected for work group size to be (X, Y, Z) (any other work group size on setGroupSize call will result in an error). If user adds attribute "invalid_kernel(reason)" then it's expected to return an error on appendLaunchKernel

Where is this behavior documented so that users know that?

@KrystianChmielewski : can you help understand this?

smaslov-intel avatar Sep 27 '22 19:09 smaslov-intel

@smaslov-intel where should it be documented?

KrystianChmielewski avatar Sep 28 '22 08:09 KrystianChmielewski