
[SYCL][CUDA] accessor_api_image CTS test is failing

Open againull opened this issue 4 years ago • 7 comments

Build compiler
git clone https://github.com/intel/llvm
Hash: b00fb7c

Includes: #1990, #1977

python /localdisk2/ws/againull/sycl/llvm/buildbot/configure.py --cuda -o /localdisk2/ws/againull/sycl/build
python /localdisk2/ws/againull/sycl/llvm/buildbot/compile.py -o /localdisk2/ws/againull/sycl/build

Build accessor CTS tests
git clone https://github.com/KhronosGroup/SYCL-CTS.git
Hash: 9cbe1a719b25c269ef78a2ee08f2e5ed12a1cc6d

Applied: KhronosGroup/SYCL-CTS#52

cmake -G Ninja -DCMAKE_CXX_COMPILER=clang++ -DCMAKE_C_COMPILER=clang -DINTEL_SYCL_ROOT=/localdisk2/ws/againull/sycl/build -DINTEL_SYCL_TRIPLE=nvptx64-nvidia-cuda-sycldevice -DSYCL_IMPLEMENTATION=Intel_SYCL -DSYCL_CTS_ENABLE_OPENCL_INTEROP_TESTS=Off -DSYCL_CTS_ENABLE_DOUBLE_TESTS=On -DSYCL_CTS_ENABLE_HALF_TESTS=On -DINTEL_SYCL_FLAGS="-Xsycl-target-backend;--cuda-gpu-arch=sm_50" -DOpenCL_INCLUDE_DIR=/localdisk2/ws/againull/sycl/build/include/sycl -DOpenCL_LIBRARY=/localdisk2/ws/againull/sycl/build/lib/libOpenCL.so ..

ninja test_accessor -j 12

Run accessor_api_image CTS test =>
./bin/test_accessor -p nvidia -d opencl_gpu --test accessor_api_image
--- accessor_api_image
. accessor<vec<int32_t, 4>, 1, mode{1024}, target{2017}>
. Checking get_range

PI CUDA ERROR:
  Value: 500
  Name: CUDA_ERROR_NOT_FOUND
  Description: named symbol not found
  Function: build_program
  Source Location: /iusers/againull/sycl/llvm/sycl/plugins/cuda/pi_cuda.cpp:468

PI CUDA ERROR:
  Value: 400
  Name: CUDA_ERROR_INVALID_HANDLE
  Description: invalid resource handle
  Function: cuda_piProgramRelease
  Source Location: /iusers/againull/sycl/llvm/sycl/plugins/cuda/pi_cuda.cpp:2807

. sycl exception caught
. what - The program was built for 1 devices
Build program log for 'GeForce GTX 1060 6GB':
-999 (Unknown OpenCL error code)
. line: 63
. a SYCL exception was caught: The program was built for 1 devices
Build program log for 'GeForce GTX 1060 6GB':
-999 (Unknown OpenCL error code)

  • fail

--- accessor_api_image_fp16
. Device does not support half precision floating point operations

  • pass

. Passed 1/14 tests (7%)

againull avatar Jul 15 '20 07:07 againull

@againull, @pvchupin, I think we figured out that this was caused by a regression in the driver. Can we close this one?

bader avatar Jul 22 '20 09:07 bader

@bader Yes. The test passes with the 435.21 NVIDIA driver when https://github.com/KhronosGroup/SYCL-CTS/pull/52 is applied. Do you know when this PR is going to be merged?

againull avatar Jul 22 '20 16:07 againull

It looks like the issue is still reproducible on the latest driver, 450.102.04. Let's reopen it, at least for tracking purposes.

pvchupin avatar Mar 23 '21 22:03 pvchupin

The 500 CUDA_ERROR_NOT_FOUND error (which becomes 801 CUDA_ERROR_NOT_SUPPORTED with CUDA toolkit 11.3 and above) is caused by the suq.depth PTX instruction used for 3D sampled reads at

https://github.com/intel/llvm/blob/90c8f0543a38adeda75ad2eca7e999a36a1f2697/libclc/ptx-nvidiacl/libspirv/images/image.cl#L151
https://github.com/intel/llvm/blob/90c8f0543a38adeda75ad2eca7e999a36a1f2697/libclc/ptx-nvidiacl/libspirv/images/image.cl#L158

NVIDIA says that this error is expected, since these PTX instructions work only when they are used within the 'OpenCL driver'. Unfortunately, they added that there is no way to use this instruction with CUDA. They promised to update the PTX documentation to make this clear. If suq.depth is present in the fatbin, it produces the aforementioned error even if it is never actually executed. For this reason, it has been removed with https://github.com/intel/llvm/pull/5378.

Another error emerged once the one above was out of the way: a 700 CUDA_ERROR_ILLEGAL_ADDRESS, which appeared in sampled reads with linear filtering. This has been fixed with https://github.com/intel/llvm/pull/5204.
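For readers following the logs, here is a small lookup of the numeric codes quoted in this thread. It is only a sketch: cuda_error_name is a hypothetical helper and not part of the PI CUDA plugin, and it covers only the four values mentioned here, whose names match the CUresult enumerators of the CUDA driver API.

```cpp
#include <string>

// Hypothetical helper (not from pi_cuda.cpp): maps the numeric error values
// quoted in this thread to their CUresult names.
inline std::string cuda_error_name(int code) {
  switch (code) {
  case 400: return "CUDA_ERROR_INVALID_HANDLE";   // seen in cuda_piProgramRelease
  case 500: return "CUDA_ERROR_NOT_FOUND";        // seen in build_program
  case 700: return "CUDA_ERROR_ILLEGAL_ADDRESS";  // sampled reads, linear filtering
  case 801: return "CUDA_ERROR_NOT_SUPPORTED";    // replaces 500 with toolkit 11.3+
  default:  return "unknown (not covered by this sketch)";
  }
}
```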

Unfortunately, the previous two errors are just the tip of the iceberg. Passing the accessor_api_image_core test requires support for 8-bit (u)int channels and their related conversion functions, which are completely missing right now: per the specification, (u)int accessors can read 32-, 16-, and 8-bit (u)int channels. Further details can be found in Sections 6.12.14 and 8.3 of the OpenCL 1.2 Specification.
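To illustrate the kind of conversion that is missing, here is a rough host-side sketch of widening one 4-channel 8-bit texel to the 32-bit vector a (u)int accessor would return. The function names are hypothetical and this is not libclc code; it only assumes that signed channels are sign-extended and unsigned channels are zero-extended.

```cpp
#include <array>
#include <cstdint>

// Hypothetical helper: widen a signed 8-bit texel to int32 (sign extension).
inline std::array<std::int32_t, 4>
widen_signed8(const std::array<std::int8_t, 4> &texel) {
  return {texel[0], texel[1], texel[2], texel[3]};
}

// Hypothetical helper: widen an unsigned 8-bit texel to uint32 (zero extension).
inline std::array<std::uint32_t, 4>
widen_unsigned8(const std::array<std::uint8_t, 4> &texel) {
  return {texel[0], texel[1], texel[2], texel[3]};
}
```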

In order to let this test pass, the image support has been marked as experimental and deactivated by default with https://github.com/intel/llvm/pull/5204.

In summary:

  • the only supported types are (u)int (32-bit), float and half,
  • 8-bit channels and the related conversion functions are missing,
  • writes and non-sampled reads are supported,
  • 1D/2D sampled reads are supported,
  • 3D sampled reads do not work because of suq.depth.
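The summary above can be written down as a small predicate. This is only an illustrative sketch: the enum and function names are made up for this example and do not exist in the PI CUDA plugin, and only the channel types named in the summary are modelled.

```cpp
// Hypothetical names, for illustration only.
enum class ImgChannel { Int32, UInt32, Float, Half, Int8, UInt8 };
enum class ImgAccess { Write, Read, SampledRead };

// Only 32-bit (u)int, float and half channels are supported.
inline bool channel_supported(ImgChannel c) {
  switch (c) {
  case ImgChannel::Int32:
  case ImgChannel::UInt32:
  case ImgChannel::Float:
  case ImgChannel::Half:
    return true;
  default:
    return false; // 8-bit channels: conversion functions are missing
  }
}

// Writes and non-sampled reads work; sampled reads work for 1D/2D only.
inline bool image_op_supported(ImgChannel c, ImgAccess a, int dims) {
  if (!channel_supported(c) || dims < 1 || dims > 3)
    return false;
  if (a == ImgAccess::SampledRead && dims == 3)
    return false; // would need suq.depth, which CUDA rejects
  return true;
}
```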

pgorlani avatar Jan 27 '22 09:01 pgorlani

@pgorlani, thanks a lot for the nice and detailed summary!

pvchupin avatar Jan 28 '22 00:01 pvchupin

A recent version of test_all (and test_accessor_legacy) aborts on NVIDIA GPUs.

pi_die: PI CUDA kernels only support images with channel types int32, uint32, float, and half.
terminate called without an active exception
-------------------------------------------------------------------------------
accessor_api_image_core
-------------------------------------------------------------------------------
SYCL-CTS/tests/accessor_legacy/../common/../../util/proxy.h:35
...............................................................................

SYCL-CTS/tests/accessor_legacy/../common/../../util/proxy.h:35: FAILED:
due to a fatal error condition:
  SIGABRT - Abort (abnormal termination) signal

Can we change pi_die to an exception? pi_die aborts the execution of the whole test suite, whereas an exception will just fail a single test.
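A minimal sketch of the difference, assuming a toy test runner: every name here (ChannelKind, require_supported, run_all) is hypothetical and unrelated to the real PI or SYCL-CTS code. The check throws instead of aborting, so the runner can record one failure and move on to the next test.

```cpp
#include <functional>
#include <stdexcept>
#include <vector>

// Hypothetical channel kinds, for illustration only.
enum class ChannelKind { Int32, UInt32, Float, Half, Int8 };

// Throws on an unsupported channel type instead of calling abort().
inline void require_supported(ChannelKind k) {
  switch (k) {
  case ChannelKind::Int32:
  case ChannelKind::UInt32:
  case ChannelKind::Float:
  case ChannelKind::Half:
    return;
  default:
    throw std::runtime_error(
        "only int32, uint32, float and half channel types are supported");
  }
}

struct RunResult {
  int passed = 0;
  int failed = 0;
};

// Toy runner: an exception fails a single test, the remaining tests still run.
inline RunResult run_all(const std::vector<std::function<void()>> &tests) {
  RunResult r;
  for (const auto &test : tests) {
    try {
      test();
      ++r.passed;
    } catch (const std::exception &) {
      ++r.failed;
    }
  }
  return r;
}
```

With an abort in place of the throw, the first unsupported image would terminate the process and no later test would run.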

bader avatar Aug 01 '22 18:08 bader

Hi @bader, here is https://github.com/intel/llvm/pull/6521, it should fit our needs.

pgorlani avatar Aug 03 '22 13:08 pgorlani

This seems to be addressed by the change to an error report rather than exit/die so closing.

rodburns avatar Jun 23 '23 10:06 rodburns