Hugh Delaney issues

Results: 37 issues of Hugh Delaney

[SYCL][CUDA] Enable CXX standard library funcs for CUDA backend

This PR allows CXX stdlib funcs to be used with the NVPTX backend. See https://github.com/intel/llvm/discussions/6379. The llvm-test-suite test is at https://github.com/intel/llvm-test-suite/pull/1112. It also adds the compiler flag `"-fbundle-no-offload-arch"`, which allows device code bundles to...

[SYCL][ext] Add always_inline attribute to `round_to_tf32`

This should have been defined with `always_inline` to avoid multiple symbol definitions in multi-object compilations.

[SYCL][ext] Add host impl for bf16 conversion

Instead of throwing an error, it would be convenient if bfloat16 conversions could be done on host as well as device. cc @JackAKirk
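As a rough illustration of what a host-side conversion involves, here is a minimal sketch in plain C++. The helper names `float_to_bf16_bits` and `bf16_bits_to_float` are hypothetical and are not the SYCL extension's API; the sketch assumes IEEE 754 binary32 `float`, round-to-nearest-even, and a canonical quiet NaN of `0x7FC0`.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

// Hypothetical helper (not the SYCL extension API): convert a float to the
// 16-bit bfloat16 bit pattern using round-to-nearest-even.
inline std::uint16_t float_to_bf16_bits(float f) {
    // NaN is handled separately so rounding cannot turn it into infinity.
    if (std::isnan(f)) return 0x7FC0;
    std::uint32_t bits;
    std::memcpy(&bits, &f, sizeof(bits));
    // bfloat16 keeps the top 16 bits of the float. Adding 0x7FFF plus the
    // lowest kept bit rounds halfway cases towards an even result.
    const std::uint32_t lsb = (bits >> 16) & 1u;
    bits += 0x7FFFu + lsb;
    return static_cast<std::uint16_t>(bits >> 16);
}

// Hypothetical helper: widen a bfloat16 bit pattern back to float.
// This direction is exact because the low 16 bits are simply zero.
inline float bf16_bits_to_float(std::uint16_t h) {
    const std::uint32_t bits = static_cast<std::uint32_t>(h) << 16;
    float f;
    std::memcpy(&f, &bits, sizeof(f));
    return f;
}
```

Nothing here needs device support, which is why a host fallback is plausible; the actual implementation in the PR may differ.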

Change Structure of syclacademy

The current structure of syclacademy presents the buffer/accessor model before USM. This chapter makes the assumption that the programmer knows the difference between device and host memory, as well as...

Adding matrix transpose exercise

Adding a new exercise for matrix transpose, a simple intro to coalesced global memory accesses as well as local memory. Let me know if you think this should go somewhere else.

[CUDA][HIP] Use device to get native context

Since https://github.com/oneapi-src/unified-runtime/pull/999 it is no longer valid to get the native context from the SYCL context on a multi-GPU system. The get native func for contexts has been deprecated...

Every `CUDA_ERROR_FUNC` could allow a memory leak

If `CUDA_ERROR_FUNC`, `CUSOLVER_ERROR_FUNC`, etc. is called and the result `!= CUDA_SUCCESS`, a `cuda_error` will be thrown and any allocated pointers will not be deallocated, causing a memory leak. We should...
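The leak pattern can be reproduced without CUDA. The sketch below is only an illustration under assumptions: `fake_alloc`, `fake_free`, `check`, and `scoped_buffer` are hypothetical stand-ins for the real allocation calls and error macros, and wrapping the pointer in a `std::unique_ptr` with a custom deleter is one possible remedy, not necessarily the one the issue goes on to propose.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <memory>
#include <stdexcept>

// Hypothetical stand-ins for a device allocator; the counter lets us
// observe whether an allocation was released.
static int live_allocations = 0;

void* fake_alloc(std::size_t n) {
    ++live_allocations;
    return std::malloc(n);
}

void fake_free(void* p) {
    if (p) {
        --live_allocations;
        std::free(p);
    }
}

// Hypothetical stand-in for an error-checking macro: throws on failure.
void check(int status) {
    if (status != 0) throw std::runtime_error("call failed");
}

// Leaky pattern: if check() throws, fake_free() is never reached.
void leaky(int status) {
    void* scratch = fake_alloc(64);
    check(status);
    fake_free(scratch);
}

// RAII pattern: the deleter runs during stack unwinding, so the buffer
// is released whether or not check() throws.
struct FakeFree {
    void operator()(void* p) const noexcept { fake_free(p); }
};
using scoped_buffer = std::unique_ptr<void, FakeFree>;

void safe(int status) {
    scoped_buffer scratch(fake_alloc(64));
    check(status);
}
```

Calling `leaky(1)` inside a `try` block leaves `live_allocations` at 1, while `safe(1)` leaves it at 0.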

[UR] CI for UR PR refactor-guess-local-worksize

https://github.com/oneapi-src/unified-runtime/pull/1326

[UR] Run CI for UR PR

[SYCL] Add force range rounding option and introduce new compiler flag

Adds a new range rounding preference, `force`, such that when the compile flag is used, only the range-rounded `parallel_for` kernel is generated. This can make binaries smaller...
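For readers unfamiliar with the idea, the following plain C++ sketch shows range rounding in general terms: the launch range is padded up to a convenient multiple, and a guard skips the padded indices. The names `round_up` and `run_rounded` and the serial loop are illustrative assumptions, not the DPC++ implementation, and the multiple is a caller-chosen parameter here.

```cpp
#include <cassert>
#include <cstddef>

// Round n up to the next multiple of `multiple` (assumed to be non-zero).
inline std::size_t round_up(std::size_t n, std::size_t multiple) {
    return ((n + multiple - 1) / multiple) * multiple;
}

// Hypothetical host-side model of a range-rounded launch: iterate over the
// padded range, but only invoke the user's kernel for in-range indices.
template <typename Kernel>
void run_rounded(std::size_t user_range, std::size_t multiple, Kernel kernel) {
    const std::size_t rounded = round_up(user_range, multiple);
    for (std::size_t i = 0; i < rounded; ++i) {
        if (i < user_range) {  // guard against the padding
            kernel(i);
        }
    }
}
```

Generating only the guarded variant, rather than both a guarded and an unguarded kernel, is what would reduce binary size.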