gtensor
Create generic launch and assign kernels
The CUDA/HIP implementations are fragile in that certain array shapes can overflow the grid/block dimension limits of the launch configuration. Explore using linear launch indexing and mapping back to expression indices. This requires integer divide and modulo, which may hurt performance, but depending on computational intensity the cost may be negligible, and it may simplify the launch routines a lot. Ideally we would have one generic launch for any number of dimensions.
The WIP SYCL implementation currently only goes up to 3 dims, using range instead of nd_range, and similar challenges apply.