xtensor
[WIP] Added skeleton of batch-based GPU assignment
Checklist
DO NOT MERGE
- [ ] The title and commit message(s) are descriptive.
- [ ] Small commits made to fix your PR have been squashed to avoid history pollution.
- [ ] Tests have been added for new features or bug fixes.
- [ ] APIs of new functions and classes are documented.
Description
@JohanMabille This is a skeleton for how to move simple operations to the GPU using a strategy similar to the one used with XSIMD. I'm curious whether this would be an extensible strategy. I know the code doesn't compile; I have taken many shortcuts to demonstrate the concept.
Points of Concern:
- Containers are copied to the device multiple times when referenced in multiple expressions, rather than sharing one immutable shadow copy.
- GPU memory allocations and host-to-device transfers are expensive.
- Expressions are evaluated serially through the expression tree, when multiple streams/threads could be used in a reduction tree.
- Each batch is essentially a kernel launch, which has overhead, i.e. no kernel fusion. (This would require us to generate kernels with template metaprogramming, which would likely mean implementing the assignment operation as a kernel launch across a thread grid.)
- Currently proposing we use `thrust` or something similar from AMD/Intel, which has a cost as well, but this eliminates the need to worry about launching kernels, streams, and synchronization.
- The current method 'dispatches' work from the host to a device in an opaque way. We could also create a `gpu_container` for the public interface and attempt to implement the assignment as a CUDA kernel.
https://github.com/xtensor-stack/xtensor/issues/192
Hi, I'm quite interested in this idea. Maybe we can work together? Are you still interested in this idea?