xtensor
[WIP] Added skeleton of batch-based GPU assignment
Checklist
DO NOT MERGE
- [ ] The title and commit message(s) are descriptive.
- [ ] Small commits made to fix your PR have been squashed to avoid history pollution.
- [ ] Tests have been added for new features or bug fixes.
- [ ] APIs of new functions and classes are documented.
Description
@JohanMabille This is a skeleton for how to move simple operations to the GPU using a strategy similar to the one used with XSIMD. I'm curious whether this would be an extensible strategy. I know the code doesn't compile; I have taken many shortcuts to demonstrate the concept.
Points of Concern:
- Containers are copied to the device multiple times when referenced in multiple expressions, rather than sharing one immutable shadow copy.
- GPU memory allocations and host-to-device transfers are expensive.
- Expressions are evaluated serially through the expression tree, when multiple streams/threads could be used in a reduction tree.
- Each batch is essentially a kernel launch, which has overhead, i.e. no kernel fusion. (This would require us to generate kernels with template metaprogramming, which would likely mean implementing the assignment operation as a kernel launch across a thread grid.)
- Currently proposing we use `thrust` or something similar from AMD/Intel, which has a cost as well, but this eliminates the need to worry about launching kernels, streams, and synchronization.
- The current method 'dispatches' work from the host to a device in an opaque way. We could also create a `gpu_container` for the public interface and attempt to implement the assignment as a CUDA kernel.
https://github.com/xtensor-stack/xtensor/issues/192
Hi, I'm quite interested in this idea. Maybe we can work together? Are you still interested in this idea?