
[WIP] Added skeleton of batch based GPU assignment

Open spectre-ns opened this issue 1 year ago • 1 comment

Checklist

DO NOT MERGE

  • [ ] The title and commit message(s) are descriptive.
  • [ ] Small commits made to fix your PR have been squashed to avoid history pollution.
  • [ ] Tests have been added for new features or bug fixes.
  • [ ] API of new functions and classes are documented.

Description

@JohanMabille This is a skeleton for how to move simple operations to the GPU, using a strategy similar to the one used with XSIMD. I'm curious whether this would be an extensible strategy. I know the code doesn't compile; I have taken many short-cuts to demonstrate the concept.

Points of Concern:

  • Containers are copied multiple times when referenced in multiple expressions, rather than shared as one immutable shadow copy.
    • GPU memory allocations and host-device transfers are expensive.
  • Expressions are evaluated serially through the expression tree, even though multiple streams/threads could be used in a reduction tree.
  • Each batch is essentially a kernel launch, which has overhead, i.e. no kernel fusion. (Fusion would require generating kernels with template metaprogramming, which would likely mean implementing the assignment operation as a single kernel launch across a thread grid.)
  • Currently proposing we use Thrust (or something similar from AMD/Intel), which has a cost as well, but eliminates the need to worry about launching kernels, streams, and synchronization.
  • The current method 'dispatches' work from the host to a device in an opaque way. We could also create a gpu_container for the public interface and attempt to implement the assignment as a CUDA kernel.

https://github.com/xtensor-stack/xtensor/issues/192

spectre-ns avatar Jan 04 '25 18:01 spectre-ns

Hi, I'm quite interested in this idea; maybe we can work together? Are you still interested in it?

Roy-Kid avatar Apr 29 '25 08:04 Roy-Kid