
GPU support

Open bhack opened this issue 7 years ago • 29 comments

I want to start this topic just to discuss how GPU support could be introduced in the library.

  • SYCL (lack of full open-source compiler support, but great C++ integration; low-level optimization limits for kernels?)
  • CLCudaAPI (header-only helper)
  • OCCA
  • Boost Compute (lack of CUDA support)

/cc @randl @edgarriba

bhack avatar Mar 20 '17 11:03 bhack

See also section 3.4 in https://github.com/kokkos/array_ref/blob/master/proposals/P0331.rst

bhack avatar Apr 13 '17 11:04 bhack

Would be interesting, indeed

feliwir avatar Nov 02 '17 12:11 feliwir

Note that we now have strong simd support for broadcasting and a number of other use cases, based on the xsimd project.

SylvainCorlay avatar Nov 02 '17 12:11 SylvainCorlay

@SylvainCorlay SIMD is good and all, but the performance is not comparable to GPU acceleration. Especially for deep learning applications it makes a lot of sense to use this kind of acceleration. Since this library uses NumPy as its API orientation, I recommend looking at PyTorch, which has similar goals but uses GPU acceleration.

feliwir avatar Nov 02 '17 14:11 feliwir

GPU is in scope and on the roadmap. I meant that the work done for SIMD actually paved the way, since a lot of the required logic in xtensor is the same.

Note that frameworks like PyTorch don't implement compile-time loop unfolding like xtensor does, which can make xtensor faster on complex expressions.

SylvainCorlay avatar Nov 02 '17 14:11 SylvainCorlay
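To make the loop-unfolding point concrete, here is a minimal expression-template sketch, only in the spirit of what xtensor does (not its actual implementation): a + b builds a lightweight node, and the whole expression is evaluated element-wise in a single loop at assignment, which the compiler can inline and fully unroll for fixed sizes.

```cpp
#include <array>
#include <cstddef>
#include <type_traits>

struct expr_tag {};  // every expression type derives from this

template <class L, class R>
struct add_expr : expr_tag {
    add_expr(const L& l, const R& r) : lhs(l), rhs(r) {}
    const L& lhs;   // references stay valid for the duration of the
    const R& rhs;   // full expression, which is all this sketch needs
    double operator[](std::size_t i) const { return lhs[i] + rhs[i]; }
};

// Constrained so it only applies to our expression types.
template <class L, class R,
          class = std::enable_if_t<std::is_base_of_v<expr_tag, L> &&
                                   std::is_base_of_v<expr_tag, R>>>
add_expr<L, R> operator+(const L& lhs, const R& rhs) { return {lhs, rhs}; }

template <std::size_t N>
struct vec : expr_tag {
    std::array<double, N> data{};
    double operator[](std::size_t i) const { return data[i]; }
    template <class E>
    vec& operator=(const E& e) {  // one fused loop, no temporaries
        for (std::size_t i = 0; i < N; ++i) data[i] = e[i];
        return *this;
    }
};

// usage: vec<4> a, b, c, d;  d = a + b + c;
// builds add_expr<add_expr<vec<4>, vec<4>>, vec<4>> and evaluates in one pass
```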

Awesome, thanks for letting me know. How are you planning to implement it? Are contributions possible?

feliwir avatar Nov 02 '17 15:11 feliwir

I recommend using MAGMA as the backend to support both GPU and CPU.

AuroraDysis avatar Mar 11 '18 14:03 AuroraDysis

Just wondering if there are any updates on the topic. I'd love to make contributions where possible!

ktnyt avatar Jan 24 '19 07:01 ktnyt

We have GPU support on our roadmap for 2019. However, we're not yet sure how to do it concretely! So any input is highly appreciated. And of course, contributions are very welcome!

The thing we'd probably like to start out with is mapping a container to the GPU, and evaluating a simple binary function, such as A + B.

wolfv avatar Jan 24 '19 08:01 wolfv
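A minimal sketch of what that first step might look like, using SYCL as the backend (just one of the options discussed in this thread; nothing here is actual xtensor API): map two host containers to the device and evaluate C = A + B.

```cpp
#include <sycl/sycl.hpp>
#include <cstddef>
#include <vector>

int main() {
    constexpr std::size_t n = 1024;
    std::vector<float> a(n, 1.0f), b(n, 2.0f), c(n);

    sycl::queue q;  // default device: a GPU if one is available
    {
        // Map the host containers to device-visible buffers.
        sycl::buffer<float, 1> buf_a{a.data(), sycl::range<1>{n}};
        sycl::buffer<float, 1> buf_b{b.data(), sycl::range<1>{n}};
        sycl::buffer<float, 1> buf_c{c.data(), sycl::range<1>{n}};

        // Evaluate the binary expression element-wise on the device.
        q.submit([&](sycl::handler& h) {
            sycl::accessor A{buf_a, h, sycl::read_only};
            sycl::accessor B{buf_b, h, sycl::read_only};
            sycl::accessor C{buf_c, h, sycl::write_only, sycl::no_init};
            h.parallel_for(sycl::range<1>{n}, [=](sycl::id<1> i) {
                C[i] = A[i] + B[i];
            });
        });
    }  // buffer destructors synchronize and copy the result back into c
}
```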

Thanks for the prompt reply!

I haven't been able to make a deep dive into the code yet, but I was thinking that the implementation strategy from a recently released library called ChainerX (https://github.com/chainer/chainer/tree/master/chainerx_cc) might be of help. It basically provides device-agnostic, NumPy-like multi-dimensional arrays for C++. AFAIK they provide a Device abstract class that handles memory management and hides the hardware-specific implementations for a core set of routines. This is just an idea, but the Device specialization for CPU-specific code could be developed in parallel to xtensor, and when it is mature enough, the portions of code calling the synonymous routines could be switched out. The GPU specialization can be filled in later, and WIP routines can throw runtime or compile-time errors.

I'm not too familiar with the internals of xtensor so this might be an infeasible approach though.

ktnyt avatar Jan 24 '19 08:01 ktnyt
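A hypothetical sketch of the Device abstraction described above; the names are illustrative, not ChainerX's or xtensor's actual API:

```cpp
#include <cstddef>
#include <cstring>

// Abstract device: owns allocation and a core set of routines, so CPU and
// GPU backends can be swapped behind a single interface.
class Device {
public:
    virtual ~Device() = default;
    virtual void* allocate(std::size_t bytes) = 0;
    virtual void deallocate(void* ptr) = 0;
    virtual void copy(void* dst, const void* src, std::size_t bytes) = 0;
    // ... core routines (fill, add, reduce, ...) would be declared here
};

// CPU specialization: plain host memory.
class CpuDevice : public Device {
public:
    void* allocate(std::size_t bytes) override { return ::operator new(bytes); }
    void deallocate(void* ptr) override { ::operator delete(ptr); }
    void copy(void* dst, const void* src, std::size_t bytes) override {
        std::memcpy(dst, src, bytes);
    }
};

// A GpuDevice would implement the same interface with cudaMalloc/cudaMemcpy
// or SYCL equivalents; work-in-progress routines could simply throw.
```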

Definitely a good idea to look at ChainerX!

wolfv avatar Jan 24 '19 08:01 wolfv

I'm also interested in this feature, although I'm not sure how to do it! I would also point out, as a starting point, the ArrayFire package https://github.com/arrayfire/arrayfire. My loose understanding is that instead of loop fusion they perform kernel fusion.

miketoastmacneil avatar May 01 '19 19:05 miketoastmacneil

I think we should leave this open, as it isn't solved.

wolfv avatar Feb 11 '20 23:02 wolfv

This would be very interesting. Any progress/news?

fschlimb avatar Mar 23 '20 14:03 fschlimb

I don't know if you could be interested in https://llvm.discourse.group/t/numpy-scipy-op-set/

bhack avatar Mar 23 '20 21:03 bhack

> This would be very interesting. Any progress/news?

Not yet. We are trying to get some funding to start it.

JohanMabille avatar Mar 27 '20 13:03 JohanMabille

> > This would be very interesting. Any progress/news?
>
> Not yet. We are trying to get some funding to start it.

Any news on this subject?

fschlimb avatar Nov 25 '21 10:11 fschlimb

Is there any description of how you'd envision supporting GPUs, in particular through SYCL?

fschlimb avatar Nov 30 '21 17:11 fschlimb

Unfortunately no. We don't have any funding for implementing this.

JohanMabille avatar Nov 30 '21 21:11 JohanMabille

Why was this issue closed? xtensor does not yet have GPU support, does it?

antoniojkim avatar Jun 13 '22 23:06 antoniojkim

I don't know why it was closed, but it should definitely stay open until we can implement it.

JohanMabille avatar Jun 24 '22 09:06 JohanMabille

Are there any updates on a timeline for when xtensor might have GPU support?

antoniojkim avatar Jun 24 '22 13:06 antoniojkim

Hey, I'm a quant open to working on this; I have to research more about how the library works and how to map your containers to the GPU.

Which framework is better for this? CUDA might not be it, because you want the best performance on any GPU.

Maybe OpenCL could work, or another option.

Also, have you checked the NVIDIA implementations for std::par and the integrations with it? It would make everything easier, but I'm not sure if it will work with your library.

If you have std::vectors in the background, I'm pretty sure it will work.

Physicworld avatar Jun 28 '22 00:06 Physicworld
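For reference, the std::par route mentioned above is the standard C++17 parallel algorithms; with NVIDIA's nvc++ compiler and the -stdpar=gpu flag, the same code can be offloaded to the GPU, and otherwise it runs on CPU threads. A minimal sketch:

```cpp
#include <algorithm>
#include <execution>
#include <vector>

int main() {
    std::vector<float> a(1 << 20, 1.0f), b(1 << 20, 2.0f), c(1 << 20);

    // Standard parallel algorithm: with `nvc++ -stdpar=gpu` this transform
    // is offloaded to the GPU; with other compilers it uses CPU threads.
    std::transform(std::execution::par_unseq,
                   a.begin(), a.end(), b.begin(), c.begin(),
                   [](float x, float y) { return x + y; });
}
```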

@Physicworld I think the easiest / most portable solution would be to use SYCL.

feliwir avatar Jun 28 '22 07:06 feliwir

@Physicworld SYCL can be used with multiple backends, with full or experimental support for NVIDIA, AMD, and Intel. I think SYCL (and CUDA) has partial if not complete GPU implementations of the std:: algorithms, so that might be some low-hanging fruit.

spectre-ns avatar Jul 03 '22 16:07 spectre-ns

Given that SYCL can run on its host backend, it would be ideal: all the xtensor calls could be refactored into SYCL, and then one implementation would work on the host or GPU side with only a runtime toggle.

spectre-ns avatar Jul 03 '22 16:07 spectre-ns
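A small sketch of that runtime toggle, assuming a SYCL 2020 implementation (where the dedicated host device of SYCL 1.2.1 is gone and a CPU device plays that role):

```cpp
#include <sycl/sycl.hpp>
#include <iostream>

// Pick the device at runtime: try a GPU queue first, fall back to the CPU.
// The same kernels can then be submitted to whichever queue was created.
sycl::queue make_queue(bool want_gpu) {
    if (want_gpu) {
        try {
            return sycl::queue{sycl::gpu_selector_v};
        } catch (const sycl::exception&) {
            std::cerr << "no GPU available, falling back to CPU\n";
        }
    }
    return sycl::queue{sycl::cpu_selector_v};
}
```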

https://github.com/oneapi-src/oneAPI-samples

spectre-ns avatar Jul 03 '22 17:07 spectre-ns

Thanks for the great lib. Any progress/news?

ksvbka avatar Sep 20 '23 08:09 ksvbka

Nope, we are still searching for funding to implement new features in xtensor.

JohanMabille avatar Sep 20 '23 12:09 JohanMabille