
Support also doing processing on GPUs

Open mohawk2 opened this issue 2 years ago • 26 comments

This was discussed on pdl-devel back in Dec 2015 (see https://sourceforge.net/p/pdl/mailman/pdl-devel/?viewmonth=201512). There is also some discussion in Jun this year on PerlMonks (https://www.perlmonks.org/?node_id=11134476).

I recently revisited this; my thinking so far is that the best platform for mixed CPU/GPU processing seems to be OpenCL. However, OpenCL C doesn't support native complex processing. It would be possible to use OpenCL C++ with a class that overloads the arithmetic operators, for source compatibility with existing Code; an easy first-cut alternative would be to support GPU use only for real types (which might be further restricted to exclude double on pre-OpenCL 2.0 environments; GPUs have long been aimed at speed rather than extreme precision, because of graphics' needs).

Defining some terms: "kernel" is a term in GPGPU-land for the code that actually runs in the GPU, typically being run once per "work item", which is basically identical to the "inner loop" inside a broadcastloop %{ /* blah */ %} (and/or loop). Current PDL operations have code generated by PP that doesn't distinguish between the "kernel" and the setup code (which runs on the "host"), since the output is just a bunch of C to put inside the function. When that distinction starts to matter, it will need some work.

Things that would need doing to allow this:

  • split Code (etc) into kernel and setup code; another benefit of doing so would be to aid lazily-built ndarrays, since PDL could be made to set up a set of very small ndarrays, call the kernel on them, do something with the results, then throw them away - an example might be to sumover a very long sequence using basically no memory
  • figure out a way to allow calling at least one other broadcastloop initiator (as called in PP-generated C code); for back-compat with the current CPU-only setup it's not desirable to make PDL have a hard requirement of OpenCL to operate at all
  • implement dividing work into a task queue with 4x as many items as the number of processing units, as alluded to here (https://sourceforge.net/p/pdl/mailman/pdl-devel/?viewmonth=202109) (which will also help with parallelising non-deterministic tasks)
  • either figure out how to identify code that calls into GPU-unavailable C functions (e.g. LAPACK) and have that not use the GPU, or figure out the best way to manage opt-in / opt-out of GPU use - an easy way might be simply to try OpenCL-compiling the generated code, which will undoubtedly fail if any unsuitable function calls (or headers) are found

Resources:

  • https://github.com/CastXML/CastXMLSuperbuild which illustrates how to use CMake to build an external project (here, LLVM might be needed)
  • http://portablecl.org/docs/html/install.html shows how to install Portable OpenCL (POCL)
  • https://en.wikipedia.org/wiki/OpenCL#Example:_complex_number_arithmetic shows OpenCL C++ code to implement overloaded arithmetic operations for complex numbers
  • https://github.com/KhronosGroup/libclcxx OpenCL C++ libraries
  • https://www.khronos.org/registry/OpenCL/specs/3.0-unified/html/OpenCL_C.html the actual spec for OpenCL 3.0 C

mohawk2 avatar Nov 14 '21 23:11 mohawk2

An alternative approach that may help is to use OpenMP, which has support for "offloading" work to GPUs (and the API for doing so may be less low-level than OpenCL).

One upside of this approach may be that it would be straightforward to just annotate code with OpenMP pragmas.

Resources:

  • https://gcc.gnu.org/wiki/Offloading (Ubuntu's GCC seems to have NVIDIA offloading support as standard config, but Strawberry Perl 5.32's does not)
  • https://stackoverflow.com/questions/66307810/openmp-runtime-does-not-sees-my-gpu-devices
  • https://on-demand.gputechconf.com/gtc/2018/presentation/s8344-openmp-on-gpus-first-experiences-and-best-practices.pdf presentation from 2018

mohawk2 avatar Nov 15 '21 01:11 mohawk2

GitHub-hosted Actions runners don't have GPUs. But emulators seem to exist that would allow CI of GPU offloading, since we aren't worried about performance, only about checking basic functionality.

Resources:

  • https://srirajpaul.blogspot.com/2019/07/cuda-gpu-simulator-container.html
  • https://github.com/gpgpu-sim/gpgpu-sim_distribution/tree/dev
  • https://stackoverflow.com/questions/3087361/gpu-emulator-for-cuda-programming-without-the-hardware
  • https://www.nersc.gov/assets/Uploads/Talk-GTC2019-ProgrammingModels.pdf

mohawk2 avatar Nov 15 '21 02:11 mohawk2

A lot of scientists have Macs, and there seems to be no good cross-platform solution for this.

CUDA is NVIDIA-only and not supported on macOS. OpenCL is deprecated in favour of Metal, which is Apple-only.

I don’t know if there is a good solution?

Karl

karlglazebrook avatar Nov 15 '21 04:11 karlglazebrook

PDL today (as of 2.059) on Macs will automatically use all available CPU cores using POSIX threads. That's not terrible already.

To use OpenMP on Mac, these are apparently effective instructions: https://stackoverflow.com/questions/43555410/enable-openmp-support-in-clang-in-mac-os-x-sierra-mojave (alternative: https://mac.r-project.org/openmp/#do). While there isn't yet support for OpenMP offloading to GPU on Mac, this recent discussion suggests such may not be far away: https://groups.google.com/g/llvm-dev/c/l45OcKvt_0w

Otherwise, given how relatively expensive Mac hardware is for the power you get, you might consider a PC gaming laptop (for powerful CPU and GPU) and run Linux on it ;-)

mohawk2 avatar Nov 15 '21 05:11 mohawk2

Sure troll me!

I think the recent benchmarks have shown the Mac laptops smoking the Intel ones, and way way better per watt. I have one and that is my experience too, and it wasn’t that expensive!

Karl

karlglazebrook avatar Nov 15 '21 07:11 karlglazebrook

This shows a proper analysis of M1 for science: https://github.com/neurolabusc/AppleSiliconForNeuroimaging - for me, a bit of a bombshell is that Metal doesn't do double precision, which seems to limit its value for scientific purposes.

My reading indicates you're 100% right on M1 being vastly better per watt. My question: is it better per pound sterling?

mohawk2 avatar Nov 15 '21 15:11 mohawk2

I will try not to get dragged further into an ‘A is better than B’ debate :-). I will just say my own experience in one year with an M1 MacBook Air has been entirely positive. Everything runs faster and cooler, with much better battery life, and I find the cost premium (20%?) a net benefit.

“Everything” still includes a lot of open source science code running under Rosetta (which all works fine, BTW!), because it will take a while for the open source community to do architecture ports. The last seems especially true of deep Python stacks: still waiting on Anaconda! On that note I do kind of think ‘how hard can it be’, as open source code should not in principle really care about your CPU, and I recompiled PDL myself just fine back in January, relatively easily.

Re Metal and single precision. Yes if that is right then a big ouch indeed! Hopefully it is an API issue that can be improved, and not a limitation of the underlying GPU. BTW Tensorflow has been ported to Metal, which is an example of scientific acceleration in practice.

K

karlglazebrook avatar Nov 17 '21 00:11 karlglazebrook

I will try not to get dragged further into an ‘A is better than B’ debate :-). I will just say my own experience in one year with an M1 MacBook Air has been entirely positive.

Glad to hear it!

Re Metal and single precision. Yes if that is right then a big ouch indeed! Hopefully it is an API issue that can be improved, and not a limitation of the underlying GPU. BTW Tensorflow has been ported to Metal, which is an example of scientific acceleration in practice.

https://github.com/hashcat/hashcat/issues/2976 indicates that the Apple Silicon doesn't have double precision in hardware:

  Device Name                                     Apple M1
  [...]
  Double-precision Floating-point support         (n/a)

so there's no reason for Apple to make a Metal API for it. I think that implies that massively parallel double-precision stuff will be much faster on NVIDIA hardware, but unless the processing is highly CPU-intensive, the performance limit will still be memory bandwidth. Which also has implications for how much benefit there is in adding GPGPU support to PDL (possibly not much).

An alternative approach to increasing locality for more complex processing might be JIT operator-building. This would take the transformations being applied to ndarrays and combine them into a single, JIT-created operator, which would do all the steps in each "broadcastloop" while the data is still in cache, thereby limiting how many round trips data takes out of main RAM and back.

mohawk2 avatar Nov 17 '21 02:11 mohawk2

hashcat/hashcat#2976 indicates that the Apple Silicon doesn't have double precision in hardware [...] so there's no reason for Apple to make a Metal API for it.

Well this is OpenCL reporting that, and Apple now deprecates that, so it may not be accurate.

But you are right: until the situation settles down there is no benefit in trying to use GPUs to accelerate our array math.

An alternative approach to increasing locality for more complex processing might be JIT operator-building.

This is an idea that has been expressed before on this mailing list, and remains a good one.

Karl

karlglazebrook avatar Nov 17 '21 02:11 karlglazebrook

  Device Name                                     Apple M1
  [...]
  Double-precision Floating-point support         (n/a)

Well this is OpenCL reporting that, and Apple now deprecates that, so it may not be accurate. But you are right: until the situation settles down there is no benefit in trying to use GPUs to accelerate our array math.

Deprecating it doesn't mean they are not supporting it for some time to come. In particular, it in no way means the clinfo is incorrect. Even Apple aren't quite that arrogant.

mohawk2 avatar Nov 17 '21 06:11 mohawk2

To expand a bit on the above-mentioned point about "kernels", @zmughal has correctly pointed out this is basically about "loop fusion".

Further thinking on that (as also discussed on IRC #native): PDL operations are referred to internally as "transformations". Tuomas's vision, as I infer it from docs and code, was that there would be function-type ones as well as slice-type ones (EDIT: in pdl.transtype there was also possibly a "retype" type, but the field was never used so it was removed this year), though only slice-type ones were implemented. To achieve this loop fusion there needs to be parsing of the code as supplied (PP already does this, so it could just be stored somewhere), then sellotaping the relevant loops together based on knowledge of the desired output. That knowledge would come from making function-type transforms a real thing, which would be done by lazy-evaluating more, such that a not-yet-evaluated multi-stage transform could, on evaluation, be queried for its full shape, and then sellotaped together into one big bespoke transform.

The Wikipedia page linked above says that the latest clang (12) and gcc (11.1) don't do redundant-allocation removal, and it seems to me that avoiding redundant allocation is very important. One way forward might be to use variable-length arrays (allocated on the stack) so the compiler knows the data doesn't necessarily ever need to go in main RAM.

mohawk2 avatar Nov 18 '21 20:11 mohawk2

Actual measurement of performance with a simple loop-fusion, code:

use strict;
use warnings;
use Time::HiRes qw(gettimeofday tv_interval);
use PDL;
use Inline Pdlpp => 'DATA';

sub with_time (&) {
  my @t = gettimeofday();
  $_[0]->();
  printf "%g ms\n", tv_interval(\@t) * 1000;
}

$PDL::BIGPDL = $PDL::BIGPDL = 1; # double assignment avoids a "used only once" warning
my $N = $ARGV[0];
my ($a, $b, $c) = (ones($N), sequence($N), sequence($N));

print "with intermediates\n";
with_time { print +($a + $b * sin($c))->info, "\n" } for 1..5;
print "manual loop-fusion\n";
with_time { print PDL::a_plus_b_sin_c($a, $b, $c)->info, "\n" } for 1..5;

__DATA__
__Pdlpp__
pp_def('a_plus_b_sin_c',
  Pars => 'a(); b(); c(); [o]d()',
  GenericTypes => ['D'],
  Code => '$d() = $a() + $b() * sin($c());',
);

Results with $N of 60e6 (tuned to put real load on my setup, but not cause meaningful swapping), with the info output removed:

with intermediates
1393.23 ms
1402.13 ms
1385.48 ms
1389.91 ms
1375.89 ms
manual loop-fusion
663.921 ms
646.778 ms
639.667 ms
641.872 ms
646.683 ms

In nice round numbers, the manually loop-fused operation above was a little over twice as fast as the naive version. This was also true with lower values of $N.

mohawk2 avatar Nov 20 '21 22:11 mohawk2

I am merging #354 into this.

In an email in Dec 2015, Chris wrote things that could be incorporated in a future PDL:

ArrayFire GPGPU parallel computing: http://www.arrayfire.com/docs/index.htm This could give PDLng a quick start for high performance using GPUs.

COS, the C Object System: https://github.com/CObjectSystem/COS/blob/master/doc/cos-draft-dls09.pdf https://github.com/CObjectSystem/COS This could give PDLng an object oriented interface that maintains a C FFI so that PDLng could call or be called from other languages. Matplotlib anyone?

HDF5 Attributes: http://www.hdfgroup.org/HDF5/doc1.6/UG/13_Attributes.html I was thinking that our PDLng implementation should cover the bases from FITS, HDF5, named dimensions,.. Ingo has already done some work along these lines.

mohawk2 avatar Dec 31 '21 02:12 mohawk2

Just been re-reading the COS docs. It is primarily about dynamic polymorphism, which I cannot see a great need for in any current or future PDL. The heart of PDL is to write some type-generic Code (with possible Bad, and/or Back analogues) which would do the right thing on any supported PDL types.

A need for PDL that I am starting to see is first-class dimensions that could be referred to in Code, which would allow dimension-updates to be loop-fused (and setdims etc to be a first-class operation rather than underlying API). That would imply RedoDims stuff to also be first-class, for the same reason. A thing that would help both RedoDims and BackCode would be a more mathematical (and potentially therefore reversible) way to express transformations. That suggests a more declarative style, which seems to be antithetical to the imperative style of C.

I am open to the idea that COS could be part of the solution.

Named dimensions (HDF5 etc style) would be a beneficial thing generally, and an "einops" style of expressing things might be part of that reversible dimension-translating.

It seems unlikely, but not impossible, that ArrayFire is going to help us much. However, they do support farming processing out to OpenCL and CUDA, so it could be an alternative to OpenMP. What would be needed is an Alien::ArrayFire similar to Alien::OpenMP, then a generalisation of broadcastloop to make our current pthread-only implementation be just one possibility. Given the language is a bit different from C, that might be challenging, and OpenMP/OpenCL would probably be easier. However, the generalisation would need to happen in any case. Also needed would be to store in the pdl_transvtable the Code etc, for accessing by OpenCL/ArrayFire/whatever.

mohawk2 avatar Dec 31 '21 02:12 mohawk2

To expand a bit on the above-mentioned point about "kernels", @zmughal has correctly pointed out this is basically about "loop fusion".

Another optimisation to think about which isn't as well supported within PDL is "loop tiling/blocking". https://www.intel.com/content/www/us/en/developer/articles/technical/loop-optimizations-where-blocks-are-required.html. I noticed it being used as I was looking over the current matmult implementation. There are also cache-oblivious algorithms and libraries that implement more performant versions of these operations for particular sizes and kinds of matrices, so loop tiling may not be as necessary as loop fusion.

zmughal avatar Jan 02 '22 01:01 zmughal

Another optimisation to think about which isn't as well supported within PDL is "loop tiling/blocking". https://www.intel.com/content/www/us/en/developer/articles/technical/loop-optimizations-where-blocks-are-required.html. I noticed it being used as I was looking over the current matmult implementation. There are also cache-oblivious algorithms and libraries that implement more performant versions of these operations for particular sizes and kinds of matrices, so loop tiling may not be as necessary as loop fusion.

I've read the matmult code in the past, and at the time (for me) it was quite eye-opening! This does bring in the idea of enhancing >1-dimensional loop() constructs to use tiling (probably with a check that the contained $x() lookups all use no specified index-lookups, to avoid any out-of-order sequencing causing problems).

I'm also reading https://www.alcf.anl.gov/sites/default/files/2020-01/OpenMP45_Bertoni.pdf, which is a really good summary of OpenMP 4.5 constructs, especially the map stuff. It would be conceivable to insert OpenMP #pragmas, with a bit of extra code to calculate tiles, to achieve tiling.

mohawk2 avatar Jan 02 '22 02:01 mohawk2

Thinking about how to make axisvalues a proper transformation (and the same for ndcoords): what might help with the need to loop over arbitrary numbers of dims would be a new multidimloop directive, something like the following (with any ndarray access benefiting from the loop "context", so it can just be $x() instead of spelling out the dimensions in some way):

multidimloop(PARENT, n) %{
  $CHILD() = n;
%}

This doesn't yet help with a proper reshape transformation, though since that is pure Perl (a clump(-1) then a setdims), it just needs a first-class setdims.

A more generalised multiloop is not yet clear to me but seems necessary.

mohawk2 avatar Jan 03 '22 01:01 mohawk2

#324 acts as a reminder once this gets closer to completion that PDL::Dataflow will want a thorough update since it still has stuff in there about "families" and some quite complicated stuff that I don't think would be very useful.

mohawk2 avatar Jan 04 '22 18:01 mohawk2

Note to self: when (not if) we complete this, we'd update the PDL::PP docs to revise this:

generate code that takes advantage of the vectorising/parallel computing features of your machine (this a project for the future)

mohawk2 avatar Feb 19 '22 10:02 mohawk2

This follows on from a long-running branch on main PDL (index_ops), which has the bare start of a slightly wider implementation of "index operations" (extending the idea of indadd to multiplication, copying, etc), and from some discussion on pdl-porters (https://www.mail-archive.com/[email protected]/msg05754.html). That links to the High-Performance Fortran (HPF) 2.0 spec (http://dacnet.rice.edu/versions/hpf2/hpf-v20.pdf), which on pages 91-97 (PDF pages 101-107) discusses "scatter-combine" operations. Those can be considered equivalent to using indexND with plus etc.

indexND is implemented using the range operation, which sort of cheats in PDL terms by not having any Pars at all. This is alluded to in the "Practical Magick" paper.

It occurs to me that indadd, and the above "index operations", and indexND, can be considered a generalisation of "broadcasting". That broadcasts the given operation over the full range of its broadcasted-up inputs. One consequence of this is that read (and writeback with two-way dataflow) routinely copy the whole broadcasted range, which as noted in the docs of range is wasteful.

What we could have in addition is "narrowcasting". It would generate additional read (and if needed writeback) code, possibly in different functions with additional index-array parameters (these might or might not want to be proper ndarrays). This would be parallel with the "vaffine" form of transformation which is a rectangular view onto another ndarray's actual data block.

mohawk2 avatar Mar 29 '22 15:03 mohawk2

Another idea that occurred while pondering the above moves in the opposite direction: small and lightweight, versus increasingly magnificent abstractions. Currently PDL operations (in particular the XS wrapper functions), if given non-broadcasting inputs, and/or single elements, and/or Perl scalars, do the full conversion of those to ndarrays, then set up a full pdl_trans, then call the full broadcasting apparatus.

This has been noted as being expensive (i.e. slow). An alternative would be to have the XS wrappers have some extra generated code which detects when the full machinery is not required, and interpolates a modified version of the fundamental code (or "kernel"). That could even allow for just using IV/NV without further conversion, and even allow translating $x(n=>n1,m=>m1) into lookups into n-dimensional Perl arrays-of-arrays. One could think of this as making PDL function as a sort of Perl Language for operating on Data (not very catchy, probably needs a better name).

This would allow PDL libraries to compete directly with e.g. https://metacpan.org/pod/Math::GSL

mohawk2 avatar Mar 29 '22 15:03 mohawk2

As noted in the updated PDL::Dataflow and on this PerlMonks thread, it would be interesting to shift PDL's threading model from "wait while all the threads return" to "start them off then react to their completion" using an event loop.

mohawk2 avatar Apr 21 '22 08:04 mohawk2

Loop fusion note: uses of which, index, where, etc should be dealt with appropriately inline, rather than actually generating the index sets in a naive way. Observation from https://perlmonks.org/?node_id=11144070

mohawk2 avatar May 22 '22 01:05 mohawk2

Discussion of parsing Code chunks in order to support MPFR, etc, which might also facilitate translating into OpenCL: https://perlmonks.org/?node_id=11144472

mohawk2 avatar Jun 07 '22 04:06 mohawk2

This might be a discussion for PDL::LinearAlgebra, but FYI in case it is relevant:

"The NVBLAS Library is built on top of the cuBLAS Library using only the CUBLASXT API (refer to the CUBLASXT API section of the cuBLAS Documentation for more details). NVBLAS also requires the presence of a CPU BLAS library on the system. Currently NVBLAS intercepts only compute-intensive BLAS Level-3 calls (see table below). Depending on the characteristics of those BLAS calls, NVBLAS will redirect the calls to the GPUs present in the system or to the CPU. That decision is based on a simple heuristic that estimates if the BLAS call will execute for long enough to amortize the PCI transfers of the input and output data to the GPU."

[ https://docs.nvidia.com/cuda/nvblas/ ]

It currently supports these routines; all others are passed to the system BLAS:

  • gemm
  • syrk
  • herk
  • syr2k
  • her2k
  • trsm
  • trmm
  • symm
  • hemm

KJ7LNW avatar Nov 01 '22 00:11 KJ7LNW