
MAHOUT-1974 CUDA support

Open nsakharnykh opened this issue 8 years ago • 11 comments

Initial PR for CUDA bindings support through JCuda

nsakharnykh avatar Apr 27 '17 03:04 nsakharnykh

Tests pass on my system:

Mahout JVM Sparse multiplication time: 1914 ms.
Mahout JCuda Sparse multiplication time: 195 ms.
- sparse mmul at geometry of 1000 x 1000 %*% 1000 x 1000 density = .2.  5 runs
Mahout JVM Sparse multiplication time: 43 ms.
Mahout JCuda Sparse multiplication time: 11 ms.
- sparse mmul at geometry of 1000 x 1000 %*% 1000 x 1000 density = .02.  5 runs
Mahout JVM Sparse multiplication time: 2 ms.
Mahout JCuda Sparse multiplication time: 1 ms.
- sparse mmul at geometry of 1000 x 1000 %*% 1000 x 1000 density = .002.  5 runs
UserSetCUDATestSuite:
Mahout JVM Sparse multiplication time: 45 ms.
Mahout JCuda Sparse multiplication time: 10 ms.
User Defined sparse mmul at geometry of 1000 x 1000 %*% 1000 x 1000 density = 0.02 3 runs : 10 ms
- User Defined sparse mmul at geometry of 1000 x 1000 %*% 1000 x 1000 density = 0.02 3 runs 

andrewpalumbo avatar Apr 27 '17 18:04 andrewpalumbo

@nsakharnykh @rawkintrevo I intend to have dense hammered out on Sunday.

andrewpalumbo avatar Apr 28 '17 18:04 andrewpalumbo

@nsakharnykh, @rawkintrevo, I ran out of time tonight to finish dense %*% dense and dense %x% sparse; I went down a rabbit hole with the NVIDIA C API docs for cuSPARSE. I noticed that JCuda supports only a single dense-dense dgemm algorithm, and it expects column-major matrices. Most Mahout matrices are row-major. When I started on the dense-sparse multiplication I was slightly thrown off by what seems to be a required CSR compression; it seems that sparse matrices should be compressed as CSC instead. Anyway, I ended up in the LAPACK Fortran; apologies for not finishing it up tonight, guys, I got off on a long tangent and ran out of time.
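For reference, here is a minimal sketch in plain Scala (purely illustrative, not the JCuda binding code) of the two compressed formats in question. CSC is simply CSR applied to the transpose, which is why cuSPARSE's csr2csc routine can serve both as a format conversion and as a transpose.

```scala
// Illustrative only: the 3 x 3 matrix
//   [ 1 0 2 ]
//   [ 0 0 3 ]
//   [ 4 5 0 ]
// encoded in the two compressed formats discussed above.
object CompressedFormatsSketch extends App {
  // CSR (compressed sparse row): non-zeros stored row by row.
  val csrValues = Array(1.0, 2.0, 3.0, 4.0, 5.0) // row-by-row non-zeros
  val csrColIdx = Array(0, 2, 2, 0, 1)           // column index of each value
  val csrRowPtr = Array(0, 2, 3, 5)              // row i spans values(csrRowPtr(i) until csrRowPtr(i + 1))

  // CSC (compressed sparse column): the same non-zeros stored column by column.
  // Note this is exactly the CSR encoding of the transpose.
  val cscValues = Array(1.0, 4.0, 5.0, 2.0, 3.0) // column-by-column non-zeros
  val cscRowIdx = Array(0, 2, 2, 0, 1)           // row index of each value
  val cscColPtr = Array(0, 2, 3, 5)              // column j spans values(cscColPtr(j) until cscColPtr(j + 1))

  // Sanity check: expand the CSR encoding back to a dense row-major array.
  val n = 3
  val dense = Array.ofDim[Double](n * n)
  for (i <- 0 until n; p <- csrRowPtr(i) until csrRowPtr(i + 1))
    dense(i * n + csrColIdx(p)) = csrValues(p)
  println(dense.grouped(n).map(_.mkString(" ")).mkString("\n"))
  // 1.0 0.0 2.0
  // 0.0 0.0 3.0
  // 4.0 5.0 0.0
}
```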

I pushed my initial work up to my MAHOUT-1974 branch. Nothing really worth looking at right now, but I will make a PR against this once I get the dense work together.

Regardless, I should have at least a quick-and-dirty version ready to go soon, while I work out what we'll need for experiments and benchmarking. We can still discuss and consider different Spark configurations tomorrow, even without the dense cases, but I'd of course like to get this right.

As I mentioned on the last call, we allow a "Sparse" DRM's in-core components to be either sparse or dense. Currently the threshold for converting a DRM block from a sparse to a dense matrix is pretty high (an estimated 25% non-zero). In the future we will need to let the user set this sparsity threshold somehow.

FYI: https://github.com/apache/mahout/blob/master/math-scala/src/main/scala/org/apache/mahout/math/scalabindings/package.scala#L431
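For context, the conversion decision referenced above boils down to estimating the fraction of non-zero elements in a block and densifying when that estimate exceeds the threshold. Below is a minimal, hedged sketch in plain Scala; the names and the hard-coded 0.25 are illustrative, and the actual logic lives in the linked package.scala.

```scala
// Illustrative sketch only -- not the Mahout implementation (see the linked package.scala).
object BlockDensitySketch {
  // Roughly the ~25% non-zero estimate mentioned above; the future work discussed
  // here is letting users override this value.
  val DensityThreshold = 0.25

  /** Estimate block density from per-row non-zero counts (e.g. a sample of rows). */
  def estimateDensity(rowNnz: Seq[Int], ncol: Int): Double =
    if (rowNnz.isEmpty || ncol == 0) 0.0
    else rowNnz.sum.toDouble / (rowNnz.size.toDouble * ncol)

  /** True when the block is dense enough that a dense in-core matrix pays off. */
  def shouldDensify(rowNnz: Seq[Int], ncol: Int, threshold: Double = DensityThreshold): Boolean =
    estimateDensity(rowNnz, ncol) > threshold
}

// Example: a 1000-column block whose sampled rows hold ~300 non-zeros each (~30% dense)
// would be converted to a dense in-core matrix:
//   BlockDensitySketch.shouldDensify(Seq.fill(20)(300), 1000)  // true
```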

andrewpalumbo avatar May 01 '17 05:05 andrewpalumbo

@andrewpalumbo regarding column-major: yes, this is the default mode for CUBLAS; sorry, I think I didn't mention it in my original email. There are a couple of options we can exercise here:

1. We can use the transposed versions of the gemm routines if the input matrices are row-major. I think the output matrix will always be column-major, so we'll have to transpose it using geam if we want to keep it in a different format.
2. We can also keep the dense matrices in column-major format on the GPU and move between CSC and CSR formats for sparse matrices using CUSPARSE conversion routines like csr2csc. There are also existing API functions in CUSPARSE to convert sparse to dense (csr2dense) and the other way around (dense2csr).

I think we should try to use the available conversion APIs from CUSPARSE as much as possible to avoid writing this on our own.
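Related to option 1 above, the underlying identity may be worth spelling out: a row-major m x k buffer read as column-major is just the k x m transpose, so C = A * B over row-major buffers can be obtained from a column-major gemm by swapping the operands and dimensions (C^T = B^T * A^T), with no explicit transpose or geam pass. The following is a plain-Scala illustration of that identity, not the JCuda/CUBLAS code itself:

```scala
// Plain-Scala illustration (not the JCuda code): a naive column-major gemm,
// then the operand-swap trick that yields a row-major product from it.
object RowMajorViaColMajorGemm extends App {
  /** C = A * B, all buffers column-major; A is m x k, B is k x n, C is m x n. */
  def gemmColMajor(m: Int, n: Int, k: Int,
                   a: Array[Double], lda: Int,
                   b: Array[Double], ldb: Int,
                   c: Array[Double], ldc: Int): Unit =
    for (j <- 0 until n; i <- 0 until m) {
      var s = 0.0
      for (p <- 0 until k) s += a(i + p * lda) * b(p + j * ldb)
      c(i + j * ldc) = s
    }

  // Row-major inputs: A = [[1, 2], [3, 4]], B = [[5, 6], [7, 8]]
  val m = 2; val k = 2; val n = 2
  val aRow = Array(1.0, 2.0, 3.0, 4.0)
  val bRow = Array(5.0, 6.0, 7.0, 8.0)
  val cRow = new Array[Double](m * n)

  // A row-major buffer read as column-major is the transpose, so computing
  // C^T = B^T * A^T in column-major terms fills cRow with the row-major C = A * B.
  gemmColMajor(n, m, k, bRow, n, aRow, k, cRow, n)

  println(cRow.mkString(", ")) // 19.0, 22.0, 43.0, 50.0  ==  row-major [[19, 22], [43, 50]]
}
```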

nsakharnykh avatar May 04 '17 15:05 nsakharnykh

@nsakharnykh I have my MAHOUT-1974 branch that is almost complete with dense, etc. (less the column-major issues). We'd discussed just making a PR against this, but it may be easiest if you just went ahead and pushed this to MAHOUT/CUDA, and then I'll make a PR against that, which will be public so that others may comment on it.

andrewpalumbo avatar May 07 '17 22:05 andrewpalumbo

@nsakharnykh https://github.com/andrewpalumbo/mahout/tree/MAHOUT-1974/cuda ^^ P.S. this is still WIP, so there's a lot of garbage in it.

andrewpalumbo avatar May 07 '17 22:05 andrewpalumbo

@andrewpalumbo Ok, sounds good. I'll try to push what I have as soon as I have some time in front of my laptop. I'm currently at GTC so my schedule is a bit fragmented.

nsakharnykh avatar May 07 '17 22:05 nsakharnykh

Great, thanks. I figured you were there and very busy. I'll keep working on my end, and there should be no (or few) conflicts. No rush, since my branch is based off of yours.

andrewpalumbo avatar May 07 '17 22:05 andrewpalumbo

looking awesome @nsakharnykh @andrewpalumbo

Before merging, don't forget to fill out https://github.com/apache/mahout/blob/master/website/docs/native-solvers/cuda.md

rawkintrevo avatar May 08 '17 00:05 rawkintrevo

@rawkintrevo I asked @nsakharnykh to just go ahead and push this to the mahout/CUDA branch, since he's already up at GTC and has spotty time to do this, and we're pushing this through as quickly as possible. I will immediately open a [WIP] PR from my https://github.com/andrewpalumbo/mahout/tree/MAHOUT-1974/cuda branch (on top of his) and will fill out the .md from there.

andrewpalumbo avatar May 08 '17 02:05 andrewpalumbo

Just checking whether we need to keep this PR open - I'm guessing this is already merged into the feature branch: https://github.com/apache/mahout/tree/CUDA

balashashanka avatar Apr 23 '24 18:04 balashashanka