link libnvblas?
libnvblas.so gets installed with the existing CUDA libraries. Apparently this can be enabled as the drop-in BLAS library for R, and is smart enough to let OpenBLAS handle things and only take over when it can provide significant acceleration (?)
EDIT
Haven't found great documentation on setup or performance, but it looks like this can be done as a one-off at runtime by setting LD_PRELOAD and configuring the fallback to OpenBLAS:
```sh
## create config file:
echo "NVBLAS_LOGFILE nvblas.log
NVBLAS_CPU_BLAS_LIB /usr/lib/libopenblas.so
NVBLAS_GPU_LIST ALL" > /etc/nvblas.conf
```
Run R with these env vars:

```sh
NVBLAS_CONFIG_FILE=/etc/nvblas.conf LD_PRELOAD=/usr/local/cuda/lib64/libnvblas.so.9.0 R
```
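One way to sanity-check that the preload is actually engaging the GPU (a sketch; the library path and `.so.9.0` version suffix are assumptions that depend on the CUDA install):

```sh
# Kick off a large dense multiply in the background with nvblas preloaded
NVBLAS_CONFIG_FILE=/etc/nvblas.conf \
LD_PRELOAD=/usr/local/cuda/lib64/libnvblas.so.9.0 \
Rscript -e 'n <- 8000L; a <- matrix(rnorm(n * n), n); invisible(a %*% a)' &

# Watch utilization while it runs; it should rise well above idle
nvidia-smi --query-gpu=utilization.gpu,memory.used --format=csv -l 1

# nvblas also records intercepted calls in the logfile named in nvblas.conf
cat nvblas.log
```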
Will have to benchmark a bit, but maybe worth adding this into our cuda/base setup @noamross?

> Will have to benchmark a bit
Seconded. We should definitely document that it is there, but I am not convinced it will always be a winner. Then again I am also often wrong when guessing :)
Yeah, it's not clear to me what the appropriate benchmark comparison is -- obviously the difference between a given operation on GPU vs CPU depends a lot on exactly what GPU vs what CPU you have on the platform.
That said, I imagine people will really only be deploying the `rocker/cuda` images on machines with significant GPUs available, if not on hardware explicitly optimized for GPU use (e.g. GPU-type instances on AWS). I do see some substantial improvement in low-level linear algebra operations; things like calculating a determinant can see a factor-of-10 speed-up. For typical R use I doubt a lot of operations would see gains like that, but then this image is already aimed at more specialized applications intended for GPU anyway.
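A micro-benchmark of this shape is one way to check numbers like that (a sketch; the matrix size and the exact speed-up are hardware-dependent assumptions):

```sh
# Time a dense determinant against the stock BLAS...
Rscript -e 'n <- 4000L; a <- matrix(rnorm(n * n), n); print(system.time(determinant(a)))'

# ...and again with libnvblas preloaded, using the config from above
NVBLAS_CONFIG_FILE=/etc/nvblas.conf \
LD_PRELOAD=/usr/local/cuda/lib64/libnvblas.so.9.0 \
Rscript -e 'n <- 4000L; a <- matrix(rnorm(n * n), n); print(system.time(determinant(a)))'
```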
Note in this experimental repo we have the cpu-based `rocker/ml` as well as the `rocker/ml-gpu`; only the latter builds on `rocker/cuda` and would thus get the GPU BLAS. Of course a lot of the specialized ML packages (xgboost, h2o, keras) are either already linking these libs (via their calls to python or java), or else doing other gpu-optimized algorithms, but having `rocker/cuda` support GPU BLAS out of the box could make it a useful image for users where the GPU linear algebra is useful in contexts wholly apart from the ML packages.
Agreed that we should benchmark, but in principle it seems a reasonable default for the `cuda`-based images. If you have an experimental fork with a script I'll get to it on our hardware, and maybe others (@MarkEdmondson1234) can give it a go, too?
@noamross Thanks!
Yes, I think I have an experimental version of this on the `nvblas` branch in `cuda/base/Dockerfile`. (Help testing would be great, since I just had to send my System76 desktop with my GPU back to the shop for weird crashing behavior :-( ).
So one thing is that I'm following NVIDIA's advice to use `LD_PRELOAD` instead of re-linking. Like they say, you don't want to set `LD_PRELOAD` globally, since then it would get set before every shell command run on the system, so I cribbed this approach to load it just before the `R`, `Rscript`, and `rserver` sessions:

https://github.com/rocker-org/ml/blob/87726cf095181d1340736f7e19ec8e8617132bdf/cuda/base/Dockerfile#L89-L105
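For readers who don't want to chase the link, the shape of that trick is roughly the following (a sketch, not the exact contents of the linked Dockerfile; the paths and the `CUDA_BLAS` variable are assumptions):

```sh
# Shadow the real launcher with a thin wrapper so LD_PRELOAD is set
# only for this one process, never globally
CUDA_BLAS=/usr/local/cuda/lib64/libnvblas.so.9.0

mv /usr/local/bin/R /usr/local/bin/R.real
cat > /usr/local/bin/R <<EOF
#!/bin/sh
NVBLAS_CONFIG_FILE=/etc/nvblas.conf LD_PRELOAD=$CUDA_BLAS exec /usr/local/bin/R.real "\$@"
EOF
chmod +x /usr/local/bin/R
# (repeat analogously for Rscript and rserver)
```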
I'm really not sure that's the best way to do this. If we're adding it to the library, it probably makes more sense to configure it directly as the system's BLAS, but I'd have to refresh on how to do that (particularly in a non-interactive session like the Dockerfile). @eddelbuettel has loads more experience with linking BLAS libraries and can probably give us some pointers (perhaps after recovering from the horror of seeing the `LD_PRELOAD` approach above?).
I did give this a quick run on my system before sending it back, and the results were impressive for basic matrix multiplication and determinants, particularly compared to the default (non-parallel) BLAS. For OpenBLAS it depended more on how many CPU threads and how much memory were available to the CPU relative to your GPU, but notably it was never slower linking the GPU libraries (perhaps because the nvblas.conf file already links the OpenBLAS cpu libs as the fallback anyway). But it could use more testing; I haven't run this exact Dockerfile yet (or run in the RStudio mode), I was just running interactively on the machine...
Sorry to hear about the crashes. Frustrating.
My experience with "plugging BLAS in and out" is/was limited to systems others made that already supported it :) I.e. the Debian BLAS maintainer had this brilliant idea of using the interchangeable nature of BLAS/LAPACK along with the 'update-alternatives' mechanism of setting and adjusting softlinks to really make it swappable. We could lean on that scheme and try to fold NVIDIA's BLAS into it.
Otherwise `LD_PRELOAD` does the same: by rejigging the search order, you get your preferred BLAS in lieu of a default. So in that sense what you did here should do the trick.
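On Debian/Ubuntu, that alternatives scheme looks roughly like this (a sketch; the alternative name varies by release, and registering libnvblas as a candidate is an assumption, not something the distro packages set up for you):

```sh
# Show registered BLAS implementations and which softlink is active
update-alternatives --display libblas.so.3-x86_64-linux-gnu

# Register libnvblas as a candidate at some priority (hypothetical)
update-alternatives --install /usr/lib/x86_64-linux-gnu/libblas.so.3 \
    libblas.so.3-x86_64-linux-gnu /usr/local/cuda/lib64/libnvblas.so.9.0 50

# Flip the softlink over to it
update-alternatives --set libblas.so.3-x86_64-linux-gnu \
    /usr/local/cuda/lib64/libnvblas.so.9.0
```

One caveat worth noting: nvblas only implements the Level-3 routines itself and forwards everything else to the CPU library named in nvblas.conf, so that config file still matters even when it is installed as the system BLAS.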
Would be happy to do some benchmarking but would need some demo code to run, as BLAS etc. is all over my head :)
Roughly a hundred years ago I did just that in what is now this repo, using an existing R benchmark package / script. If memory serves, Colin's benchmarkme package uses the same. It all goes back to an original old script by Simon U. Can you start off with that?
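In practice that would boil down to something like this (a sketch; `benchmark_std()` is benchmarkme's standard matrix benchmark, and the preload paths repeat the assumptions from earlier in the thread):

```sh
# One-time install of the benchmark package
Rscript -e 'install.packages("benchmarkme")'

# Standard matrix benchmarks against the stock BLAS
Rscript -e 'print(benchmarkme::benchmark_std(runs = 3))'

# Same benchmarks with nvblas preloaded, for comparison
NVBLAS_CONFIG_FILE=/etc/nvblas.conf \
LD_PRELOAD=/usr/local/cuda/lib64/libnvblas.so.9.0 \
Rscript -e 'print(benchmarkme::benchmark_std(runs = 3))'
```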
Looks good!
I'm getting:

```
Error response from daemon: Dockerfile parse error line 92: unknown instruction: \NLD_PRELOAD=$CUDA_BLAS
```

when I run `docker build .`
```
me@mybox:~/test_docker/ml/cuda/base$ nvidia-smi
Wed Mar 13 10:18:27 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.145                Driver Version: 384.145                   |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  TITAN X (Pascal)    Off  | 00000000:03:00.0 Off |                  N/A |
|  0%   47C    P0    54W / 250W |      0MiB / 12188MiB |      2%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
```
```
me@mybox:~/test_docker/ml/cuda/base$ sudo docker version
Client:
 Version:      17.12.0-ce
 API version:  1.35
 Go version:   go1.9.2
 Git commit:   c97c6d6
 Built:        Wed Dec 27 20:11:19 2017
 OS/Arch:      linux/amd64

Server:
 Engine:
  Version:      17.12.0-ce
  API version:  1.35 (minimum version 1.12)
  Go version:   go1.9.2
  Git commit:   c97c6d6
  Built:        Wed Dec 27 20:09:53 2017
  OS/Arch:      linux/amd64
 Experimental: false
```
@restonslacker whoops, that was just a typo in the Dockerfile (apparently you can't escape a literal `!` while using double quotes for `$VARS`....) should be fixed now
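For the record, the gotcha is plain shell quoting rather than anything Docker-specific (a small demonstration; connecting it to the exact failing Dockerfile line is an assumption):

```sh
# Inside double quotes, backslash only escapes $, `, ", \ and newline,
# so "\!" keeps the literal backslash in the output:
echo "foo\!bar"    # prints: foo\!bar

# Single quotes avoid the problem entirely:
echo 'foo!bar'     # prints: foo!bar
```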
Hi, did you have any luck with the LD_PRELOAD approach and R? When I use this approach I can hardly engage the GPU.
This example should run on the GPU using our docker images (e.g. `rocker/ml`) with NVIDIA BLAS.
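Concretely, something along these lines (a sketch; `--gpus all` requires Docker 19.03+ and the NVIDIA container toolkit, and it assumes the image wraps `Rscript` with the LD_PRELOAD shim discussed above):

```sh
# Run a large dense multiply inside the image; with nvblas active,
# the underlying dgemm call should be routed to the GPU
docker run --rm --gpus all rocker/ml Rscript -e \
  'n <- 10000L; a <- matrix(rnorm(n * n), n); print(system.time(a %*% a))'

# In a second terminal, confirm the GPU is actually engaged
nvidia-smi -l 1
```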
Note that this is obviously hardware-dependent -- in particular, NVIDIA BLAS uses a configuration that enables a fall-back to CPU-BLAS if it decides the problem size is too large for the GPU. Also note that there's non-trivial overhead in moving the data from CPU to GPU, which can often swamp the time saved in the actual GPU-based computation.