link libnvblas?
libnvblas.so gets installed with the existing CUDA libraries. Apparently this can be enabled as the drop-in BLAS library for R, and is smart enough to let OpenBLAS handle things and only take over when it can provide significant acceleration (?)
EDIT
Haven't found great documentation on setup or performance, but it looks like this can be done as a one-off at runtime by setting LD_PRELOAD and configuring the fallback to OpenBLAS:
```sh
## create config file:
echo "NVBLAS_LOGFILE nvblas.log
NVBLAS_CPU_BLAS_LIB /usr/lib/libopenblas.so
NVBLAS_GPU_LIST ALL" > /etc/nvblas.conf
```
Run R with these env vars:

```sh
NVBLAS_CONFIG_FILE=/etc/nvblas.conf LD_PRELOAD=/usr/local/cuda/lib64/libnvblas.so.9.0 R
```
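One way to sanity-check that the preload is actually engaging the GPU (a sketch; the library path and `.so.9.0` version suffix are assumptions that depend on the CUDA install):

```sh
# Kick off a large dense multiply in the background with nvblas preloaded
NVBLAS_CONFIG_FILE=/etc/nvblas.conf \
LD_PRELOAD=/usr/local/cuda/lib64/libnvblas.so.9.0 \
Rscript -e 'n <- 8000L; a <- matrix(rnorm(n * n), n); invisible(a %*% a)' &

# Watch utilization while it runs; it should rise well above idle
nvidia-smi --query-gpu=utilization.gpu,memory.used --format=csv -l 1

# nvblas also records intercepted calls in the logfile named in nvblas.conf
cat nvblas.log
```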
Will have to benchmark a bit, but maybe worth adding this into our cuda/base setup @noamross?

> Will have to benchmark a bit
Seconded. We should definitely document that it is there, but I am not convinced it will always be a winner. Then again I am also often wrong when guessing :)
Yeah, it's not clear to me what the appropriate benchmark comparison is -- obviously the difference between a given operation on GPU vs CPU depends a lot on exactly what GPU vs what CPU you have on the platform.
That said, I imagine people will really only be deploying the `rocker/cuda` images on machines with significant GPUs available, if not on hardware explicitly optimized for GPU use (e.g. GPU-type instances on AWS). I do see some substantial improvement in low-level linear algebra operations; things like calculating a determinant can see a factor-of-10 speed-up. For typical R use I doubt a lot of operations would see gains like that, but then this image is already aimed at more specialized applications intended for GPU anyway.
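A micro-benchmark of this shape is one way to check numbers like that (a sketch; the matrix size and the exact speed-up are hardware-dependent assumptions):

```sh
# Time a dense determinant against the stock BLAS...
Rscript -e 'n <- 4000L; a <- matrix(rnorm(n * n), n); print(system.time(determinant(a)))'

# ...and again with libnvblas preloaded, using the config from above
NVBLAS_CONFIG_FILE=/etc/nvblas.conf \
LD_PRELOAD=/usr/local/cuda/lib64/libnvblas.so.9.0 \
Rscript -e 'n <- 4000L; a <- matrix(rnorm(n * n), n); print(system.time(determinant(a)))'
```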
Note in this experimental repo we have the cpu-based `rocker/ml` as well as the `rocker/ml-gpu`; only the latter builds on `rocker/cuda` and would thus get the GPU BLAS. Of course a lot of the specialized ML packages (xgboost, h2o, keras) are either already linking these libs (via their calls to python or java), or else doing other gpu-optimized algorithms, but having `rocker/cuda` support GPU BLAS out of the box could make it a useful image for users where the GPU linear algebra is useful in contexts wholly apart from the ML packages.
Agreed that we should benchmark, but in principle it seems a reasonable default for the `cuda`-based images. If you have an experimental fork with a script I'll get to it on our hardware, and maybe others (@MarkEdmondson1234) can give it a go, too?
@noamross Thanks!
Yes, I think I have an experimental version of this on the `nvblas` branch in `cuda/base/Dockerfile`. (Help testing would be great, since I just had to send my System76 desktop with my GPU back to the shop for weird crashing behavior :-( ).
So one thing is that I'm following NVIDIA's advice to use `LD_PRELOAD` instead of re-linking. Like they say, you don't want to set `LD_PRELOAD` globally, since then it would get set before every shell command run on the system, so I cribbed this approach to load it just before the `R`, `Rscript`, and `rserver` sessions:

https://github.com/rocker-org/ml/blob/87726cf095181d1340736f7e19ec8e8617132bdf/cuda/base/Dockerfile#L89-L105
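For readers who don't want to chase the link, the shape of that trick is roughly the following (a sketch, not the exact contents of the linked Dockerfile; the paths and the `CUDA_BLAS` variable are assumptions):

```sh
# Shadow the real launcher with a thin wrapper so LD_PRELOAD is set
# only for this one process, never globally
CUDA_BLAS=/usr/local/cuda/lib64/libnvblas.so.9.0

mv /usr/local/bin/R /usr/local/bin/R.real
cat > /usr/local/bin/R <<EOF
#!/bin/sh
NVBLAS_CONFIG_FILE=/etc/nvblas.conf LD_PRELOAD=$CUDA_BLAS exec /usr/local/bin/R.real "\$@"
EOF
chmod +x /usr/local/bin/R
# (repeat analogously for Rscript and rserver)
```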
I'm really not sure that's the best way to do this. If we're adding it to the library, it probably makes more sense to configure it directly as the system's BLAS, but I'd have to refresh on how to do that (particularly in a non-interactive session like the Dockerfile). @eddelbuettel has loads more experience with linking BLAS libraries and can probably give us some pointers (perhaps after recovering from the horror of seeing the `LD_PRELOAD` approach above?).
I did give this a quick run on my system before sending it back, and the results were impressive for basic matrix multiplication and determinants, particularly compared to the default (non-parallel) BLAS. For OpenBLAS it depended more on how many CPU threads and how much memory were available to the CPU relative to your GPU, but notably it was never slower linking the GPU libraries (perhaps because the nvblas.conf file already links the OpenBLAS cpu libs as the fallback anyway). But it could use more testing; I haven't run this exact Dockerfile yet (or run in the RStudio mode), I was just running interactively on the machine...
Sorry to hear about the crashes. Frustrating.
My experience with "plugging BLAS in and out" is/was limited to systems others made that already supported it :) I.e. the Debian BLAS maintainer had this brilliant idea of using the interchangeable nature of BLAS/LAPACK along with the 'update-alternatives' mechanism of setting and adjusting softlinks to really make it swappable. We could lean on that scheme and try to fold NVIDIA's BLAS into it.
Otherwise `LD_PRELOAD` does the same: by rejigging the search order, you get your preferred BLAS in lieu of a default. So in that sense what you did here should do the trick.
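On Debian/Ubuntu, that alternatives scheme looks roughly like this (a sketch; the alternative name varies by release, and registering libnvblas as a candidate is an assumption, not something the distro packages set up for you):

```sh
# Show registered BLAS implementations and which softlink is active
update-alternatives --display libblas.so.3-x86_64-linux-gnu

# Register libnvblas as a candidate at some priority (hypothetical)
update-alternatives --install /usr/lib/x86_64-linux-gnu/libblas.so.3 \
    libblas.so.3-x86_64-linux-gnu /usr/local/cuda/lib64/libnvblas.so.9.0 50

# Flip the softlink over to it
update-alternatives --set libblas.so.3-x86_64-linux-gnu \
    /usr/local/cuda/lib64/libnvblas.so.9.0
```

One caveat worth noting: nvblas only implements the Level-3 routines itself and forwards everything else to the CPU library named in nvblas.conf, so that config file still matters even when it is installed as the system BLAS.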
Would be happy to do some benchmarking but would need some demo code to run, as BLAS etc. is all over my head :)
Roughly a hundred years ago I did just that in what is now this repo, using an existing R benchmark package / script. If memory serves, Colin's benchmarkme package uses the same. It all goes back to an original old script by Simon U. Can you start off with that?
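In practice that would boil down to something like this (a sketch; `benchmark_std()` is benchmarkme's standard matrix benchmark, and the preload paths repeat the assumptions from earlier in the thread):

```sh
# One-time install of the benchmark package
Rscript -e 'install.packages("benchmarkme")'

# Standard matrix benchmarks against the stock BLAS
Rscript -e 'print(benchmarkme::benchmark_std(runs = 3))'

# Same benchmarks with nvblas preloaded, for comparison
NVBLAS_CONFIG_FILE=/etc/nvblas.conf \
LD_PRELOAD=/usr/local/cuda/lib64/libnvblas.so.9.0 \
Rscript -e 'print(benchmarkme::benchmark_std(runs = 3))'
```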
Looks good!
I'm getting:

```
Error response from daemon: Dockerfile parse error line 92: unknown instruction: \NLD_PRELOAD=$CUDA_BLAS
```

when I run `docker build .`
```
me@mybox:~/test_docker/ml/cuda/base$ nvidia-smi
Wed Mar 13 10:18:27 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.145                Driver Version: 384.145                   |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  TITAN X (Pascal)    Off  | 00000000:03:00.0 Off |                  N/A |
|  0%   47C    P0    54W / 250W |      0MiB / 12188MiB |      2%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
```
```
me@mybox:~/test_docker/ml/cuda/base$ sudo docker version
Client:
 Version:      17.12.0-ce
 API version:  1.35
 Go version:   go1.9.2
 Git commit:   c97c6d6
 Built:        Wed Dec 27 20:11:19 2017
 OS/Arch:      linux/amd64

Server:
 Engine:
  Version:      17.12.0-ce
  API version:  1.35 (minimum version 1.12)
  Go version:   go1.9.2
  Git commit:   c97c6d6
  Built:        Wed Dec 27 20:09:53 2017
  OS/Arch:      linux/amd64
 Experimental: false
```
@restonslacker whoops, that was just a typo in the Dockerfile (apparently you can't escape a literal `!` while using double quotes for `$VARS`....) should be fixed now
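For the record, the gotcha is plain shell quoting rather than anything Docker-specific (a small demonstration; connecting it to the exact failing Dockerfile line is an assumption):

```sh
# Inside double quotes, backslash only escapes $, `, ", \ and newline,
# so "\!" keeps the literal backslash in the output:
echo "foo\!bar"    # prints: foo\!bar

# Single quotes avoid the problem entirely:
echo 'foo!bar'     # prints: foo!bar
```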
Hi, did you have any luck with the LD_PRELOAD approach and R? When I use this approach I can hardly engage the GPU.
This example should run on the GPU using our docker images (e.g. `rocker/ml`) with NVIDIA BLAS.
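Concretely, something along these lines (a sketch; `--gpus all` requires Docker 19.03+ and the NVIDIA container toolkit, and it assumes the image wraps `Rscript` with the LD_PRELOAD shim discussed above):

```sh
# Run a large dense multiply inside the image; with nvblas active,
# the underlying dgemm call should be routed to the GPU
docker run --rm --gpus all rocker/ml Rscript -e \
  'n <- 10000L; a <- matrix(rnorm(n * n), n); print(system.time(a %*% a))'

# In a second terminal, confirm the GPU is actually engaged
nvidia-smi -l 1
```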
Note that this is obviously hardware-dependent -- in particular, NVIDIA BLAS uses a configuration that enables a fall-back to CPU-BLAS if it decides the problem size is too large for the GPU. Also note that there's non-trivial overhead in moving the data from CPU to GPU, which can often swamp the time saved in the actual GPU-based computation.