
Add support for libcudart.so for CUDA devices (Adds Jetson support)

Open remy415 opened this issue 2 years ago • 33 comments

Added libcudart.so support to gpu.go for CUDA devices that are missing libnvidia-ml.so. The CUDA libraries are split into NVML (libnvidia-ml.so) and the CUDA runtime (libcudart.so); detection can now work with either. Tested on a Jetson device and on Windows 11 under WSL2.

Devices used to test:

  • Jetson Orin Nano 8 GB, Jetpack 5.1.2, L4T 35.4.1
  • CUDA 11.8, CUDA Compute Capability 8.7 supported
  • Go 1.26.1, CMake 3.28.1, nvcc 11.8.89

  • AMD Ryzen 3950X, NVIDIA RTX 3090 Ti
  • WSL2 running Ubuntu 22.04
  • WSL CUDA Toolkit v12.3 installed

Edited for updates

remy415 avatar Jan 30 '24 16:01 remy415

Resolves #1979

remy415 avatar Jan 30 '24 16:01 remy415

@dhiltgen I don't know if you're the right contact for this, but I'm having issues getting the correct memory amounts for GetGPUInfo() on Jetsons. Since they are iGPUs, the memory is shared with the system (8 GB in my case). The free memory reported by cudaGetMem and by Sysinfo isn't necessarily correct either, because Jetsons use a portion of RAM as a flexible cache. There is a semi-accurate way to get "available memory", but the only decent way I've seen to get that information is to run free -m or to read /proc/meminfo, since the kernel does some fancy maths to produce a semi-accurate representation of available memory.

(screenshot of `free -m` output showing the 'buff/cache' and 'available' columns)

The 'buff/cache' and 'available' fields aren't reported by sysinfo (or cudaGetMem); even the /usr/bin/free binary does an fopen() call on /proc/meminfo. For now I'm reporting the greater of cudaGetMem and sysinfo free memory as the current "free memory".

I read that the "available memory" field is considered the best guess for actual available memory, according to the git commit notes for meminfo.c. However, it requires parsing /proc/meminfo or calling /usr/bin/free, which does the same thing.

Do you have any ideas for the best way to report this information to the application? I tried putting in some overhead but the Jetson kept falling back to CPU due to memory even though there was extra memory available in the cache.
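For reference, a minimal sketch of pulling the kernel's estimate directly from /proc/meminfo (this is the same figure `free -m` prints in its "available" column; the MemAvailable field exists on kernel 3.14 and later):

```shell
# Extract the kernel's MemAvailable estimate (in kB) from a meminfo-format file.
# Defaults to /proc/meminfo; accepts an alternate path to ease testing.
mem_available_kb() {
    awk '/^MemAvailable:/ { print $2 }' "${1:-/proc/meminfo}"
}
```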

remy415 avatar Feb 04 '24 06:02 remy415

Changed this to a draft while working on memory issues.

remy415 avatar Feb 04 '24 06:02 remy415

@dhiltgen I think this version meets the criteria for step #1, what do you think?

remy415 avatar Feb 12 '24 21:02 remy415

I have tested this PR on the following device:

Device used to test: Jetson AGX Orin Developer Kit 64GB Jetpack 6.0DP, L4T 36.2.0 CUDA 12.2.140 CUDA Capability Supported 8.7 Go version 1.21.6 Cmake 3.22.1 nvcc 12.2.140

CUDA libraries are detected and used, and generation runs 100% on the GPU. After installation to /usr/local/bin/ollama there were permission issues when starting it as a service under the ollama user. I don't think that has anything to do with the code on this branch, though. Still looking into it in issue #1979.

jhkuperus avatar Feb 15 '24 17:02 jhkuperus

I propose a change to the file scripts/install.sh to make sure the ollama user is also added to the video group. On my Jetson, the system service needed this to be able to use the CUDA cores.

On line 87, where the ollama user is added to the render group, I propose we add these lines:

    if getent group video >/dev/null 2>&1; then
        status "Adding ollama user to video group..."
        $SUDO usermod -a -G video ollama
    fi

jhkuperus avatar Feb 15 '24 18:02 jhkuperus

> I propose a change to the file scripts/install.sh to make sure the ollama user is also added to the video group. On my Jetson, the system service needed this to be able to use the CUDA cores.
>
> On line 87, where the ollama user is added to the render group, I propose we add these lines:
>
>     if getent group video >/dev/null 2>&1; then
>         status "Adding ollama user to video group..."
>         $SUDO usermod -a -G video ollama
>     fi

I just checked my own jetson deployment and the service for it, and I ran into the same issue with my Jetson. For some reason, it has both a render and a video group, and the service didn't work until the ollama user was added to the video group. I'll add logic for it in the script in my PR as part of the Jetson compatibility.
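As a sketch of how that logic might generalize in install.sh (the helper name is illustrative; the actual script inlines this per group):

```shell
# Add a service user to each GPU-related group that actually exists on the
# system. Prints what it would do; the commented usermod line is what the
# install script would run via $SUDO.
add_user_to_gpu_groups() {
    user="$1"; shift
    for grp in "$@"; do
        if getent group "$grp" >/dev/null 2>&1; then
            echo "Adding $user user to $grp group..."
            # $SUDO usermod -a -G "$grp" "$user"
        fi
    done
}

# Example: add_user_to_gpu_groups ollama render video
```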

remy415 avatar Feb 15 '24 18:02 remy415

I'm rewriting the NVIDIA-Jetson tutorial to match the situation after your PR is applied. I'll add it as a Gist here to see if we can also add that to the PR.

jhkuperus avatar Feb 15 '24 18:02 jhkuperus

> I'm rewriting the NVIDIA-Jetson tutorial to match the situation after your PR is applied. I'll add it as a Gist here to see if we can also add that to the PR.

I've automated most of the build process for getting this to work on Jetson: if your lib paths are properly set, you should be able to just pull it, run go generate ./... && go build ., and be ready to go. I'm waiting for feedback from @dhiltgen before asking whether their backend build process supports the Jetson-specific changes, or whether there needs to be a Jetson-specific binary on top of macOS/Windows/WSL/Linux_x64/Linux_aarch64. The main driver is that the standard CUDA build adds AVX/AVX2 support, but AVX/AVX2 breaks ARM compatibility. At the same time, do we really want the additional overhead of shipping both a "CUDA with AVX/AVX2" and a "CUDA without AVX/AVX2" build by default?

remy415 avatar Feb 15 '24 18:02 remy415

@remy415 thanks! I'll try to take a look within the next few days. (I've been a bit distracted with the imminent Windows release)

dhiltgen avatar Feb 15 '24 18:02 dhiltgen

> @remy415 thanks! I'll try to take a look within the next few days. (I've been a bit distracted with the imminent Windows release)

Oh I completely understand, no rush from my side. Thank you for your help and support!

remy415 avatar Feb 15 '24 18:02 remy415

@remy415 : Here's a suggestion to replace the docs/tutorials/nvidia-jetson.md file: https://github.com/jhkuperus/ollama/blob/edefca7ef3b1b13a8a60744b4511c48dd6e1b396/docs/tutorials/nvidia-jetson.md

jhkuperus avatar Feb 15 '24 19:02 jhkuperus

> @remy415 : Here's a suggestion to replace the docs/tutorials/nvidia-jetson.md file: https://github.com/jhkuperus/ollama/blob/edefca7ef3b1b13a8a60744b4511c48dd6e1b396/docs/tutorials/nvidia-jetson.md

Thank you for writing that up. I would advise on a couple of things:

  1. This PR is the first of 3 steps toward loading the prepackaged shared libraries instead of querying the host. Once that is accomplished, the tutorial will be outdated.
  2. On Jetson devices, the CUDA toolkit is preinstalled. Also, the method for updating requires adding the Jetson-specific NVIDIA repos. This will likely change again once JP6 is officially released.

remy415 avatar Feb 15 '24 19:02 remy415

@dhiltgen My apologies for the giant commit spam on this; I'm trying to keep my branch updated with ollama main while integrating the libcudart changes.

I think this commit may fulfill the objective of adding libcudart support. Jetson users will possibly need to include environment variables at build time, but given the nature of Jetson devices as development boards, I believe they should be equipped to do so anyway. I also included logic in gen_linux.sh to disable AVX extensions in the CUDA build when the architecture is arm64, since those chips generally don't support them.
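A minimal sketch of that arch gate (the flag names are llama.cpp's CMake options; the exact shape inside gen_linux.sh may differ):

```shell
# Choose CMake AVX flags based on the machine architecture: arm64 chips lack
# AVX/AVX2, and compiling those extensions in breaks the build there.
arch_avx_defs() {
    case "${1:-$(uname -m)}" in
        aarch64|arm64) echo "-DLLAMA_AVX=off -DLLAMA_AVX2=off" ;;
        *)             echo "-DLLAMA_AVX=on -DLLAMA_AVX2=on" ;;
    esac
}
```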

remy415 avatar Feb 20 '24 17:02 remy415

@remy415 let me know when you think this is in pretty good shape and I'll take another review pass through.

dhiltgen avatar Feb 26 '24 17:02 dhiltgen

@dhiltgen I think it's in a pretty good place for step 1 of getting the libcudart support integrated. The only thing is that it needs to be tested on machines with multiple GPUs to see if the meminfo lookup works correctly on those systems (and I'd need to swap the priority for nvml/cudart, as I put it back to loading nvml first).

remy415 avatar Feb 27 '24 14:02 remy415

I pulled the newest version of this PR, ran go clean, go generate ./.. and go build again, but the new binary crashes when loading any model. Here's the first bit of logging where it segfaults:

Mar 06 09:32:13 yoinkee-1 ollama[205621]: time=2024-03-06T09:32:13.254+01:00 level=INFO source=cpu_common.go:18 msg="CPU does not have vector extensions"
Mar 06 09:32:13 yoinkee-1 ollama[205621]: time=2024-03-06T09:32:13.255+01:00 level=INFO source=gpu.go:200 msg="[libcudart.so] CUDART CUDA Compute Capability detected: 8.7"
Mar 06 09:32:13 yoinkee-1 ollama[205621]: time=2024-03-06T09:32:13.255+01:00 level=INFO source=cpu_common.go:18 msg="CPU does not have vector extensions"
Mar 06 09:32:13 yoinkee-1 ollama[205621]: time=2024-03-06T09:32:13.255+01:00 level=INFO source=gpu.go:200 msg="[libcudart.so] CUDART CUDA Compute Capability detected: 8.7"
Mar 06 09:32:13 yoinkee-1 ollama[205621]: time=2024-03-06T09:32:13.255+01:00 level=INFO source=cpu_common.go:18 msg="CPU does not have vector extensions"
Mar 06 09:32:13 yoinkee-1 ollama[205621]: time=2024-03-06T09:32:13.311+01:00 level=INFO source=dyn_ext_server.go:90 msg="Loading Dynamic llm server: /tmp/ollama2209804735/cuda_v12/libext_server.so"
Mar 06 09:32:13 yoinkee-1 ollama[205621]: time=2024-03-06T09:32:13.311+01:00 level=INFO source=dyn_ext_server.go:150 msg="Initializing llama server"
Mar 06 09:32:13 yoinkee-1 ollama[205621]: SIGSEGV: segmentation violation
Mar 06 09:32:13 yoinkee-1 ollama[205621]: PC=0xfffefc285928 m=16 sigcode=1
Mar 06 09:32:13 yoinkee-1 ollama[205621]: signal arrived during cgo execution
Mar 06 09:32:13 yoinkee-1 ollama[205621]: goroutine 38 [syscall]:
Mar 06 09:32:13 yoinkee-1 ollama[205621]: runtime.cgocall(0x944740, 0x40004ce698)
Mar 06 09:32:13 yoinkee-1 ollama[205621]:         /usr/local/go/src/runtime/cgocall.go:157 +0x44 fp=0x40004ce660 sp=0x40004ce620 pc=0x407e94
Mar 06 09:32:13 yoinkee-1 ollama[205621]: github.com/jmorganca/ollama/llm._Cfunc_dyn_llama_server_init({0xfffed4003f50, 0xfffefc285550, 0xfffefc277b40, 0xfffefc279100, 0xfffefc294204, 0xfffefc282ba4, 0xfffefc2>
Mar 06 09:32:13 yoinkee-1 ollama[205621]:         _cgo_gotypes.go:286 +0x30 fp=0x40004ce690 sp=0x40004ce660 pc=0x775810
Mar 06 09:32:13 yoinkee-1 ollama[205621]: github.com/jmorganca/ollama/llm.newDynExtServer.func7(0xa77804?, 0xc?)
Mar 06 09:32:13 yoinkee-1 ollama[205621]:         /mnt/data/forked_ollama/ollama/llm/dyn_ext_server.go:153 +0xe0 fp=0x40004ce780 sp=0x40004ce690 pc=0x776a60
Mar 06 09:32:13 yoinkee-1 ollama[205621]: github.com/jmorganca/ollama/llm.newDynExtServer({0x40004a8db0, 0x2f}, {0x400059f3b0, 0x65}, {0x0, 0x0, _}, {_, _, _}, ...)
Mar 06 09:32:13 yoinkee-1 ollama[205621]:         /mnt/data/forked_ollama/ollama/llm/dyn_ext_server.go:153 +0x904 fp=0x40004cea20 sp=0x40004ce780 pc=0x776774
Mar 06 09:32:13 yoinkee-1 ollama[205621]: github.com/jmorganca/ollama/llm.newLlmServer({{0xf5796b000, 0xcdbc03000, 0x1}, {_, _}, {_, _}}, {_, _}, {_, ...}, ...)
Mar 06 09:32:13 yoinkee-1 ollama[205621]:         /mnt/data/forked_ollama/ollama/llm/llm.go:158 +0x308 fp=0x40004cebe0 sp=0x40004cea20 pc=0x7737e8
Mar 06 09:32:13 yoinkee-1 ollama[205621]: github.com/jmorganca/ollama/llm.New({0x4000304900, 0x15}, {0x400059f3b0, 0x65}, {0x0, 0x0, _}, {_, _, _}, ...)
Mar 06 09:32:13 yoinkee-1 ollama[205621]:         /mnt/data/forked_ollama/ollama/llm/llm.go:123 +0x4fc fp=0x40004cee60 sp=0x40004cebe0 pc=0x77332c
Mar 06 09:32:13 yoinkee-1 ollama[205621]: github.com/jmorganca/ollama/server.load(0x4000179680?, 0x4000179680, {{0x0, 0x800, 0x200, 0x1, 0xffffffffffffffff, 0x0, 0x0, 0x1, ...}, ...}, ...)
Mar 06 09:32:13 yoinkee-1 ollama[205621]:         /mnt/data/forked_ollama/ollama/server/routes.go:85 +0x308 fp=0x40004cefe0 sp=0x40004cee60 pc=0x922f28
Mar 06 09:32:13 yoinkee-1 ollama[205621]: github.com/jmorganca/ollama/server.ChatHandler(0x400058c400)
Mar 06 09:32:13 yoinkee-1 ollama[205621]:         /mnt/data/forked_ollama/ollama/server/routes.go:1175 +0x8fc fp=0x40004cf720 sp=0x40004cefe0 pc=0x92cbdc
Mar 06 09:32:13 yoinkee-1 ollama[205621]: github.com/gin-gonic/gin.(*Context).Next(...)
Mar 06 09:32:13 yoinkee-1 ollama[205621]:         /root/go/pkg/mod/github.com/gin-gonic/[email protected]/context.go:174
Mar 06 09:32:13 yoinkee-1 ollama[205621]: github.com/jmorganca/ollama/server.(*Server).GenerateRoutes.func1(0x400058c400)
Mar 06 09:32:13 yoinkee-1 ollama[205621]:         /mnt/data/forked_ollama/ollama/server/routes.go:945 +0x78 fp=0x40004cf760 sp=0x40004cf720 pc=0x92b638
Mar 06 09:32:13 yoinkee-1 ollama[205621]: github.com/gin-gonic/gin.(*Context).Next(...)
Mar 06 09:32:13 yoinkee-1 ollama[205621]:         /root/go/pkg/mod/github.com/gin-gonic/[email protected]/context.go:174

Not sure what the problem is though.

jhkuperus avatar Mar 06 '24 08:03 jhkuperus

@jhkuperus Which Jetson device do you have, and which Jetpack version are you running? I haven't tested it with JP6 yet. If you're not running JP6, you should clean your installation, as JP5 and below are not compatible with CUDA 12. Ensure you aren't installing any new video drivers or CUDA toolkit software other than the 11.8 compatibility packages referenced on the Tegra support pages (it should work without those, though).

In the meantime I'll run some tests and see if I messed up this last merge.

remy415 avatar Mar 06 '24 12:03 remy415

@jhkuperus I executed a fresh build on my Jetson Orin Nano running JP5 and didn't run into the same segfault issue you had. I would guess your issue is either something present in JP6 that I haven't tested yet, or something like the JP5 + CUDA12 issue I referenced earlier.

remy415 avatar Mar 06 '24 14:03 remy415

Model: NVIDIA Jetson AGX Orin Developer Kit - Jetpack 6.0 DP [L4T 36.2.0]

  • Module: NVIDIA Jetson AGX Orin (64 GB RAM)

Libraries:
  • CUDA: 12.2.140
  • cuDNN: 8.9.4.25
  • TensorRT: 8.6.2.3
  • VPI: 3.0.10
  • Vulkan: 1.3.204
  • OpenCV: 4.8.0 - with CUDA: NO

I tested an earlier build from this PR a week or two ago, and that one works just fine. Tell me if running it with some sort of verbose flags or debugging will help you find out what the problem is. I'll gladly help.

jhkuperus avatar Mar 06 '24 16:03 jhkuperus

@jhkuperus okay, so you are running JP6. It's really odd that an earlier build works when this doesn't, as the only substantial changes I've made are syncing with upstream. When you say you pulled the newest version, did you clone into a fresh folder, or did you git pull on top of the existing one? Try deleting the entire ollama folder and cloning it again.

Additionally, if you’re running it manually with ./ollama serve, all you need to do is export OLLAMA_DEBUG=1 to turn on verbose debugging.

Alternatively, JP6 is supposed to include libnvidia-ml.so support according to dustynv. I haven’t looked through it myself but if that’s true then the binary straight from Ollama may work.

remy415 avatar Mar 06 '24 16:03 remy415

I removed the entire repo, checked it out anew, and ran the build again. It's working, and I also know where my snag was: I ran `go generate ./..` instead of `go generate ./...`. I now have a version that is also capable of running Gemma. Forget my segfault notice earlier; the branch is still working fine under JP6 with CUDA 12.

jhkuperus avatar Mar 06 '24 17:03 jhkuperus

@dhiltgen I think I may have figured out the issue with having to set specific architectures on Jetson devices.

It may be related to an upstream llama.cpp issue fixed here, and somewhat explained here:

> There are a couple patches applied to the legacy GGML fork: fixed __fp16 typedef in llama.h on ARM64 (use half with NVCC); parsing of BOS/EOS tokens (see https://github.com/ggerganov/llama.cpp/pull/1931)

Seems like the issue is related to the __fp16 typedef in llama.h on ARM64 platforms. I confirmed this (somewhat) by unsetting CMAKE_CUDA_ARCHITECTURES (thus compiling the defaults) and setting -DLLAMA_CUDA_F16=off: it compiled (though it took quite a bit longer than previously). I think either turning off the F16 CUDA option in gen_linux.sh or including the referenced patch from dustynv would shore this issue up and reduce the overall changes. I'll defer to your judgement on this one.

remy415 avatar Mar 06 '24 18:03 remy415

Update:

I've done more digging into the F16 issue. I'm not sure why my particular compiler is having this issue, but it would seem the crux of the problem is somewhat alluded to in a NVidia CUDA-8 Mixed Precision Guide:

> CUFFT: FP16 computation requires a GPU with Compute Capability 5.3 or later (Maxwell architecture).

And on another GitHub issues thread, they reference that this is because when a graphics card's SM version is less than 6.0, it doesn't support fp16. I'm not sure that's 100% accurate, but it would seem that sm < 6.0 has shaky fp16 support at best.

Also, the NVIDIA CUDA Programming Guide has some relevant references:

> The 32-bit __half2 floating-point version of atomicAdd() is only supported by devices of compute capability 6.x and higher. The 16-bit __half floating-point version of atomicAdd() is only supported by devices of compute capability 7.x and higher.

Additionally, when I tried to compile with LLAMA_CUDA_F16=on and CMAKE_CUDA_ARCHITECTURES="50;52;61;70;75;80", I receive this error:

    /home/tegra/ok3d/ollama-container/dev/ollama/llm/llama.cpp/ggml-cuda.cu(6324): error: more than one conversion function from "__half" to a built-in type applies:
            function "__half::operator float() const"
/usr/local/cuda/targets/aarch64-linux/include/cuda_fp16.hpp(204): here
            function "__half::operator short() const"
/usr/local/cuda/targets/aarch64-linux/include/cuda_fp16.hpp(222): here
            function "__half::operator unsigned short() const"
/usr/local/cuda/targets/aarch64-linux/include/cuda_fp16.hpp(225): here
            function "__half::operator int() const"
/usr/local/cuda/targets/aarch64-linux/include/cuda_fp16.hpp(228): here
            function "__half::operator unsigned int() const"
/usr/local/cuda/targets/aarch64-linux/include/cuda_fp16.hpp(231): here
            function "__half::operator long long() const"
/usr/local/cuda/targets/aarch64-linux/include/cuda_fp16.hpp(234): here
            function "__half::operator unsigned long long() const"
/usr/local/cuda/targets/aarch64-linux/include/cuda_fp16.hpp(237): here
            function "__half::operator __nv_bool() const"
/usr/local/cuda/targets/aarch64-linux/include/cuda_fp16.hpp(241): here

This error goes away if I change CMAKE_CUDA_ARCHITECTURES="61;70;75;80" and everything works swimmingly.

@dhiltgen do you know why -DLLAMA_CUDA_FORCE_MMQ=on was enabled? I was under the impression it was preferred not to force MMQ, which would allow Tensor cores to be used. Compile and GPU execution worked fine either way for me; I haven't tested performance, though.

remy415 avatar Mar 06 '24 21:03 remy415

It also seems like llama.cpp upstream changed the way they include __half support for ARM devices several times over the last few months: an Aug 2023 PR, a Sep 2023 commit, a Jan 2024 issue, and a Jan 2024 PR with this quote:

> Jan 20, 2024: Thanks for the discussion - IMO the fundamental issue is that ggml_fp16_t is exposed through the public ggml API in the first place. It's something to fix in the future, but for now will merge this workaround

plus a Feb 2024 commit.

remy415 avatar Mar 06 '24 21:03 remy415

> This error goes away if I change CMAKE_CUDA_ARCHITECTURES="61;70;75;80" and everything works swimmingly.

Just to clarify: if the CUDA architecture list includes "50;52", setting LLAMA_CUDA_F16=off allows it to compile. CMAKE_CUDA_ARCHITECTURES="61;70;75;80" properly supports LLAMA_CUDA_F16.
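That pattern matches the compute-capability rule from the CUDA docs quoted earlier; a hedged sketch of the check (the helper name is mine, not part of the build scripts):

```shell
# fp16 intrinsics need compute capability >= 6.0 (sm_60) per the guides quoted
# above, so F16 is only safe when the lowest architecture in the semicolon-
# separated list clears that bar.
cuda_f16_ok() {
    min_arch=$(printf '%s\n' "$1" | tr ';' '\n' | sort -n | head -n 1)
    [ "$min_arch" -ge 60 ]
}
```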

remy415 avatar Mar 06 '24 21:03 remy415

> do you know why -DLLAMA_CUDA_FORCE_MMQ=on was enabled?

This was needed to add support for older GPUs and, based on the testing we did at the time, it didn't seem to have a major performance impact on newer GPUs.

For Jetson support Compute Capability 5.0 support isn't relevant as far as I know, so this flag can be omitted.

dhiltgen avatar Mar 06 '24 22:03 dhiltgen

Okay, so I guess the takeaway is: for ARM-based CUDA builds, leave the architectures at default and disable F16 support, and it should be golden.

remy415 avatar Mar 06 '24 22:03 remy415

@dhiltgen I merged the PR with the latest Ollama release removing most of the AMD code from 'gpu.go'. I tested the build on my ARM and WSL+CUDA setups, and it looks like it's still good to go.

I also adjusted the memory overhead section to better align with the original code by setting the overhead to 0 when the L4T env variable is detected. I know it would be preferred to leave in an overhead buffer, but the L4T OS's automatic caching skews the reported free memory (it is reported lower than it actually is), so sometimes when loading 7B models the reported free memory is less than the model loader needs and it falls back to CPU. Maybe we can add a flag to let the user manually disable the overhead buffer, leaving it on by default? Not sure how to handle this.
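One possible shape for that knob, sketched in shell for clarity (both the OLLAMA_GPU_OVERHEAD variable and the default value here are hypothetical illustrations, not existing Ollama behavior):

```shell
# Pick the GPU memory overhead buffer: an explicit user override wins, iGPU
# boards (detected here via a hypothetical L4T marker variable) get 0, and
# everything else gets a default reserve.
gpu_overhead_bytes() {
    if [ -n "${OLLAMA_GPU_OVERHEAD+x}" ]; then
        echo "$OLLAMA_GPU_OVERHEAD"
    elif [ -n "${L4T:-}" ]; then
        echo 0
    else
        echo $((384 * 1024 * 1024))   # example default, not the real figure
    fi
}
```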

Anyway, other than that the PR is ready for review with the latest release merged.

remy415 avatar Mar 08 '24 16:03 remy415

Great job!


davidtheITguy avatar Mar 08 '24 16:03 davidtheITguy