
using CUDA when both CPU and Cuda12 back-ends are present.

vvdb-architecture opened this issue 1 year ago

I'm using Kernel-memory with LLamaSharp. Despite having a RTX 3080 and the latest CUDA drivers installed, CUDA is not used.

Not sure if this is a bug or I'm missing something, so here's a question instead:

The LlamaSharp.csproj contains

 <PackageReference Include="LLamaSharp.Backend.Cpu"/>
 <PackageReference Include="LLamaSharp.Backend.Cuda12"/>

I found out that if both the Cpu and Cuda12 back-ends are referenced, only the CPU is used even though the CUDA DLL is loaded. Interestingly, the logs do say that the CUDA back-end is loaded, but CUDA is not actually used.

[LLamaSharp Native] [Info] Detected OS Platform: WINDOWS
[LLamaSharp Native] [Info] Detected cuda major version 12.
[LLamaSharp Native] [Info] ./runtimes/win-x64/native/cuda12/libllama.dll is selected and loaded successfully.
llama_model_loader: loaded meta data with 19 key-value pairs and 291 tensors from D:\Source\km\Models\llama-2-7b-chat.Q5_K_M.gguf (version GGUF V2)
...

If I remove the reference to LLamaSharp.Backend.Cpu, then the CUDA back-end will start to be used. The logs show:

[LLamaSharp Native] [Info] Detected OS Platform: WINDOWS
[LLamaSharp Native] [Info] Detected cuda major version 12.
[LLamaSharp Native] [Info] ./runtimes/win-x64/native/cuda12/libllama.dll is selected and loaded successfully.
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3080, compute capability 8.6
llama_model_loader: loaded meta data with 19 key-value pairs and 291 tensors from D:\Source\km\Models\llama-2-7b-chat.Q5_K_M.gguf (version GGUF V2)
...

I've reported this to the kernel memory project, but was advised to report this here.

vvdb-architecture avatar Jan 24 '24 05:01 vvdb-architecture

It seems that the CPU and CUDA back-ends can't be installed at the same time.

If this is by design, the issue can be closed, but since I don't know if this is by design, I'll leave the issue open for others to comment.

vvdb-architecture avatar Jan 26 '24 15:01 vvdb-architecture

Originally they weren't meant to be installed together, since it then wasn't clear which binaries should be used. However, we now have runtime detection, which should probe your system and load the best binaries possible.

In your case that looks like it is working, since the logs say:

[LLamaSharp Native] [Info] ./runtimes/win-x64/native/cuda12/libllama.dll is selected and loaded successfully.

But then for some reason it isn't actually using your GPU! I think this probably is a real bug.

martindevans avatar Jan 26 '24 15:01 martindevans

Hi @vvdb-architecture, if it is not using your GPU even with the CUDA backend, do you have GpuLayerCount in your ModelParams set to -1 or 1-33? If it is not set, or is set to 0, it will default to CPU-only, even with just the CUDA backend installed. Sorry if I misunderstand your problem, but this may help other users if they have that issue:

(screenshot attached)
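For reference, here's a minimal sketch of what that looks like in code (the model path and layer count are just example values):

using LLama;
using LLama.Common;

// Offload layers to the GPU; leaving GpuLayerCount unset (or 0) keeps everything on the CPU.
var parameters = new ModelParams("llama-2-7b-chat.Q5_K_M.gguf")
{
    GpuLayerCount = 33 // e.g. all 33 offloadable layers of a 7B model; adjust to your GPU memory
};
using var model = LLamaWeights.LoadFromFile(parameters);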

adammikulis avatar Feb 22 '24 18:02 adammikulis

Hi @vvdb-architecture, if it is not using your GPU even with the CUDA backend, do you have GpuLayerCount in your ModelParams set to -1 or 1-33? If it is not set, or is set to 0, it will default to CPU-only, even with just the CUDA backend installed. Sorry if I misunderstand your problem, but this may help other users if they have that issue: (screenshot attached)

It's set to 33.

vvdb-architecture avatar Feb 23 '24 04:02 vvdb-architecture

I think this issue can be closed, since the docs explicitly state you can only install one of the back-ends.

vvdb-architecture avatar Feb 23 '24 05:02 vvdb-architecture

@vvdb-architecture Sorry for seeing this issue late. It's my responsibility to resolve this problem, since I wrote the main part of the dynamic loading of the native library. #588 is also a duplicate of this issue.

since the docs explicitly states you can only install one of the back-ends

Yes, but the documentation has been outdated for a long time. It still describes v0.5.0, while we are already on v0.11.0. I stated that restriction in the documentation because dynamic loading was not supported in v0.5.0.

LLamaSharp is expected to work with multiple backend packages in the current version, so I'll re-open this issue and dig into it. Thank you for the reminder in #589!

AsakusaRinne avatar Mar 11 '24 11:03 AsakusaRinne

Hello, is there any news on this issue?

I'm encountering a similar issue. I have installed both LLamaSharp.Backend.Cpu and LLamaSharp.Backend.Cuda12.Windows (0.18.0 versions). Following the README, I added the following line to show which native library file is loaded:

NativeLibraryConfig.Instance.WithLogCallback(delegate (LLamaLogLevel level, string message) { Console.Write($"{level}: {message}"); });

When I load a model on the CPU with GpuLayerCount set to 0, the CUDA backend is loaded:

- PreferCuda: True
- PreferVulkan: True
- PreferredAvxLevel: AVX2
- AllowFallback: True
- SkipCheck: False
- SearchDirectories and Priorities: { ./ }
Info: NativeLibraryConfig Description:
- LibraryName: LLama
- Path: ''
- PreferCuda: True
- PreferVulkan: True
- PreferredAvxLevel: AVX2
- AllowFallback: True
- SkipCheck: False
- SearchDirectories and Priorities: { ./ }
Debug: Got relative library path 'runtimes/win-x64/native/cuda12/llama.dll' from local with (NativeLibraryName: LLama, UseCuda: True, UseVulkan: False, AvxLevel: None), trying to load it...
Debug: Found full path file './runtimes/win-x64/native/cuda12/llama.dll' for relative path 'runtimes/win-x64/native/cuda12/llama.dll'
Info: Successfully loaded './runtimes/win-x64/native/cuda12/llama.dll'

When I only install LLamaSharp.Backend.Cpu, the correct native library file is loaded:

- PreferCuda: True
- PreferVulkan: True
- PreferredAvxLevel: AVX2
- AllowFallback: True
- SkipCheck: False
- SearchDirectories and Priorities: { ./ }
Debug: Got relative library path 'runtimes/win-x64/native/cuda12/llama.dll' from local with (NativeLibraryName: LLama, UseCuda: True, UseVulkan: False, AvxLevel: None), trying to load it...
Debug: Found full path file 'runtimes/win-x64/native/cuda12/llama.dll' for relative path 'runtimes/win-x64/native/cuda12/llama.dll'
Info: Failed Loading 'runtimes/win-x64/native/cuda12/llama.dll'
Debug: Got relative library path 'runtimes/win-x64/native/vulkan/llama.dll' from local with (NativeLibraryName: LLama, UseCuda: False, UseVulkan: True, AvxLevel: None), trying to load it...
Debug: Found full path file 'runtimes/win-x64/native/vulkan/llama.dll' for relative path 'runtimes/win-x64/native/vulkan/llama.dll'
Info: Failed Loading 'runtimes/win-x64/native/vulkan/llama.dll'
Debug: Got relative library path 'runtimes/win-x64/native/avx2/llama.dll' from local with (NativeLibraryName: LLama, UseCuda: False, UseVulkan: False, AvxLevel: Avx2), trying to load it...
Debug: Found full path file './runtimes/win-x64/native/avx2/llama.dll' for relative path 'runtimes/win-x64/native/avx2/llama.dll'
Info: Successfully loaded './runtimes/win-x64/native/avx2/llama.dll'

clarinevong avatar Oct 24 '24 14:10 clarinevong

When I load a model on the CPU with GpuLayerCount set to 0, the CUDA backend is loaded

That's how it's meant to work - if the CUDA binaries are available and compatible with your system they will be used unless you explicitly disable CUDA at load time with NativeLibraryConfig.All.WithCuda(false).

Changing GpuLayerCount changes how many layers are sent to the GPU, but does not change which backend is used. Setting it to zero should be equivalent to not using CUDA at all (although possibly slightly slower than the pure CPU binaries).
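
If you want to guarantee a CPU-only run while both backend packages are installed, a minimal sketch would look like this (the config call has to happen before the first model load; the model path is just a placeholder):

using LLama;
using LLama.Common;
using LLama.Native;

// Opt out of the CUDA binaries before any native library is loaded,
// so the runtime detection falls back to the CPU backend.
NativeLibraryConfig.All.WithCuda(false);

var parameters = new ModelParams("model.gguf")
{
    GpuLayerCount = 0 // keep all layers on the CPU
};
using var model = LLamaWeights.LoadFromFile(parameters);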

martindevans avatar Oct 24 '24 15:10 martindevans

I'm wondering if this bug is still manifesting. Using the latest 0.18.0 with the CPU and Cuda12 backends installed, it defaults to CPU no matter what settings I specify. Here is what I am trying:

  • NativeLibraryConfig.All.WithCuda(true) and GpuLayerCount=33 => CPU
  • No NativeLibraryConfig parameters and GpuLayerCount=33 => CPU
  • GpuLayerCount=0 => CPU
  • GpuLayerCount=-1 => CPU
  • Removing the CPU backend => GPU

The only way I can get the Cuda12 backend to work is by removing CPU, which is contrary to the documentation ("Please note that before LLamaSharp v0.10.0, only one backend package should be installed at a time.").

stonstad avatar Oct 30 '24 17:10 stonstad

What does the log callback show for you (see the comment just above for how to add the callback)?

martindevans avatar Oct 30 '24 17:10 martindevans

Certainly!

Dependencies: (screenshot attached)

Output: GPU and CPU Backend.txt (attached log)

I tested with NativeLibraryConfig.All.WithCuda(true) and just default settings -- the same settings which work if only the Cuda12 backend is installed.

stonstad avatar Oct 30 '24 21:10 stonstad

There's something a bit odd going on here: your log shows that it tried to load things in this order:

Info: Failed Loading 'runtimes/win-x64/native/vulkan/llama.dll'
Info: Successfully loaded './runtimes/win-x64/native/avx512/llama.dll'

So it tried to load a GPU backend and then fell back to a CPU backend when that failed, which is what we'd expect, except that the GPU backend it tried was Vulkan!

I'm at a bit of a loss what's going on here. @m0nsky any ideas?

martindevans avatar Oct 30 '24 21:10 martindevans

Here's the associated code that manifests the buggy behavior.

using LLama.Common;
using LLama;
using LLama.Sampling;
using LLama.Native;
using System.Diagnostics;

namespace LLM
{
    internal class Program
    {
        public static async Task Main(string[] args)
        {
            NativeLibraryConfig.All.WithLogCallback((LLamaLogLevel level, string message) =>
            {
                Console.Write($"{level}: {message}");
            });

            NativeLibraryConfig.All.WithCuda(true);

            string modelPath = @"Meta-Llama-3.1-8B-Instruct-Q6_K.gguf";

            var parameters = new ModelParams(modelPath)
            {
                ContextSize = 8192, // The longest length of chat as memory.
                GpuLayerCount = 33, // How many layers to offload to GPU. Please adjust it according to your GPU memory.
            };
            using var model = LLamaWeights.LoadFromFile(parameters);
            using var context = model.CreateContext(parameters);
            var executor = new InteractiveExecutor(context);

            // Add chat histories as prompt to tell AI how to act.
            var chatHistory = new ChatHistory();
            chatHistory.AddMessage(AuthorRole.System, "Transcript of a dialog, where the User interacts with an Assistant named Bob. Bob is helpful, kind, honest, good at writing, and never fails to answer the User's requests immediately and with precision.");
            chatHistory.AddMessage(AuthorRole.User, "Hello, Bob.");
            chatHistory.AddMessage(AuthorRole.Assistant, "Hello. How may I help you today?");

            ChatSession session = new(executor, chatHistory);

            InferenceParams inferenceParams = new InferenceParams()
            {
                MaxTokens = 2048, // No more than 2048 tokens should appear in the answer. Remove it if the antiprompt is enough for control.
                AntiPrompts = new List<string> { "User:" }, // Stop generation once antiprompts appear.
                SamplingPipeline = new DefaultSamplingPipeline(),
            };

            Console.ForegroundColor = ConsoleColor.Yellow;
            Console.Write("Test Started: ");

            Console.ForegroundColor = ConsoleColor.Green;
            string userInput = "Write out the numbers from 1 to 100. e.g. one, two, three ...";
            Console.WriteLine(userInput);

            Stopwatch stopWatch = new Stopwatch();
            stopWatch.Start();

            var tokens = session.ChatAsync(new ChatHistory.Message(AuthorRole.User, userInput), inferenceParams);
            await foreach (var token in tokens)
            {
                Console.ForegroundColor = ConsoleColor.White;
                Console.Write(token);
            }

            stopWatch.Stop();
            Console.WriteLine();
            Console.WriteLine(stopWatch.ElapsedMilliseconds + "ms");
        }
    }
}

stonstad avatar Nov 05 '24 14:11 stonstad

When I load a model on the CPU with GpuLayerCount set to 0, the CUDA backend is loaded

That's how it's meant to work - if the CUDA binaries are available and compatible with your system they will be used unless you explicitly disable CUDA at load time with NativeLibraryConfig.All.WithCuda(false).

Changing GpuLayerCount changes how many layers are sent to the GPU, but does not change which backend is used. Setting it to zero should be equivalent to not using CUDA at all (although possibly slightly slower than the pure CPU binaries).

Thank you for your response @martindevans! I have a follow-up question: when I set GpuLayerCount to zero (I have a GPU and the CUDA backend is loaded) and serve a model, I noticed that some memory is still allocated on my GPU. I suppose this is the allocation that happens when creating a LLamaContext.

Is this considered normal when GpuLayerCount is set to zero?

clarinevong avatar Nov 05 '24 17:11 clarinevong

I'm not sure about that. Ideally there probably shouldn't be any memory used, but I can easily imagine some resources getting created even when they're not technically needed. I'd recommend asking upstream in the llama.cpp repo; they'll know more about the details.

martindevans avatar Nov 05 '24 18:11 martindevans