LLamaSharp
Not using CUDA when both CPU and Cuda12 back-ends are present.
I'm using Kernel-memory with LLamaSharp. Despite having a RTX 3080 and the latest CUDA drivers installed, CUDA is not used.
Not sure if this is a bug or I'm missing something, so here's a question instead:
The LlamaSharp.csproj contains
<PackageReference Include="LLamaSharp.Backend.Cpu"/>
<PackageReference Include="LLamaSharp.Backend.Cuda12"/>
I found that if both the Cpu and Cuda12 back-ends are referenced, only the CPU is used even though the CUDA DLL is loaded. Interestingly, the logs say that the CUDA back-end was loaded, yet CUDA is not used:
[LLamaSharp Native] [Info] Detected OS Platform: WINDOWS
[LLamaSharp Native] [Info] Detected cuda major version 12.
[LLamaSharp Native] [Info] ./runtimes/win-x64/native/cuda12/libllama.dll is selected and loaded successfully.
llama_model_loader: loaded meta data with 19 key-value pairs and 291 tensors from D:\Source\km\Models\llama-2-7b-chat.Q5_K_M.gguf (version GGUF V2)
...
If I remove the reference to LLamaSharp.Backend.Cpu, the CUDA back-end is used. The logs show:
[LLamaSharp Native] [Info] Detected OS Platform: WINDOWS
[LLamaSharp Native] [Info] Detected cuda major version 12.
[LLamaSharp Native] [Info] ./runtimes/win-x64/native/cuda12/libllama.dll is selected and loaded successfully.
ggml_init_cublas: GGML_CUDA_FORCE_MMQ: no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 3080, compute capability 8.6
llama_model_loader: loaded meta data with 19 key-value pairs and 291 tensors from D:\Source\km\Models\llama-2-7b-chat.Q5_K_M.gguf (version GGUF V2)
...
I've reported this to the kernel memory project, but was advised to report this here.
It seems that the CPU and Cuda12 back-ends can't be installed at the same time.
If this is by design, the issue can be closed; since I don't know whether it is, I'll leave it open for others to comment.
Originally they weren't meant to be installed together, since it wasn't clear which binaries should be used. However, we now have runtime detection, which should probe your system and load the best binaries available.
In your case that looks like it is working, since the logs say:
[LLamaSharp Native] [Info] ./runtimes/win-x64/native/cuda12/libllama.dll is selected and loaded successfully.
But then for some reason it isn't actually using your GPU! I think this is probably a real bug.
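For reference, a minimal sketch of how to see what the runtime detection actually selects, using the native-log callback that appears later in this thread (call it before any model is loaded; exact member names may vary between versions):

```csharp
using LLama.Native;

// Print every native-loading log message, so the selected
// runtimes/win-x64/native/... DLL is visible at startup.
NativeLibraryConfig.All.WithLogCallback((LLamaLogLevel level, string message) =>
{
    Console.Write($"{level}: {message}");
});
```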
Hi @vvdb-architecture, if it is not using your GPU even with the Cuda backend, do you have GpuLayerCount in your ModelParams set to -1, or to a value in the 1-33 range? If it is not set, or is set to 0, it will default to CPU-only, even with just the Cuda backend installed. Sorry if I've misunderstood your problem, but this may help other users who hit the same issue.
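As a concrete sketch of that suggestion (the model path is a placeholder; the right layer count depends on the model and available VRAM):

```csharp
using LLama.Common;

// GpuLayerCount controls how many layers are offloaded to the GPU.
// Per the comment above: leaving it unset or setting it to 0 keeps
// everything on the CPU, even with the Cuda backend installed;
// a positive value (or -1) offloads layers to the GPU.
var parameters = new ModelParams(@"models/llama-2-7b-chat.Q5_K_M.gguf") // placeholder path
{
    GpuLayerCount = 33, // e.g. all layers of a 7B model
};
```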
It's set to 33.
I think this issue can be closed, since the docs explicitly state that you can only install one of the back-ends.
@vvdb-architecture Sorry for seeing this issue late. It's my duty to resolve this problem, since I wrote the main part of the dynamic loading of the native library. #588 is also a duplicate of this issue.
since the docs explicitly state you can only install one of the back-ends
Yes, but the documentation has been outdated for a long time. It still describes v0.5.0, while we are already moving towards v0.11.0. I stated that in the docs because dynamic loading was not supported in v0.5.0.
LLamaSharp is expected to work with multiple backend packages in the current version, so I'll re-open this issue and dig into it. Thank you for your reminder in #589!
Hello, is there any news on this issue?
I encounter a similar issue. I have installed both LLamaSharp.Backend.Cpu and LLamaSharp.Backend.Cuda12.Windows (0.18.0 versions). Following the README, I added the following line to show which native library file is loaded:
NativeLibraryConfig.Instance.WithLogCallback(delegate (LLamaLogLevel level, string message) { Console.Write($"{level}: {message}"); } )
When I load a model with GpuLayerCount equal to 0, the CUDA backend is loaded:
- PreferCuda: True
- PreferVulkan: True
- PreferredAvxLevel: AVX2
- AllowFallback: True
- SkipCheck: False
- SearchDirectories and Priorities: { ./ }
Info: NativeLibraryConfig Description:
- LibraryName: LLama
- Path: ''
- PreferCuda: True
- PreferVulkan: True
- PreferredAvxLevel: AVX2
- AllowFallback: True
- SkipCheck: False
- SearchDirectories and Priorities: { ./ }
Debug: Got relative library path 'runtimes/win-x64/native/cuda12/llama.dll' from local with (NativeLibraryName: LLama, UseCuda: True, UseVulkan: False, AvxLevel: None), trying to load it...
Debug: Found full path file './runtimes/win-x64/native/cuda12/llama.dll' for relative path 'runtimes/win-x64/native/cuda12/llama.dll'
Info: Successfully loaded './runtimes/win-x64/native/cuda12/llama.dll'
When I only install LLamaSharp.Backend.Cpu, the correct native library file is loaded:
- PreferCuda: True
- PreferVulkan: True
- PreferredAvxLevel: AVX2
- AllowFallback: True
- SkipCheck: False
- SearchDirectories and Priorities: { ./ }
Debug: Got relative library path 'runtimes/win-x64/native/cuda12/llama.dll' from local with (NativeLibraryName: LLama, UseCuda: True, UseVulkan: False, AvxLevel: None), trying to load it...
Debug: Found full path file 'runtimes/win-x64/native/cuda12/llama.dll' for relative path 'runtimes/win-x64/native/cuda12/llama.dll'
Info: Failed Loading 'runtimes/win-x64/native/cuda12/llama.dll'
Debug: Got relative library path 'runtimes/win-x64/native/vulkan/llama.dll' from local with (NativeLibraryName: LLama, UseCuda: False, UseVulkan: True, AvxLevel: None), trying to load it...
Debug: Found full path file 'runtimes/win-x64/native/vulkan/llama.dll' for relative path 'runtimes/win-x64/native/vulkan/llama.dll'
Info: Failed Loading 'runtimes/win-x64/native/vulkan/llama.dll'
Debug: Got relative library path 'runtimes/win-x64/native/avx2/llama.dll' from local with (NativeLibraryName: LLama, UseCuda: False, UseVulkan: False, AvxLevel: Avx2), trying to load it...
Debug: Found full path file './runtimes/win-x64/native/avx2/llama.dll' for relative path 'runtimes/win-x64/native/avx2/llama.dll'
Info: Successfully loaded './runtimes/win-x64/native/avx2/llama.dll'
When I load a model with GpuLayerCount equal to 0, the CUDA backend is loaded
That's how it's meant to work - if the CUDA binaries are available and compatible with your system, they will be used unless you explicitly disable CUDA at load time with NativeLibraryConfig.All.WithCuda(false).
Changing GpuLayerCount changes how many layers are sent to the GPU, but does not change which backend is used. Setting it to zero should be equivalent to not using CUDA at all (although possibly slightly slower than the pure CPU binaries).
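So, conversely, to force the pure CPU path while both backends are installed, something along these lines should work (a sketch; the configuration has to run before the native library is first loaded, and the model path is a placeholder):

```csharp
using LLama;
using LLama.Common;
using LLama.Native;

// Opt out of the CUDA binaries entirely, before anything loads the native library.
NativeLibraryConfig.All.WithCuda(false);

var parameters = new ModelParams(@"models/llama-2-7b-chat.Q5_K_M.gguf") // placeholder path
{
    GpuLayerCount = 0, // nothing will be offloaded to the GPU anyway
};
using var model = LLamaWeights.LoadFromFile(parameters);
```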
I'm wondering if this bug is still manifesting. Using the latest 0.18.0 with the CPU and Cuda12 backends installed, it defaults to CPU no matter what settings I specify. Here is what I am trying:
- NativeLibraryConfig.All.WithCuda(true) and GpuLayerCount=33 => CPU
- No NativeLibraryConfig parameters and GpuLayerCount=33 => CPU
- GpuLayerCount=0 => CPU
- GpuLayerCount=-1 => CPU
- Removing the CPU backend => GPU
The only way I can get the Cuda12 backend to work is by removing the CPU backend, which runs counter to the documentation ("Please note that before LLamaSharp v0.10.0, only one backend package should be installed at a time."), since that note implies multiple backend packages are supported from v0.10.0 onwards.
What does the log callback show for you (see the comment just above for how to add the callback)?
Certainly! Attachments: Dependencies, Output GPU and CPU Backend.txt
I tested with NativeLibraryConfig.All.WithCuda(true) and just default settings -- the same settings which work if only the Cuda12 backend is installed.
There's something a bit odd going on here. Your log shows that it tried to load things in this order:
Info: Failed Loading 'runtimes/win-x64/native/vulkan/llama.dll'
Info: Successfully loaded './runtimes/win-x64/native/avx512/llama.dll'
So it tried to load a GPU backend and then fell back to a CPU backend when that failed, which is what we'd expect, except that the GPU backend it was trying to load was Vulkan!
I'm at a bit of a loss what's going on here. @m0nsky any ideas?
Here's the associated code that reproduces the buggy behavior.
using LLama.Common;
using LLama;
using LLama.Sampling;
using LLama.Native;
using System.Diagnostics;

namespace LLM
{
    internal class Program
    {
        public static async Task Main(string[] args)
        {
            NativeLibraryConfig.All.WithLogCallback((LLamaLogLevel level, string message) =>
            {
                Console.Write($"{level}: {message}");
            });
            NativeLibraryConfig.All.WithCuda(true);

            string modelPath = @"Meta-Llama-3.1-8B-Instruct-Q6_K.gguf";
            var parameters = new ModelParams(modelPath)
            {
                ContextSize = 8192,  // The longest length of chat kept as memory.
                GpuLayerCount = 33,  // How many layers to offload to the GPU. Adjust according to your GPU memory.
            };
            using var model = LLamaWeights.LoadFromFile(parameters);
            using var context = model.CreateContext(parameters);
            var executor = new InteractiveExecutor(context);

            // Add chat histories as a prompt to tell the AI how to act.
            var chatHistory = new ChatHistory();
            chatHistory.AddMessage(AuthorRole.System, "Transcript of a dialog, where the User interacts with an Assistant named Bob. Bob is helpful, kind, honest, good at writing, and never fails to answer the User's requests immediately and with precision.");
            chatHistory.AddMessage(AuthorRole.User, "Hello, Bob.");
            chatHistory.AddMessage(AuthorRole.Assistant, "Hello. How may I help you today?");

            ChatSession session = new(executor, chatHistory);

            InferenceParams inferenceParams = new InferenceParams()
            {
                MaxTokens = 2048,  // No more than 2048 tokens should appear in the answer. Remove it if the antiprompt is enough for control.
                AntiPrompts = new List<string> { "User:" },  // Stop generation once the antiprompt appears.
                SamplingPipeline = new DefaultSamplingPipeline(),
            };

            Console.ForegroundColor = ConsoleColor.Yellow;
            Console.Write("Test Started: ");
            Console.ForegroundColor = ConsoleColor.Green;
            string userInput = "Write out the numbers from 1 to 100. e.g. one, two, three ...";
            Console.WriteLine(userInput);

            Stopwatch stopWatch = new Stopwatch();
            stopWatch.Start();

            var tokens = session.ChatAsync(new ChatHistory.Message(AuthorRole.User, userInput), inferenceParams);
            await foreach (var token in tokens)
            {
                Console.ForegroundColor = ConsoleColor.White;
                Console.Write(token);
            }

            stopWatch.Stop();
            Console.WriteLine();
            Console.WriteLine(stopWatch.ElapsedMilliseconds + "ms");
        }
    }
}
When I load a model with GpuLayerCount equal to 0, the CUDA backend is loaded
That's how it's meant to work - if the CUDA binaries are available and compatible with your system, they will be used unless you explicitly disable CUDA at load time with NativeLibraryConfig.All.WithCuda(false). Changing GpuLayerCount changes how many layers are sent to the GPU, but does not change which backend is used. Setting it to zero should be equivalent to not using CUDA at all (although possibly slightly slower than the pure CPU binaries).
Thank you for your response @martindevans! I have a follow-up question: when I set GpuLayerCount to zero (I have a GPU and the CUDA backend is loaded) and serve a model, I noticed that some memory is still allocated on my GPU. I suppose this is the allocation that happens when creating a LLamaContext.
Is this considered normal when GpuLayerCount is set to zero?
I'm not sure about that. Ideally there probably shouldn't be any memory used, but I can easily imagine some resources getting created even when they're not technically needed. I'd recommend asking upstream in the llama.cpp repo; they'll know more about the details.