llamafile
LLaVA loads with errors, GUI opens but doesn't produce any output
I renamed the file to llava-v1.5-7b-q4-server.llamafile.exe, ran it in the cmd.exe terminal. Everything looked okay (who knows? I don't know what all the output means) but it ended with this:
llama.cpp/ggml.c:9064: assert(!isnan(x)) failed (cosmoaddr2line /C/Users/shady/Desktop/llava-v1.5-7b-q4-server.llamafile.exe 4c5399 4e3822 4e7219 5c952c 5e75c3)
llama.cpp/ggml.c:9064: assert(!isnan(x)) failed (cosmoaddr2line /C/Users/shady/Desktop/llava-v1.5-7b-q4-server.llamafile.exe 4c5399 4e3822 4e7219 5c952c 5e75c3)
error: Uncaught SIGABRT (SI_TKILL) at 0 on DESKTOP-MM6QUGS pid 7460 tid 11132
llava-v1.5-7b-q4-server.llamafile
No error information
Windows Cosmopolitan 3.1.2 MODE=x86_64 DESKTOP-MM6QUGS 10.0-19045
RAX 0000100080b7f950 RBX 000000000061b6f3 RDI 0000100080b7f880
RCX 000010008019d7b0 RDX 0000000000000000 RSI 00000000fffffffa
RBP 0000100080b7fbd0 RSP 0000100080b7f760 RIP 0000000000410672
R8 0000000000000000 R9 0000000000000001 R10 0000000000000000
R11 0000000000000246 R12 0000000000000006 R13 000000000061b4bf
R14 0000100080b7fc20 R15 0000000000002aff
TLS 000010008019d760
XMM0 00000000000000000000000000000000 XMM8 00000000000000000000000000000000
XMM1 00000000000000000000000000000000 XMM9 00000000000000000000000000000000
XMM2 00000000000000000000000000000000 XMM10 00000000000000000000000000000000
XMM3 00000000000000000000000000000000 XMM11 00000000000000000000000000000000
XMM4 00000000000000000000000000000000 XMM12 00000000000000000000000000000000
XMM5 00000000000000000000000000000000 XMM13 00000000000000000000000000000000
XMM6 0000000000000000000000003727c5ac XMM14 00000000000000000000000000000000
XMM7 00000000000000000000000000000000 XMM15 00000000000000000000000000000000
The GUI opened up in my browser (Firefox) but I couldn't get it to produce any output. It would just freeze. I could reset the prompt and start over, but it never produced output. I'm running Windows 10 with an RTX 2060 graphics card.
I have the same issue on Windows 10 with a 1660 Super
Running on WSL2 in Windows 11, it worked for several prompts then hit this error.
It doesn't work for me even with WSL2 in Windows 11
I'm having the same issue here. I've been trawling a few forums regarding this matter, but unfortunately still no exact answer.
Taking a look now. The assertion gives us the command to get the backtrace in case that's useful:
jart@nightmare:~/llamafile$ git checkout 0.4
HEAD is now at 188f7fc Release llamafile v0.4
jart@nightmare:~/llamafile$ make -j8 o//llama.cpp/server/server
[build output...]
jart@nightmare:~/llamafile$ cosmoaddr2line o//llama.cpp/server/server.com.dbg 4c5399 4e3822 4e7219 5c952c 5e75c3
0x00000000004c5399: ggml_compute_forward_silu_f32 at /home/jart/llamafile/llama.cpp/ggml.c:9064
(inlined by) ggml_compute_forward_silu at /home/jart/llamafile/llama.cpp/ggml.c:9078
(inlined by) ggml_compute_forward_unary at /home/jart/llamafile/llama.cpp/ggml.c:13567
0x00000000004e3822: ggml_compute_forward at /home/jart/llamafile/llama.cpp/ggml.c:14391
0x00000000004e7219: ggml_compute_forward at /home/jart/llamafile/llama.cpp/ggml.c:14136
(inlined by) ggml_graph_compute_thread at /home/jart/llamafile/llama.cpp/ggml.c:16306
0x00000000005c952c: PosixThread at /home/jart/cosmo/libc/thread/pthread_create.c:123
0x00000000005e75c3: __stack_call at /home/jart/cosmo/libc/intrin/stackcall.S:39
The line in question is here:
https://github.com/Mozilla-Ocho/llamafile/blob/658b18a2edfeaf021adaa33eb3b890b193a9a8ef/llama.cpp/ggml.c#L9060-L9067
Git blaming this upstream reveals a peculiar diff.
commit fcca0a700487999d52a525c96d6661e9f6a8703a
Author: Georgi Gerganov <[email protected]>
Date: Mon Oct 9 14:32:17 2023 +0300
refact : fix convert script + zero out KV cache to avoid nans (#3523)
* refact : fix convert script + zero out KV cache to avoid nans
* ggml : silu(-inf) should never happen
* metal : assert various kernel requirements
diff --git a/ggml.c b/ggml.c
index 6d1776c..5bb1da3 100644
--- a/ggml.c
+++ b/ggml.c
@@ -11233,7 +11233,7 @@ static void ggml_compute_forward_silu_f32(
#ifndef NDEBUG
for (int k = 0; k < nc; k++) {
- const float x = ((float *) ((char *) dst->data + i1*( dst->nb[1])))[k];
+ const float x = ((float *) ((char *) dst->data + i1*(dst->nb[1])))[k];
UNUSED(x);
assert(!isnan(x));
assert(!isinf(x));
Question: If you build llamafile from source as follows:
git pull
make clean
make -j8 CPPFLAGS=-DNDEBUG
Then do things work as expected?
Here's a prebuilt binary with debugging checks disabled to save you time: server.zip
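Roughly, on Windows: extract the zip, rename the extracted file so it ends in .exe, then run it from a terminal with -m pointing at a GGUF model you already have (the model filename below is just a placeholder):
.\server.exe -m .\your-model.Q4_K_M.gguf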
Without disrespect, if you show us what has to happen in order to test this out, I'm glad to do it. But consider the following: I found llamafile and downloaded the example llamafile for LLaVA: https://huggingface.co/jartine/llava-v1.5-7B-GGUF/resolve/main/llava-v1.5-7b-q4-server.llamafile?download=true
And that was it, I ran it, tried some prompt and it crashed.
Therefore, I don't know what you mean by "If you build llamafile from source as follows".
Would that snippet of code work on Windows 10?
I guess it's more oriented to a Linux OS?
If you have the time to tell me how to build it, or how to use the server.zip file you created, I'll gladly do it.
Bests,
Ok, I'll leave this here as I have no clue if it's useful.
.\server.exe -m .\mistral-7b-instruct-v0.1.Q4_K_M.gguf
NVIDIA cuBLAS GPU support successfully loaded
ggml_init_cublas: GGML_CUDA_FORCE_MMQ: no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 3060 Ti, compute capability 8.6
{"timestamp":1703734116,"level":"INFO","function":"main","line":2669,"message":"build info","build":1500,"commit":"a30b324"}
{"timestamp":1703734116,"level":"INFO","function":"main","line":2672,"message":"system info","n_threads":4,"n_threads_batch":-1,"total_threads":8,"system_info":"AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | "}
llama_model_loader: loaded meta data with 20 key-value pairs and 291 tensors from ./mistral-7b-instruct-v0.1.Q4_K_M.gguf (version GGUF V2)
llama_model_loader: - tensor 0: token_embd.weight q4_K [ 4096, 32000, 1, 1 ]
llama_model_loader: - tensor 1: blk.0.attn_q.weight q4_K [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 2: blk.0.attn_k.weight q4_K [ 4096, 1024, 1, 1 ]
[... trimmed by jart for brevity ...]
llama_model_loader: - tensor 285: blk.31.ffn_up.weight q4_K [ 4096, 14336, 1, 1 ]
llama_model_loader: - tensor 286: blk.31.ffn_down.weight q6_K [ 14336, 4096, 1, 1 ]
llama_model_loader: - tensor 287: blk.31.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 288: blk.31.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 289: output_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 290: output.weight q6_K [ 4096, 32000, 1, 1 ]
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.name str = mistralai_mistral-7b-instruct-v0.1
llama_model_loader: - kv 2: llama.context_length u32 = 32768
llama_model_loader: - kv 3: llama.embedding_length u32 = 4096
llama_model_loader: - kv 4: llama.block_count u32 = 32
llama_model_loader: - kv 5: llama.feed_forward_length u32 = 14336
llama_model_loader: - kv 6: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 7: llama.attention.head_count u32 = 32
llama_model_loader: - kv 8: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 10: llama.rope.freq_base f32 = 10000.000000
llama_model_loader: - kv 11: general.file_type u32 = 15
llama_model_loader: - kv 12: tokenizer.ggml.model str = llama
llama_model_loader: - kv 13: tokenizer.ggml.tokens arr[str,32000] = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv 14: tokenizer.ggml.scores arr[f32,32000] = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv 15: tokenizer.ggml.token_type arr[i32,32000] = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv 16: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 17: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 18: tokenizer.ggml.unknown_token_id u32 = 0
llama_model_loader: - kv 19: general.quantization_version u32 = 2
llama_model_loader: - type f32: 65 tensors
llama_model_loader: - type q4_K: 193 tensors
llama_model_loader: - type q6_K: 33 tensors
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format = GGUF V2
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32000
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 32768
llm_load_print_meta: n_embd = 4096
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_layer = 32
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_gqa = 4
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff = 14336
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 32768
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: model type = 7B
llm_load_print_meta: model ftype = mostly Q4_K - Medium
llm_load_print_meta: model params = 7.24 B
llm_load_print_meta: model size = 4.07 GiB (4.83 BPW)
llm_load_print_meta: general.name = mistralai_mistral-7b-instruct-v0.1
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_tensors: ggml ctx size = 0.12 MiB
llm_load_tensors: using CUDA for GPU acceleration
llm_load_tensors: mem required = 4165.48 MiB
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/33 layers to GPU
llm_load_tensors: VRAM used: 0.00 MiB
warning: posix_madvise(.., POSIX_MADV_WILLNEED) failed: No error information (win32 error 998)
...............................................................................................
llama_new_context_with_model: n_ctx = 512
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: KV self size = 64.00 MiB, K (f16): 32.00 MiB, V (f16): 32.00 MiB
llama_build_graph: non-view tensors processed: 676/676
llama_new_context_with_model: compute buffer total size = 76.32 MiB
llama_new_context_with_model: VRAM scratch buffer: 73.00 MiB
llama_new_context_with_model: total VRAM used: 73.00 MiB (model: 0.00 MiB, context: 73.00 MiB)
Available slots:
-> Slot 0 - max context: 512
llama server listening at http://127.0.0.1:8080
failed to open http://127.0.0.1:8080/ in a browser tab using /c/windows/explorer.exe: process exited with non-zero status
loading weights...
{"timestamp":1703734124,"level":"INFO","function":"main","line":3068,"message":"HTTP server listening","hostname":"127.0.0.1","port":8080}
all slots are idle and system prompt is empty, clear the KV cache
{"timestamp":1703734125,"level":"INFO","function":"log_server_request","line":2603,"message":"request","remote_addr":"","remote_port":-1,"status":200,"method":"GET","path":"/","params":{}}
{"timestamp":1703734125,"level":"INFO","function":"log_server_request","line":2603,"message":"request","remote_addr":"","remote_port":-1,"status":200,"method":"GET","path":"/index.js","params":{}}
{"timestamp":1703734125,"level":"INFO","function":"log_server_request","line":2603,"message":"request","remote_addr":"","remote_port":-1,"status":200,"method":"GET","path":"/json-schema-to-grammar.mjs","params":{}}
{"timestamp":1703734125,"level":"INFO","function":"log_server_request","line":2603,"message":"request","remote_addr":"","remote_port":-1,"status":200,"method":"GET","path":"/completion.js","params":{}}
{"timestamp":1703734125,"level":"INFO","function":"log_server_request","line":2603,"message":"request","remote_addr":"","remote_port":-1,"status":404,"method":"GET","path":"/favicon.ico","params":{}}
slot 0 is processing [task id: 0]
slot 0 : in cache: 0 tokens | to process: 60 tokens
slot 0 : kv cache rm - [0, end)
print_timings: prompt eval time = 5533.45 ms / 60 tokens ( 92.22 ms per token, 10.84 tokens per second)
print_timings: eval time = 303.95 ms / 2 runs ( 151.97 ms per token, 6.58 tokens per second)
print_timings: total time = 5837.40 ms
slot 0 released (63 tokens in cache)
{"timestamp":1703734150,"level":"INFO","function":"log_server_request","line":2603,"message":"request","remote_addr":"","remote_port":-1,"status":200,"method":"POST","path":"/completion","params":{}}
OS Information:
Edition Windows 10 Pro
Version 22H2
Installed on 2023-07-15
OS build 19045.2965
Experience Windows Feature Experience Pack 1000.19041.1000.0
PC Information:
CPU: i7-4790K
RAM 32 GB
GPU RTX 3060 Ti (8GB)
There's an error message in the log above if you search for 'error 998'.
Bests,
@frenchiveruti I'm assuming you downloaded the "server.zip" prebuilt binary I attached to my message above, extracted it, renamed the file to have .exe, ran it, and it worked? Please confirm.
Here's a clue. So far this issue has been reported with the following cards:
- NVIDIA GeForce RTX 3060 Ti
- NVIDIA GeForce RTX 3050
I'm having the same issue.
When I run the 0.4 server version on Windows 10 with an NVIDIA GeForce RTX 3060 12GB (non Ti) I get either gibberish output, a blank "Llama: " reply, or no output at all.
I just tried the "server.zip" version posted above by @jart and got the same results; it does not solve the problem on an NVIDIA GeForce RTX 3060 12GB (non-Ti) GPU.
If I run the server, both 0.4 and the one posted above, on a Windows 10 computer with no NVIDIA GPU, just an Intel 12th-gen CPU, it works as expected.
@frenchiveruti I'm assuming you downloaded the "server.zip" prebuilt binary I attached to my message above, extracted it, renamed the file to have .exe, ran it, and it worked? Please confirm.
Hi, indeed I did, I downloaded the one you attached and tried running a model I had downloaded before which is Mistral 7B
@Hiltronix @frenchiveruti Is it possible that you're uploading an image and image processing is just going very very slowly? It's possible that this issue might be a duplicate of #142. If you wait several minutes for it to produce a result, or maybe try a tinier image, then you'll know for sure. If that's the case, then please confirm so I can close that out. Then use this simple workaround:
The problem with #142 is that tinyBLAS processes images slower than we anticipated when we wrote it. We're working on fixing that. In the meantime, the old llamafile release (0.4) from two weeks ago, and the new 0.4.1 release today, both allow you to use NVIDIA's faster cuBLAS library instead. The way you do that is:
- Delete the .llamafile directory in your home directory.
- Install CUDA
- Install MSVC
- Open the "x64 MSVC command prompt" from Start
- Run llamafile there for the first invocation.
There's a YouTube video tutorial on doing this here: https://youtu.be/d1Fnfvat6nM?si=W6Y0miZ9zVBHySFj
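Condensed into commands, the workaround above is roughly the following (a sketch: it assumes the .llamafile folder lives in your user profile directory and uses the LLaVA server llamafile name as an example; run it inside the x64 MSVC command prompt after installing CUDA and MSVC):
rmdir /s /q "%USERPROFILE%\.llamafile"
.\llava-v1.5-7b-q4-server.llamafile.exe -ngl 35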
Hi, I'll test again later today, but I did not use an image. I was just prompting the LLM with a simple test to see how it performed.
Here's how the server you provided performed with Mistral 7B v0.2 Q4. The launch command was:
.\server.exe -m .\mistral-7b-instruct-v0.2.Q4_K_M.gguf --no-mmap -ngl 35
Then, the settings in the prompt template for the llama.cpp server were:
Finally, it worked fantastic with a 3060 Ti.
I did not try out image recognition with LLaVa.
- If it's working, then was it the --no-mmap or the -ngl 35 flag that fixed it for you?
- What URL did you download server.exe from?
The -ngl 35 parameter is the one suggested in the GPU Support section of this GitHub repo.
I used the --no-mmap parameter because I was running into the 998 error on Windows, which signifies out of memory; after I started using it, the issues went away.
The URL is the one from your first comment in this thread, the one with debugging checks disabled.
I'm on mobile, sorry for the lack of links.
998 means ERROR_NOACCESS, which is caused by madvise() failing due to a bug I recently fixed in Cosmopolitan Libc that'll trickle down into future llamafile releases. I'd be surprised if that impacted mmap(), which should be unrelated unless corruption is happening.
Also, just to be clear, are we still talking about the NaN assertion error? It'd seem odd that memory mapping issues would cause NaNs. One thing that could be causing the NaNs is an issue I recently spotted where tinyBLAS was narrowing float to half in one of our GEMM functions, which I know is used by LLaVA when processing images (although I don't think it's used by Mistral). See: https://github.com/Mozilla-Ocho/llamafile/blob/6423228b5ddd4862a3ab3d275a168692dadf4cdc/llamafile/tinyblas.cu#L50-L51
Ok, let's do this.
When I'm back home today I'll try to run the server you provided + Mistral without any parameters, then add -ngl, and then add --no-mmap. Sounds good?
Sounds good. Thanks for volunteering to help us get to the bottom of this.
@jart Here is what I've found.
I'm using Windows 10 and a Nvidia RTX3060 12GB GPU.
If I follow the instructions to install the CUDA Toolkit, open the MSVC prompt, and then run "llava-v1.5-7b-q4-server.llamafile" so that it builds the ".llamafile" folder containing the compiled DLLs, and then run "llava-v1.5-7b-q4-server.llamafile" again, even from a normal command-line prompt, the result is a GPU-accelerated chat AI working as expected. I have gotten this to work with the RTX 3060.
But if I try to use "llamafile-server.exe" from version 0.4.0, 0.4.1, or the special "server.exe" build you posted in the zip above, I can't get it to work with the GPU; I just get gibberish output or no output, even when using the option "-ngl 35". I have copied the compiled ".llamafile" folder to the folder that "llamafile-server.exe" resides in. If I don't use "-ngl 35", then "llamafile-server.exe" crashes when I enter the prompt in the browser on the computer with the RTX 3060, but it works as expected on the computer without the RTX 3060. This is all when using a GGUF LLM file I downloaded from Hugging Face, which always works fine on the computer without the RTX 3060.
Maybe I have the wrong expectations, so please clarify for me. I was hoping and expecting to be able to GPU-accelerate "llamafile-server.exe" with the RTX 3060 while using my own selected GGUF LLM from Hugging Face. Perhaps I'm mistaken in my understanding of the current state of the project. Is GPU acceleration only working for the prebuilt llamafile EXEs that already include an LLM, like "llava-v1.5-7b-q4-server.llamafile"?
Thanks.
Maybe I'm getting it wrong, but at no point do you mention loading a model via the -m flag.
What commands are you running exactly?
What output are you getting?
What model are you using?
GPU acceleration, for example, works fine for the person above in this same thread: https://github.com/Mozilla-Ocho/llamafile/issues/104#issuecomment-1871666169
So maybe if you provide more detail you can get more help?
As I mentioned above, the output I get is random gibberish or no output. It's one or the other each run, always different. I am using -m as it's an external model.
The format of the command I'm using is as follows, with three of the model names I'm testing listed below.
llamafile-server.exe -ngl 35 -m "kai-7b-instruct.Q4_K_M.gguf"
llamafile-server.exe -ngl 35 -m "llama-2-7b-chat.Q4_K_M.gguf"
llamafile-server.exe -ngl 35 -m "mistral-7b-instruct-v0.2.Q4_K_M.gguf"
The option "--no-mmap" that the guy that has it working makes no difference in my case, same results.
Keep in mind that each of the models you listed has a different prompt setup.
For example, here's what Mistral expects:
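As a sketch, assuming the usual Mistral-7B-Instruct format, prompts have to be wrapped in [INST] ... [/INST] tags; you can test that directly against the server's /completion endpoint (the prompt text and token count below are just examples):
curl http://127.0.0.1:8080/completion -H "Content-Type: application/json" -d "{\"prompt\": \"<s>[INST] Why is the sky blue? [/INST]\", \"n_predict\": 128}"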
I don't know about Kai or Llama but I'm certain they need their proper templates to work correctly.
@Hiltronix It sounds then like, in your case, something is wrong with our new tinyBLAS library, which is an attempt to help people get GPU acceleration without needing to install CUDA and MSVC. I might have to buy one of those graphics cards in order to troubleshoot what's wrong.
Thanks for the reply. I'm rather new to experimenting with LLMs, and decided to use the Mozilla-Ocho project to get me started and get some quick general results before diving in deep. The llama server and the web interface have worked well for me so far without the NVIDIA GPU. I've used multiple chat-oriented GGUF models from Hugging Face with the default settings in the web interface, with good results.
I just bought the RTX 3060 12GB GPU thinking it would be a good "bang for the buck" starter card to use for this. I have no other NVIDIA GPU to compare against, to check whether this particular GPU is the issue. I've tried to detail what I've done so far to make GPU acceleration work with the llama server. If there is something else you'd like me to try, to make sure it's not me or my setup, I'm open to suggestions. If you think it is this exact GPU that is somehow different, I'm happy to help by trying any builds you want me to test.
Really happy to hear CPU inference is working well. Thanks for all the clues you've shared on GPU support too! If you want to experiment with other GPUs, one thing I like to do sometimes is rent a GCE VM with an NVIDIA L4 card for a few hours. It costs a few dollars, but I can just scp my llamafile onto the VM via SSH and run it. It's nice for testing releases. Windows is the tricky one though. The good news is that some other known issues with Windows performance should be getting fixed within the next 24 hours. Stay tuned. I'm crossing my fingers and hoping the planned improvements will solve other mysteries too.
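As a rough sketch of that workflow (the hostname and llamafile name are just placeholders):
scp ./llava-v1.5-7b-q4-server.llamafile user@my-gce-vm:~/
ssh user@my-gce-vm 'chmod +x llava-v1.5-7b-q4-server.llamafile && ./llava-v1.5-7b-q4-server.llamafile -ngl 35'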
I've just merged a major improvement to tinyBLAS on Windows in #153 and uploaded new weights to Hugging Face for LLaVA. If you've been having issues running LLaVA on Windows for image processing, then please give these latest llamafiles a try.
- https://huggingface.co/jartine/llava-v1.5-7B-GGUF/tree/main
I don't know for certain if it fixes this issue, but it's worth trying in case there's any overlap with the improvements we've made.
Ok, let's do this. When I'm back home today I'll try to run the server you provided + Mistral without any parameters, then add -ngl, and then add --no-mmap. Sounds good?
I ran:
.\server.exe -m .\mistral-7b-instruct-v0.2.Q4_K_M.gguf -ngl 35
And it runs just fine, at least for written prompts; I did not try image analysis. The issue I was having was with the prompt template.
You definitely can't give an image to Mistral since it doesn't have a vision model (i.e. the --mmproj file). You can only do images with LLaVA right now. Glad to hear you figured out the issue with the prompt.
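For reference, running an external LLaVA GGUF with vision support would look roughly like this (the weight filenames here are placeholders for whatever you downloaded; --mmproj points at the vision projector file):
.\llamafile-server.exe -ngl 35 -m .\llava-v1.5-7b-Q4_K.gguf --mmproj .\llava-v1.5-7b-mmproj-f16.gguf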
I haven't heard from the OP in a while, so I'm leaning in the direction of closing this issue. Could anyone else confirm if the tinyBLAS improvement, or prompt engineering helps?
I'll be honest, I can't follow anything that you all are talking about. I'll be back to my computer in a day or two and I'll try the default method again and report back if it works.