LLaVA loads with errors, GUI opens but doesn't produce any output

dkallen78 opened this issue 1 year ago • 34 comments

I renamed the file to llava-v1.5-7b-q4-server.llamafile.exe and ran it in the cmd.exe terminal. Everything looked okay (who knows? I don't know what all the output means), but it ended with this:

llama.cpp/ggml.c:9064: assert(!isnan(x)) failed (cosmoaddr2line /C/Users/shady/Desktop/llava-v1.5-7b-q4-server.llamafile.exe 4c5399 4e3822 4e7219 5c952c 5e75c3)
llama.cpp/ggml.c:9064: assert(!isnan(x)) failed (cosmoaddr2line /C/Users/shady/Desktop/llava-v1.5-7b-q4-server.llamafile.exe 4c5399 4e3822 4e7219 5c952c 5e75c3)

error: Uncaught SIGABRT (SI_TKILL) at 0 on DESKTOP-MM6QUGS pid 7460 tid 11132
  llava-v1.5-7b-q4-server.llamafile
  No error information
  Windows Cosmopolitan 3.1.2 MODE=x86_64 DESKTOP-MM6QUGS 10.0-19045

RAX 0000100080b7f950 RBX 000000000061b6f3 RDI 0000100080b7f880
RCX 000010008019d7b0 RDX 0000000000000000 RSI 00000000fffffffa
RBP 0000100080b7fbd0 RSP 0000100080b7f760 RIP 0000000000410672
 R8 0000000000000000  R9 0000000000000001 R10 0000000000000000
R11 0000000000000246 R12 0000000000000006 R13 000000000061b4bf
R14 0000100080b7fc20 R15 0000000000002aff
TLS 000010008019d760

XMM0  00000000000000000000000000000000 XMM8  00000000000000000000000000000000
XMM1  00000000000000000000000000000000 XMM9  00000000000000000000000000000000
XMM2  00000000000000000000000000000000 XMM10 00000000000000000000000000000000
XMM3  00000000000000000000000000000000 XMM11 00000000000000000000000000000000
XMM4  00000000000000000000000000000000 XMM12 00000000000000000000000000000000
XMM5  00000000000000000000000000000000 XMM13 00000000000000000000000000000000
XMM6  0000000000000000000000003727c5ac XMM14 00000000000000000000000000000000
XMM7  00000000000000000000000000000000 XMM15 00000000000000000000000000000000

The GUI opened up in my browser (Firefox) but I couldn't get it to produce any output. It would just freeze. I could reset the prompt and start over, but it never produced output. I'm running Windows 10 with an RTX 2060 graphics card.

dkallen78 avatar Dec 15 '23 15:12 dkallen78

I have the same issue on Windows 10 with a 1660 Super

jhall39 avatar Dec 16 '23 01:12 jhall39

Running on WSL2 in Windows 11, it worked for several prompts then hit this error.

yhack avatar Dec 16 '23 02:12 yhack

It doesn't work for me even with WSL2 in Windows 11

achalpandeyy avatar Dec 22 '23 19:12 achalpandeyy

I'm having the same issue here. I've been trawling a few forums about this, but so far I haven't found an exact answer.

silsicksix avatar Dec 25 '23 03:12 silsicksix

Taking a look now. The assertion gives us the command to get the backtrace in case that's useful:

jart@nightmare:~/llamafile$ git checkout 0.4
HEAD is now at 188f7fc Release llamafile v0.4
jart@nightmare:~/llamafile$ make -j8 o//llama.cpp/server/server
[build output...]
jart@nightmare:~/llamafile$ cosmoaddr2line o//llama.cpp/server/server.com.dbg 4c5399 4e3822 4e7219 5c952c 5e75c3
0x00000000004c5399: ggml_compute_forward_silu_f32 at /home/jart/llamafile/llama.cpp/ggml.c:9064
 (inlined by) ggml_compute_forward_silu at /home/jart/llamafile/llama.cpp/ggml.c:9078
 (inlined by) ggml_compute_forward_unary at /home/jart/llamafile/llama.cpp/ggml.c:13567
0x00000000004e3822: ggml_compute_forward at /home/jart/llamafile/llama.cpp/ggml.c:14391
0x00000000004e7219: ggml_compute_forward at /home/jart/llamafile/llama.cpp/ggml.c:14136
 (inlined by) ggml_graph_compute_thread at /home/jart/llamafile/llama.cpp/ggml.c:16306
0x00000000005c952c: PosixThread at /home/jart/cosmo/libc/thread/pthread_create.c:123
0x00000000005e75c3: __stack_call at /home/jart/cosmo/libc/intrin/stackcall.S:39

The line in question is here:

https://github.com/Mozilla-Ocho/llamafile/blob/658b18a2edfeaf021adaa33eb3b890b193a9a8ef/llama.cpp/ggml.c#L9060-L9067

Git blaming this upstream reveals a peculiar diff.

commit fcca0a700487999d52a525c96d6661e9f6a8703a
Author: Georgi Gerganov <[email protected]>
Date:   Mon Oct 9 14:32:17 2023 +0300

    refact : fix convert script + zero out KV cache to avoid nans (#3523)

    * refact : fix convert script + zero out KV cache to avoid nans

    * ggml : silu(-inf) should never happen

    * metal : assert various kernel requirements

diff --git a/ggml.c b/ggml.c
index 6d1776c..5bb1da3 100644
--- a/ggml.c
+++ b/ggml.c
@@ -11233,7 +11233,7 @@ static void ggml_compute_forward_silu_f32(

 #ifndef NDEBUG
         for (int k = 0; k < nc; k++) {
-            const float x = ((float *) ((char *) dst->data + i1*( dst->nb[1])))[k];
+            const float x = ((float *) ((char *) dst->data + i1*(dst->nb[1])))[k];
             UNUSED(x);
             assert(!isnan(x));
             assert(!isinf(x));
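
For context, here is a minimal standalone sketch (an illustration, not ggml code) of why the commit message says "silu(-inf) should never happen": SiLU(x) = x * sigmoid(x) = x / (1 + exp(-x)), so a -inf input evaluates to -inf / +inf, which is NaN, and that is exactly what the assertion above catches.

#include <assert.h>
#include <math.h>
#include <stdio.h>

// Minimal illustration (not ggml code): SiLU(x) = x * sigmoid(x).
// A -inf input gives -inf / +inf == NaN, which is what the
// assert(!isnan(x)) check in ggml_compute_forward_silu_f32 trips on.
static float silu(float x) {
    return x / (1.0f + expf(-x));
}

int main(void) {
    printf("silu(1.0)  = %f\n", silu(1.0f));       // ~0.731059
    printf("silu(-inf) = %f\n", silu(-INFINITY));  // nan
    assert(isnan(silu(-INFINITY)));
    return 0;
}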

Question: If you build llamafile from source as follows:

git pull
make clean
make -j8 CPPFLAGS=-DNDEBUG

Then do things work as expected?

Here's a prebuilt binary with debugging checks disabled to save you time: server.zip
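
For anyone wondering what the flag does, here is a generic C illustration (not llamafile-specific): defining NDEBUG makes assert() expand to nothing, so the failing isnan/isinf checks are compiled out of that build.

#include <assert.h>
#include <stdio.h>

// Generic illustration of NDEBUG: when built with -DNDEBUG, assert()
// expands to a no-op, so this prints "reached" instead of aborting.
// Built without -DNDEBUG, the assert fires and the process aborts.
int main(void) {
    assert(0 && "only evaluated in debug builds");
    puts("reached: built with -DNDEBUG");
    return 0;
}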

jart avatar Dec 27 '23 23:12 jart

Question: If you build llamafile from source as follows:

git pull
make clean
make -j8 CPPFLAGS=-DNDEBUG

Then do things work as expected?

No disrespect intended: if you show us what has to happen in order to test this out, I'd be glad to do it. But consider the following: I found llamafile and downloaded the example llamafile for LLaVA: https://huggingface.co/jartine/llava-v1.5-7B-GGUF/resolve/main/llava-v1.5-7b-q4-server.llamafile?download=true

And that was it: I ran it, tried a prompt, and it crashed.

Therefore, I don't know what you mean by "If you build llamafile from source as follows".
Would that snippet work on Windows 10? I'm guessing it's more oriented toward Linux?

If you have the time to tell me how to build it, or how to use the server.zip file you created, I'll gladly do it.

Bests,

frenchiveruti avatar Dec 28 '23 03:12 frenchiveruti

Ok, I'll leave this here as I have no clue if it's useful.

 .\server.exe -m .\mistral-7b-instruct-v0.1.Q4_K_M.gguf
NVIDIA cuBLAS GPU support successfully loaded
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3060 Ti, compute capability 8.6
{"timestamp":1703734116,"level":"INFO","function":"main","line":2669,"message":"build info","build":1500,"commit":"a30b324"}
{"timestamp":1703734116,"level":"INFO","function":"main","line":2672,"message":"system info","n_threads":4,"n_threads_batch":-1,"total_threads":8,"system_info":"AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | "}
llama_model_loader: loaded meta data with 20 key-value pairs and 291 tensors from ./mistral-7b-instruct-v0.1.Q4_K_M.gguf (version GGUF V2)
llama_model_loader: - tensor    0:                token_embd.weight q4_K     [  4096, 32000,     1,     1 ]
llama_model_loader: - tensor    1:              blk.0.attn_q.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor    2:              blk.0.attn_k.weight q4_K     [  4096,  1024,     1,     1 ]
[... trimmed by jart for brevity ...]
llama_model_loader: - tensor  285:             blk.31.ffn_up.weight q4_K     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor  286:           blk.31.ffn_down.weight q6_K     [ 14336,  4096,     1,     1 ]
llama_model_loader: - tensor  287:          blk.31.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  288:           blk.31.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  289:               output_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  290:                    output.weight q6_K     [  4096, 32000,     1,     1 ]
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = mistralai_mistral-7b-instruct-v0.1
llama_model_loader: - kv   2:                       llama.context_length u32              = 32768
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                       llama.rope.freq_base f32              = 10000.000000
llama_model_loader: - kv  11:                          general.file_type u32              = 15
llama_model_loader: - kv  12:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  13:                      tokenizer.ggml.tokens arr[str,32000]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  14:                      tokenizer.ggml.scores arr[f32,32000]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  15:                  tokenizer.ggml.token_type arr[i32,32000]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  16:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  17:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  18:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  19:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q4_K:  193 tensors
llama_model_loader: - type q6_K:   33 tensors
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format           = GGUF V2
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 32768
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 32768
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = mostly Q4_K - Medium
llm_load_print_meta: model params     = 7.24 B
llm_load_print_meta: model size       = 4.07 GiB (4.83 BPW)
llm_load_print_meta: general.name     = mistralai_mistral-7b-instruct-v0.1
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.12 MiB
llm_load_tensors: using CUDA for GPU acceleration
llm_load_tensors: mem required  = 4165.48 MiB
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/33 layers to GPU
llm_load_tensors: VRAM used: 0.00 MiB
warning: posix_madvise(.., POSIX_MADV_WILLNEED) failed: No error information (win32 error 998)
...............................................................................................
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: KV self size  =   64.00 MiB, K (f16):   32.00 MiB, V (f16):   32.00 MiB
llama_build_graph: non-view tensors processed: 676/676
llama_new_context_with_model: compute buffer total size = 76.32 MiB
llama_new_context_with_model: VRAM scratch buffer: 73.00 MiB
llama_new_context_with_model: total VRAM used: 73.00 MiB (model: 0.00 MiB, context: 73.00 MiB)
Available slots:
 -> Slot 0 - max context: 512

llama server listening at http://127.0.0.1:8080

failed to open http://127.0.0.1:8080/ in a browser tab using /c/windows/explorer.exe: process exited with non-zero status
loading weights...
{"timestamp":1703734124,"level":"INFO","function":"main","line":3068,"message":"HTTP server listening","hostname":"127.0.0.1","port":8080}
all slots are idle and system prompt is empty, clear the KV cache
{"timestamp":1703734125,"level":"INFO","function":"log_server_request","line":2603,"message":"request","remote_addr":"","remote_port":-1,"status":200,"method":"GET","path":"/","params":{}}
{"timestamp":1703734125,"level":"INFO","function":"log_server_request","line":2603,"message":"request","remote_addr":"","remote_port":-1,"status":200,"method":"GET","path":"/index.js","params":{}}
{"timestamp":1703734125,"level":"INFO","function":"log_server_request","line":2603,"message":"request","remote_addr":"","remote_port":-1,"status":200,"method":"GET","path":"/json-schema-to-grammar.mjs","params":{}}
{"timestamp":1703734125,"level":"INFO","function":"log_server_request","line":2603,"message":"request","remote_addr":"","remote_port":-1,"status":200,"method":"GET","path":"/completion.js","params":{}}
{"timestamp":1703734125,"level":"INFO","function":"log_server_request","line":2603,"message":"request","remote_addr":"","remote_port":-1,"status":404,"method":"GET","path":"/favicon.ico","params":{}}
slot 0 is processing [task id: 0]
slot 0 : in cache: 0 tokens | to process: 60 tokens
slot 0 : kv cache rm - [0, end)

print_timings: prompt eval time =    5533.45 ms /    60 tokens (   92.22 ms per token,    10.84 tokens per second)
print_timings:        eval time =     303.95 ms /     2 runs   (  151.97 ms per token,     6.58 tokens per second)
print_timings:       total time =    5837.40 ms
slot 0 released (63 tokens in cache)
{"timestamp":1703734150,"level":"INFO","function":"log_server_request","line":2603,"message":"request","remote_addr":"","remote_port":-1,"status":200,"method":"POST","path":"/completion","params":{}}

OS Information:

Edition	Windows 10 Pro
Version	22H2
Installed on	‎2023-‎07-‎15
OS build	19045.2965
Experience	Windows Feature Experience Pack 1000.19041.1000.0

PC Information:

CPU: i7-4790K
RAM 32 GB
GPU RTX 3060 Ti (8GB)

There's an error message in the output above if you search for 'error 998'.

Bests,

frenchiveruti avatar Dec 28 '23 03:12 frenchiveruti

@frenchiveruti I'm assuming you downloaded the "server.zip" prebuilt binary I attached to my message above, extracted it, renamed the file to have .exe, ran it, and it worked? Please confirm.

jart avatar Dec 28 '23 06:12 jart

Here's a clue. So far this issue has been reported with the following cards:

  • NVIDIA GeForce RTX 3060 Ti
  • NVIDIA GeForce RTX 3050

jart avatar Dec 28 '23 07:12 jart

I'm having the same issue.

When I run the 0.4 server version on Windows 10 with an NVIDIA GeForce RTX 3060 12GB (non-Ti), I get either gibberish output, a blank "Llama: " reply, or no output at all.

I just tried the "server.zip" version posted above by @jart and the results are the same; no difference, it doesn't work as expected. The posted "server.zip" build does not solve the problem on an NVIDIA GeForce RTX 3060 12GB non-Ti GPU.

If I run the server version, both 0.4 and the one posted above, on a Windows 10 computer with no NVIDIA GPU, just an Intel 12th-gen CPU, it works as expected.

Hiltronix avatar Dec 28 '23 13:12 Hiltronix

@frenchiveruti I'm assuming you downloaded the "server.zip" prebuilt binary I attached to my message above, extracted it, renamed the file to have .exe, ran it, and it worked? Please confirm.

Hi, indeed I did, I downloaded the one you attached and tried running a model I had downloaded before which is Mistral 7B

frenchiveruti avatar Dec 28 '23 13:12 frenchiveruti

@Hiltronix @frenchiveruti Is it possible that you're uploading an image and image processing is just going very very slowly? It's possible that this issue might be a duplicate of #142. If you wait several minutes for it to produce a result, or maybe try a tinier image, then you'll know for sure. If that's the case, then please confirm so I can close that out. Then use this simple workaround:

The problem with #142 is that tinyBLAS processes images slower than we anticipated when we wrote it. We're working on fixing that. In the meantime, the old llamafile release (0.4) from two weeks ago, and the new 0.4.1 release today, both allow you to use NVIDIA's faster cuBLAS library instead. The way you do that is:

  1. Delete the .llamafile directory in your home directory.
  2. Install CUDA
  3. Install MSVC
  4. Open the "x64 MSVC command prompt" from Start
  5. Run llamafile there for the first invocation.

There's a YouTube video tutorial on doing this here: https://youtu.be/d1Fnfvat6nM?si=W6Y0miZ9zVBHySFj

jart avatar Dec 28 '23 15:12 jart

Hi, I'll test again later today, but I did not use an image. I was just prompting the model with a simple test to see how it performed.

frenchiveruti avatar Dec 28 '23 16:12 frenchiveruti

Here's how the server you provided performed with Mistral 7B v0.2 Q4. The launch command was:

 .\server.exe -m .\mistral-7b-instruct-v0.2.Q4_K_M.gguf --no-mmap -ngl 35

Then, the settings in the prompt template for the llama.cpp server were as follows: [screenshot of prompt template settings]

Finally, it worked fantastically with a 3060 Ti.

I did not try out image recognition with LLaVa.

frenchiveruti avatar Dec 29 '23 01:12 frenchiveruti

  1. If it's working, then was it the --no-mmap or the -ngl 35 flag that fixed it for you?
  2. What URL did you download server.exe from?

jart avatar Dec 29 '23 08:12 jart

  1. If it's working, then was it the --no-mmap or the -ngl 35 flag that fixed it for you?
  2. What URL did you download server.exe from?

The -ngl 35 parameter is the one suggested in the GPU support section of this GitHub repo.

I used the --no-mmap parameter because I was running into the 998 error on Windows, which I took to mean out of memory; after I started using it, the issue went away.

The URL is the one from your first comment in this thread, the one with debugging checks disabled.

I'm on mobile, sorry for the lack of links.

frenchiveruti avatar Dec 29 '23 13:12 frenchiveruti

998 means ERROR_NOACCESS, which is caused by madvise() failing due to a bug I recently fixed in Cosmopolitan Libc that'll trickle down into future llamafile releases. I'd be surprised if that impacted mmap(), which should be unrelated unless corruption is happening.
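
As a generic sketch of why a failed madvise() by itself shouldn't break anything (illustrative POSIX usage, not llamafile's actual loader): posix_madvise() is purely advisory, so the usual pattern is to log the failure and keep using the mapping.

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Illustrative only (not llamafile code): map a file read-only and give the
// kernel a readahead hint. If the hint fails, the mapping is still valid;
// only the performance hint is lost.
static int map_and_touch(const char *path) {
    int fd = open(path, O_RDONLY);
    if (fd == -1) return -1;
    struct stat st;
    if (fstat(fd, &st) == -1) { close(fd); return -1; }
    void *p = mmap(NULL, (size_t) st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED) { close(fd); return -1; }
    int rc = posix_madvise(p, (size_t) st.st_size, POSIX_MADV_WILLNEED);
    if (rc != 0) {
        // Advisory only: warn and carry on, the mapping still works.
        fprintf(stderr, "posix_madvise: %s\n", strerror(rc));
    }
    // ... read weights through p here ...
    munmap(p, (size_t) st.st_size);
    close(fd);
    return 0;
}

int main(int argc, char **argv) {
    return argc > 1 ? map_and_touch(argv[1]) : 0;
}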

Also, just to be clear, are we still talking about the NaN assertion error? It'd seem odd that memory mapping issues would cause NaNs. One thing that could be causing them: I recently spotted an issue where tinyBLAS was narrowing float to half in one of our GEMM functions, which I know is used by LLaVA when processing images (although I don't think it's used by Mistral). See: https://github.com/Mozilla-Ocho/llamafile/blob/6423228b5ddd4862a3ab3d275a168692dadf4cdc/llamafile/tinyblas.cu#L50-L51
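
To make that hazard concrete, here is a standalone sketch (not the tinyBLAS code; it assumes a compiler with _Float16 support, e.g. recent GCC or Clang on x86-64): half precision tops out around 65504, so narrowing a larger float intermediate overflows to +inf, and a later inf - inf becomes exactly the kind of NaN the assertion catches.

#include <math.h>
#include <stdio.h>

// Standalone illustration (not tinyBLAS): narrowing a float that exceeds
// half precision's max finite value (~65504) overflows to +inf, and a
// subsequent inf - inf produces NaN.
int main(void) {
    float big = 100000.0f;            // representable as float
    _Float16 h = (_Float16) big;      // overflows to +inf in half
    float back = (float) h;
    printf("after float->half->float: %f\n", back);    // inf
    printf("inf - inf = %f\n", back - back);            // nan
    printf("isnan = %d\n", isnan(back - back));         // 1
    return 0;
}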

jart avatar Dec 29 '23 14:12 jart

Ok, let's do this.
When I'm back home today I'll try to run the server you provided + Mistral without any parameters, then add -ngl, and then add --no-mmap. Sounds good?

frenchiveruti avatar Dec 29 '23 15:12 frenchiveruti

Sounds good. Thanks for volunteering to help us get to the bottom of this.

jart avatar Dec 29 '23 15:12 jart

@jart Here is what I've found.

I'm using Windows 10 and an NVIDIA RTX 3060 12GB GPU.

If I follow the instructions to install the CUDA Toolkit, open the MSVC prompt, and then run "llava-v1.5-7b-q4-server.llamafile" so that it builds the ".llamafile" folder containing the compiled DLLs, and then run "llava-v1.5-7b-q4-server.llamafile" again, even from a normal command-line prompt, the result is a GPU-accelerated chat AI working as expected. I have gotten this to work with the RTX 3060.

But if I try to use "llamafile-server.exe" from version 0.4.0, 0.4.1, or the special "server.exe" build you posted in the zip above, I can't get it to work with the GPU; I just get gibberish output or no output, even when using the option "-ngl 35". I have copied the compiled ".llamafile" folder to the folder that "llamafile-server.exe" resides in. If I don't use "-ngl 35", then "llamafile-server.exe" crashes when I enter a prompt in the browser on the computer with the RTX 3060, but it works as expected on the computer without it. This is all when using a GGUF LLM file I downloaded from Hugging Face, which always works fine on the computer without the RTX 3060.

Maybe I have the wrong expectations, so please clarify for me. I was hoping to be able to GPU-accelerate "llamafile-server.exe" with the RTX 3060 while using my own GGUF LLM from Hugging Face. Perhaps I'm mistaken in my understanding of the current state of the project. Does GPU acceleration only work for the prebuilt llamafile EXEs that already include an LLM, like "llava-v1.5-7b-q4-server.llamafile"?

Thanks.

Hiltronix avatar Dec 29 '23 16:12 Hiltronix

@jart Here is what I've found.

[... rest of the quoted comment trimmed for brevity; see above ...]

Maybe I'm getting it wrong, but at no point do you mention loading a model via -m. What commands are you running exactly? What output are you getting? What model are you using?

GPU acceleration, for example, works fine for the user above in this same thread: https://github.com/Mozilla-Ocho/llamafile/issues/104#issuecomment-1871666169

So maybe if you provide more detail you can get more help.

francisco-lafe avatar Dec 29 '23 19:12 francisco-lafe

As I mentioned above, the output I get is random gibberish or no output. It's one or the other each run, always different. I am using -m as it's an external model.

The format of the command I'm using is as follows, with three of the model names I'm testing listed below.

 llamafile-server.exe -ngl 35 -m "kai-7b-instruct.Q4_K_M.gguf"
 llamafile-server.exe -ngl 35 -m "llama-2-7b-chat.Q4_K_M.gguf"
 llamafile-server.exe -ngl 35 -m "mistral-7b-instruct-v0.2.Q4_K_M.gguf"

The option "--no-mmap" that the guy that has it working makes no difference in my case, same results.

Hiltronix avatar Dec 29 '23 20:12 Hiltronix

Keep in mind that each of the models you listed has a different prompt setup. For example, here's what Mistral expects: [screenshot of the Mistral prompt template]

I don't know about Kai or Llama but I'm certain they need their proper templates to work correctly.

francisco-lafe avatar Dec 29 '23 21:12 francisco-lafe

@Hiltronix It sounds then like, in your case, something is wrong with our new tinyBLAS library, which is an attempt to help people get GPU acceleration without needing to install CUDA and MSVC. I might have to buy one of those graphics cards in order to troubleshoot what's wrong.

jart avatar Dec 29 '23 21:12 jart

@Hiltronix It sounds then like, in your case, something is wrong with our new tinyBLAS library, which is an attempt to help people get GPU acceleration without needing to install CUDA and MSVC. I might have to buy one of those graphics cards in order to troubleshoot what's wrong.

Thanks for the reply. I'm rather new to experimenting with LLMs, and decided to use the Mozilla-Ocho project to get me started and get some quick general results before diving in deep. The llama server and the web interface have worked well for me so far without the NVIDIA GPU. I've used multiple chat-oriented GGUF models from Hugging Face with the default settings in the web interface with good results.

I just bought the RTX 3060 12GB GPU thinking it would be a good "bang for the buck" starter card for this. I actually have no other NVIDIA GPU to compare against to see whether it's this particular GPU that is the issue. I've tried to detail what I've done so far to make GPU acceleration work with the llama server. If there is something else you'd like me to try, to make sure it's not me or my setup, I'm open to suggestions. If you think it is this exact GPU that is somehow different, I'm happy to help by trying builds you want me to test.

Hiltronix avatar Dec 29 '23 22:12 Hiltronix

Really happy to hear CPU inference is working well. Thanks for all the clues you've shared on GPU support too! If you want to experiment with other GPUs, one thing I like to do sometimes is rent a GCE VM with an NVIDIA L4 card for a few hours. It costs a few dollars, and I can just scp my llamafile onto the VM via SSH and run it. It's nice for testing releases. Windows is the tricky one though. The good news is that some other known issues with Windows performance should be getting fixed within the next 24 hours. Stay tuned. I'm crossing my fingers and hoping the planned improvements will solve other mysteries too.

jart avatar Dec 29 '23 22:12 jart

I've just merged a major improvement to tinyBLAS on Windows in #153 and uploaded new weights to Hugging Face for LLaVA. If you've been having issues running LLaVA on Windows for image processing, then please give these latest llamafiles a try.

  • https://huggingface.co/jartine/llava-v1.5-7B-GGUF/tree/main

I don't know for certain if it fixes this issue, but it's worth trying in case there's any overlap with the improvements we've made.

jart avatar Dec 30 '23 16:12 jart

Ok, let's do this. When I'm back home today I'll try to run the server you provided + Mistral without any parameters, then add -ngl, and then add --no-mmap. Sounds good?

I ran: .\server.exe -m .\mistral-7b-instruct-v0.2.Q4_K_M.gguf -ngl 35

And it runs just fine, at least for text prompts; I did not try image analysis. The issue I was having was with the prompt template.

frenchiveruti avatar Dec 30 '23 21:12 frenchiveruti

You definitely can't give an image to Mistral since it doesn't have a vision model (i.e. the --mmproj file). You can only do images with LLaVA right now. Glad to hear you figured out the issue with the prompt.

I haven't heard from the OP in a while, so I'm leaning toward closing this issue. Could anyone else confirm whether the tinyBLAS improvement or prompt engineering helps?

jart avatar Dec 30 '23 22:12 jart

I'll be honest, I can't follow anything that you all are talking about. I'll be back to my computer in a day or two and I'll try the default method again and report back if it works.

dkallen78 avatar Dec 31 '23 18:12 dkallen78