Kerfuffle


Glad to help! I tried again. I don't know if it's important, but I just compile using `make`, so `make clean && make LLAMA_CUBLAS=1`. First, for reference, `perplexity` running on...

I got impatient and tried running with batch size 4:

```plaintext
Thread 1 "perplexity" received signal SIGSEGV, Segmentation fault.
0x00005555555632d8 in ggml_vec_mul_f32 (n=4096, z=0x7ffe48010000, x=0x7ffe48000000, y=0x7fffa6820400)
    at ggml.c:2073
2073 inline...
```

I said "latest commit", but that wasn't quite accurate, since you pushed that while I was composing the message. However, https://github.com/ggerganov/llama.cpp/pull/1632/commits/d5d900d75c09f894e3ba0960950ef8b9df7f4aa4 doesn't seem to have made a difference in the behavior: still...

> I assume you had run `mulmat-tune` bench

Sorry, no... I didn't know it was necessary. Is crashing/not working properly the expected behavior in that case? I apologize if I...

> Looks like the assertion error is caused by clearing tensor.backend in previous commit, I reverted.

Unfortunately it doesn't seem to have fixed the issue:

```plaintext
./mulmat-tune bench
[bench] model:...
```

Quick update: I checked out the latest version and gave it another try. `mulmat-tune bench` now runs; however, it doesn't seem to use cuBLAS.

```plaintext
[bench] model: 7B, type: Q4_0
```

I...

I'm sorry, actually it did work. I guess I just stopped it too early before (previously it explicitly said when it was using a CUDA backend). Seems like it...

> Would you please try less n_threads: 1, 2, 4?

Unfortunately, with the latest changes we're back to running into an assertion failure:

```plaintext
ggml_init_cublas: found 1 CUDA devices:
Device...
```

```plaintext
GGML_ASSERT: ggml.c:10034: comp_backend & GGML_TASK_BACKEND_CPU
```

Looks like `ggml_compute_forward_mul_mat_q_f32` may need a similar change?

I'm going to try to look at how to add this to `llm-samplers`. It will need the CFG logits though, so `llm` will need to handle that itself. I guess...