invalid configuration argument
E:\tools\llama>main.exe -m ....\GPT_MOD\Wizard-Vicuna-13B-Uncensored.ggmlv3.q4_1.bin -ngl 32
main: build = 632 (35a8491)
main: seed = 1686234538
ggml_init_cublas: found 4 CUDA devices:
Device 0: NVIDIA GeForce RTX 2080 Ti
Device 1: NVIDIA GeForce RTX 2080 Ti
Device 2: NVIDIA GeForce RTX 2080 Ti
Device 3: NVIDIA GeForce RTX 2080 Ti
llama.cpp: loading model from ....\GPT_MOD\Wizard-Vicuna-13B-Uncensored.ggmlv3.q4_1.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 512
llama_model_load_internal: n_embd = 5120
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 40
llama_model_load_internal: n_layer = 40
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 3 (mostly Q4_1)
llama_model_load_internal: n_ff = 13824
llama_model_load_internal: n_parts = 1
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size = 0.09 MB
llama_model_load_internal: using CUDA for GPU acceleration
ggml_cuda_set_main_device: using device 0 (NVIDIA GeForce RTX 2080 Ti) as main device
llama_model_load_internal: mem required = 3756.23 MB (+ 1608.00 MB per state)
llama_model_load_internal: allocating batch_size x 1 MB = 512 MB VRAM for the scratch buffer
llama_model_load_internal: offloading 32 layers to GPU
llama_model_load_internal: total VRAM used: 6564 MB
...............................................................................
llama_init_from_file: kv self size = 400.00 MB
system_info: n_threads = 24 / 48 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = -1, n_keep = 0
CUDA error 9 at D:\a\llama.cpp\llama.cpp\ggml-cuda.cu:1574: invalid configuration argument
Same here.
Same here. It seems to happen only when splitting the load across two GPUs. If I use the -ts parameter (described here) to force everything onto one GPU, such as -ts 1,0 or even -ts 0,1, it works. So that's at least a workaround in the meantime, just without multi-GPU support.
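In case it helps anyone else, here is a minimal sketch of that workaround against the first report above (the model path, the -ngl value, and the 1,0 split are placeholders for your own setup; -ts is the short form of --tensor-split):

main.exe -m Wizard-Vicuna-13B-Uncensored.ggmlv3.q4_1.bin -ngl 32 -ts 1,0

-ts 1,0 assigns the whole split to device 0 and -ts 0,1 assigns it to device 1; omitting -ts (or passing something like -ts 1,1) spreads the layers across both cards, which is the case that crashes here.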
>main -i --interactive-first -r "### Human:" --temp 0 -c 2048 -n -1 --ignore-eos --repeat_penalty 1.2 --instruct -m Wizard-Vicuna-13B-Uncensored.ggmlv3.q8_0.bin --n-gpu-layers 40
main: build = 635 (5c64a09)
main: seed = 1686175494
ggml_init_cublas: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 4090
Device 1: NVIDIA GeForce RTX 3090
llama.cpp: loading model from Wizard-Vicuna-13B-Uncensored.ggmlv3.q8_0.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 2048
llama_model_load_internal: n_embd = 5120
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 40
llama_model_load_internal: n_layer = 40
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 7 (mostly Q8_0)
llama_model_load_internal: n_ff = 13824
llama_model_load_internal: n_parts = 1
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size = 0.09 MB
llama_model_load_internal: using CUDA for GPU acceleration
ggml_cuda_set_main_device: using device 0 (NVIDIA GeForce RTX 4090) as main device
llama_model_load_internal: mem required = 2380.14 MB (+ 1608.00 MB per state)
llama_model_load_internal: allocating batch_size x 1 MB = 512 MB VRAM for the scratch buffer
llama_model_load_internal: offloading 40 layers to GPU
llama_model_load_internal: total VRAM used: 13370 MB
...................................................................................................
llama_init_from_file: kv self size = 1600.00 MB
system_info: n_threads = 16 / 32 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
main: interactive mode on.
Reverse prompt: '### Human:'
Reverse prompt: '### Instruction:
'
sampling: repeat_last_n = 64, repeat_penalty = 1.200000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.000000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 2048, n_batch = 512, n_predict = -1, n_keep = 2
== Running in interactive mode. ==
- Press Ctrl+C to interject at any time.
- Press Return to return control to LLaMa.
- To return control without starting a new line, end your input with '/'.
- If you want to submit another line, end your input with '\'.
CUDA error 9 at D:\a\llama.cpp\llama.cpp\ggml-cuda.cu:1574: invalid configuration argument
Same here on main.exe and server.
This issue seems to only occur on Windows systems with multiple graphics cards.
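For anyone who wants to test that theory, or just needs another stopgap, here is a hedged sketch (untested by me, model path is a placeholder) assuming a Windows cmd shell: the CUDA runtime honors the standard CUDA_VISIBLE_DEVICES environment variable, so hiding all but one card should leave llama.cpp seeing a single device, similar in effect to -ts 1,0.

set CUDA_VISIBLE_DEVICES=0
main.exe -m Wizard-Vicuna-13B-Uncensored.ggmlv3.q8_0.bin --n-gpu-layers 40

Clear the variable afterwards (set CUDA_VISIBLE_DEVICES=) if you want multi-GPU splitting back.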
Still happening on the latest build, 0bf7cf1.
Seems to be fixed at least as of 303f580
Getting this error on Linux after compiling with cuBLAS.
Same with https://huggingface.co/TheBloke/CausalLM-14B-GGUF
@JoseConseco Funnily enough, it was that exact same model too.
Yes, that is a problem with the model, not with llama.cpp, so it is not related to the issue in this thread.