
[Feature] Vulkan acceleration for more quantization types

Open TiagoSantos81 opened this issue 9 months ago • 3 comments

System Info

GPT4All 2.5.0 desktop version on Windows 10 x64. Two systems, both with NVIDIA GPUs.

Information

  • [ ] The official example notebooks/scripts
  • [ ] My own modified scripts

Reproduction

  1. Load any Mistral base model with Q4_0 quantization, such as the default models in GPT4All Chat, on a GPU with more than 6 GB of free memory.
  2. Change the default pre-loaded model to an equivalent model with Q3_K_M quantization (smaller on disk), and restart the application (due to #1550).
  3. Run a short prompt and check whether the model was loaded onto the GPU, either below the speed metrics or in any memory profiling app (a scripted equivalent is sketched below).
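
Roughly the same experiment can be scripted with the Python bindings. This is a minimal sketch, assuming a 2.x version of the `gpt4all` package and illustrative model filenames; neither is taken from the original report:

```python
# Minimal sketch of the repro via the gpt4all Python bindings (assumed 2.x API).
# The model filenames below are illustrative; use whichever Q4_0 / Q3_K_M GGUFs you have.
from gpt4all import GPT4All

# Q4_0 build: expected to offload to the GPU via Vulkan.
model_q4 = GPT4All("mistral-7b-instruct-v0.1.Q4_0.gguf", device="gpu")
print(model_q4.generate("Say hello.", max_tokens=8))

# Q3_K_M build: smaller on disk, yet in 2.5.0 it runs on the CPU instead.
# Depending on the bindings version, this may fall back silently or raise
# if GPU initialization fails.
model_q3 = GPT4All("mistral-7b-instruct-v0.1.Q3_K_M.gguf", device="gpu")
print(model_q3.generate("Say hello.", max_tokens=8))
```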

Expected behavior

Smaller K_M-quantized GGUF models should fit on the same GPU as Q4_0 and Q5_0 ones.

TiagoSantos81 avatar Oct 22 '23 20:10 TiagoSantos81

Currently, GPU offloading is only supported for models based on the LLaMA or Falcon architecture stored in the Q4_0, Q4_1, fp16, or fp32 formats. If you attempt to load an unsupported model, a message should appear in the lower-right corner while it is generating, indicating that it is using the CPU due to an unsupported model type or format.
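
As a rough illustration of that rule, a filename-based check might look like the sketch below. The supported set comes from the comment above; the helper itself is hypothetical and not part of GPT4All's API (the real check inspects tensor types inside the GGUF file, not the filename):

```python
import re

# Quantizations the Vulkan backend can offload as of GPT4All 2.5.0,
# per the comment above (Q4_0, Q4_1, fp16, fp32).
GPU_OFFLOAD_QUANTS = {"Q4_0", "Q4_1", "F16", "F32"}

def likely_gpu_offloadable(gguf_filename: str) -> bool:
    """Guess from the filename suffix whether the model can be GPU-offloaded.

    Hypothetical helper for illustration only.
    """
    match = re.search(r"\.(Q\d_[01K](?:_[SML])?|F16|F32)\.gguf$",
                      gguf_filename, re.IGNORECASE)
    if not match:
        return False
    quant = match.group(1).upper()
    # K-quants such as Q3_K_M are normalized to their base type (Q3_K).
    base = re.sub(r"_[SML]$", "", quant)
    return base in GPU_OFFLOAD_QUANTS

print(likely_gpu_offloadable("mistral-7b-instruct-v0.1.Q4_0.gguf"))    # True
print(likely_gpu_offloadable("mistral-7b-instruct-v0.1.Q3_K_M.gguf"))  # False
```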

cebtenzzre avatar Oct 22 '23 20:10 cebtenzzre

Hello @cebtenzzre, that does not seem to be the case, at least not in version 2.5.0:

[screenshot]

Nov 02 11:56:41 HOST plasmashell[280926]: ggml_vk_graph_compute: MUL_MAT: Unsupported quantization: 13/0
Nov 02 11:56:41 HOST plasmashell[280926]: ggml_vk_graph_compute: node 942, op = MUL_MAT not implemented

Looks like it's the same result for any quantized model, even LLaMA ones.
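
For what it's worth, the two numbers in that message look like ggml tensor type IDs for the two MUL_MAT operands. A small lookup table, assuming the standard ggml type enum in llama.cpp at the time, decodes them:

```python
# Assumed mapping from ggml_type enum values to names (llama.cpp, late 2023).
GGML_TYPE_NAMES = {
    0: "F32", 1: "F16", 2: "Q4_0", 3: "Q4_1",
    6: "Q5_0", 7: "Q5_1", 8: "Q8_0", 9: "Q8_1",
    10: "Q2_K", 11: "Q3_K", 12: "Q4_K", 13: "Q5_K", 14: "Q6_K", 15: "Q8_K",
}

# "Unsupported quantization: 13/0" would then mean a Q5_K weight tensor
# multiplied by an F32 activation tensor -- plausible for a Q3_K_M model,
# which stores some of its tensors as Q5_K.
print(GGML_TYPE_NAMES[13], GGML_TYPE_NAMES[0])  # Q5_K F32
```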

DistantThunder avatar Nov 02 '23 10:11 DistantThunder

that does not seem to be the case at least for version 2.5.0:

There is a bug in the detection of unsupported quantizations that was fixed in https://github.com/nomic-ai/llama.cpp/pull/11 and should be resolved in the next release.

cebtenzzre avatar Nov 02 '23 16:11 cebtenzzre