
How to run CLIP model on GPU?

Open zhu-j-faceonlive opened this issue 11 months ago • 2 comments

I built the package with CUDA, so llama runs on the GPU, but the CLIP part still runs on the CPU. How can I fix this? Thanks.

clip_model_load: loaded meta data with 19 key-value pairs and 455 tensors from .../models/minicpm/mmproj-model-f16.gguf
clip_model_load: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
clip_model_load: - kv 0: general.architecture str = clip
clip_model_load: - kv 1: clip.has_text_encoder bool = false
clip_model_load: - kv 2: clip.has_vision_encoder bool = true
clip_model_load: - kv 3: clip.has_minicpmv_projector bool = true
clip_model_load: - kv 4: general.file_type u32 = 1
clip_model_load: - kv 5: general.description str = image encoder for MiniCPM-V
clip_model_load: - kv 6: clip.projector_type str = resampler
clip_model_load: - kv 7: clip.minicpmv_version i32 = 3
clip_model_load: - kv 8: clip.vision.image_size u32 = 448
clip_model_load: - kv 9: clip.vision.patch_size u32 = 14
clip_model_load: - kv 10: clip.vision.embedding_length u32 = 1152
clip_model_load: - kv 11: clip.vision.feed_forward_length u32 = 4304
clip_model_load: - kv 12: clip.vision.projection_dim u32 = 0
clip_model_load: - kv 13: clip.vision.attention.head_count u32 = 16
clip_model_load: - kv 14: clip.vision.attention.layer_norm_epsilon f32 = 0.000001
clip_model_load: - kv 15: clip.vision.block_count u32 = 26
clip_model_load: - kv 16: clip.vision.image_mean arr[f32,3] = [0.500000, 0.500000, 0.500000]
clip_model_load: - kv 17: clip.vision.image_std arr[f32,3] = [0.500000, 0.500000, 0.500000]
clip_model_load: - kv 18: clip.use_gelu bool = true
clip_model_load: - type f32: 285 tensors
clip_model_load: - type f16: 170 tensors
clip_model_load: CLIP using CPU backend
key clip.use_silu not found in file
clip_model_load: params backend buffer size = 996.02 MB (455 tensors)
key clip.vision.image_grid_pinpoints not found in file
key clip.vision.mm_patch_merge_type not found in file
key clip.vision.image_crop_resolution not found in file
clip_image_build_graph: 448 448
clip_model_load: compute allocated memory: 102.80 MB
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA RTX A6000, compute capability 8.6, VMM: yes
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA RTX A6000) - 47995 MiB free
llama_model_loader: loaded meta data with 22 key-value pairs and 339 tensors from /home/paperspace/Documents/AddrEngine/Pro/models/minicpm/ggml-model-Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = qwen2
llama_model_loader: - kv 1: general.name str = model
llama_model_loader: - kv 2: qwen2.block_count u32 = 28
llama_model_loader: - kv 3: qwen2.context_length u32 = 32768
llama_model_loader: - kv 4: qwen2.embedding_length u32 = 3584
llama_model_loader: - kv 5: qwen2.feed_forward_length u32 = 18944
llama_model_loader: - kv 6: qwen2.attention.head_count u32 = 28
llama_model_loader: - kv 7: qwen2.attention.head_count_kv u32 = 4
llama_model_loader: - kv 8: qwen2.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 9: qwen2.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 10: general.file_type u32 = 15
llama_model_loader: - kv 11: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 12: tokenizer.ggml.pre str = qwen2
llama_model_loader: - kv 13: tokenizer.ggml.tokens arr[str,151666] = ["!", """, "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 14: tokenizer.ggml.token_type arr[i32,151666] = [3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 15: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 16: tokenizer.ggml.bos_token_id u32 = 151644
llama_model_loader: - kv 17: tokenizer.ggml.eos_token_id u32 = 151645
llama_model_loader: - kv 18: tokenizer.ggml.unknown_token_id u32 = 128244
llama_model_loader: - kv 19: tokenizer.ggml.padding_token_id u32 = 0
llama_model_loader: - kv 20: tokenizer.chat_template str = {% for message in messages %}{% if lo...
llama_model_loader: - kv 21: general.quantization_version u32 = 2
llama_model_loader: - type f32: 141 tensors
llama_model_loader: - type q4_K: 169 tensors
llama_model_loader: - type q6_K: 29 tensors
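
For reference, this is roughly how that pair of GGUF files is loaded through llama-cpp-python (a sketch; the MiniCPMv26ChatHandler name and the file paths are assumptions based on the log above and the installed version). Note that n_gpu_layers only controls the language model; the mmproj/CLIP encoder's backend is chosen inside llama.cpp's clip.cpp, which is why the log can report "CLIP using CPU backend" even on a CUDA build.

import base64

from llama_cpp import Llama
from llama_cpp.llama_chat_format import MiniCPMv26ChatHandler  # handler name varies by release


def image_to_data_uri(path: str) -> str:
    # llama-cpp-python accepts images as data URIs (or plain http(s) URLs).
    with open(path, "rb") as f:
        return "data:image/png;base64," + base64.b64encode(f.read()).decode("utf-8")


# The mmproj GGUF holds the CLIP/vision encoder; the main GGUF holds the Qwen2 LLM.
chat_handler = MiniCPMv26ChatHandler(
    clip_model_path="models/minicpm/mmproj-model-f16.gguf"
)

llm = Llama(
    model_path="models/minicpm/ggml-model-Q4_K_M.gguf",
    chat_handler=chat_handler,
    n_ctx=4096,       # leave room for the image embedding tokens
    n_gpu_layers=-1,  # offloads the LLM layers; it does not move the CLIP encoder
)

response = llm.create_chat_completion(
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": image_to_data_uri("image.png")}},
                {"type": "text", "text": "Describe this image."},
            ],
        }
    ]
)
print(response["choices"][0]["message"]["content"])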

zhu-j-faceonlive avatar Feb 03 '25 03:02 zhu-j-faceonlive

This still seems to be an issue.

$ GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 cmake -B build -DGGML_CUDA=ON -DGGML_CUDA_F16=ON -DGGML_USE_CUDA=1
$ GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 GGML_USE_CUDA=1 cmake --build build --config Release -j 8
$ GGML_USE_CUDA=1 build/bin/llama-qwen2vl-cli -m models/vl7b_instruct_q5_k_l.gguf --mmproj models/mmproj-7b_16f.gguf -p 'Perform optical character recognition (OCR).' --image ~/myimage.png -ngl 29

-------------------------------------------------
* stripped unnecessary stuff *

load_tensors: offloading 28 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 29/29 layers to GPU
load_tensors:        CUDA0 model buffer size =  4955.47 MiB
load_tensors:   CPU_Mapped model buffer size =   552.23 MiB
clip_model_load: model name:   Qwen2-VL-7B-Instruct
clip_model_load: description:  image encoder for Qwen2VL
clip_model_load: GGUF version: 3
clip_model_load: alignment:    32
clip_model_load: n_tensors:    521
clip_model_load: n_kv:         20
clip_model_load: ftype:        f16

clip_model_load: loaded meta data with 20 key-value pairs and 521 tensors from models/mmproj-7b.gguf
clip_model_load: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
clip_model_load: - kv   0:                       general.architecture str              = clip
clip_model_load: - kv   1:                        general.description str              = image encoder for Qwen2VL
clip_model_load: - kv   2:                          general.file_type u32              = 1
clip_model_load: - kv   3:                      clip.has_text_encoder bool             = false
clip_model_load: - kv   4:                    clip.has_vision_encoder bool             = true
clip_model_load: - kv   5:                    clip.has_qwen2vl_merger bool             = true
clip_model_load: - kv   6:                        clip.projector_type str              = qwen2vl_merger
clip_model_load: - kv   7:                              clip.use_silu bool             = false
clip_model_load: - kv   8:                              clip.use_gelu bool             = false
clip_model_load: - kv   9:                     clip.vision.patch_size u32              = 14
clip_model_load: - kv  10:                     clip.vision.image_size u32              = 560
clip_model_load: - kv  11:               clip.vision.embedding_length u32              = 1280
clip_model_load: - kv  12:                 clip.vision.projection_dim u32              = 3584
clip_model_load: - kv  13:           clip.vision.attention.head_count u32              = 16
clip_model_load: - kv  14:   clip.vision.attention.layer_norm_epsilon f32              = 0.000001
clip_model_load: - kv  15:                    clip.vision.block_count u32              = 32
clip_model_load: - kv  16:            clip.vision.feed_forward_length u32              = 0
clip_model_load: - kv  17:                               general.name str              = Qwen2-VL-7B-Instruct
clip_model_load: - kv  18:                     clip.vision.image_mean arr[f32,3]       = [0.481455, 0.457828, 0.408211]
clip_model_load: - kv  19:                      clip.vision.image_std arr[f32,3]       = [0.268630, 0.261303, 0.275777]
clip_model_load: - type  f32:  325 tensors
clip_model_load: - type  f16:  196 tensors
clip_model_load: CLIP using CPU backend
clip_model_load: text_encoder:   0
clip_model_load: vision_encoder: 1
clip_model_load: llava_projector:  0
clip_model_load: minicpmv_projector:  0
clip_model_load: model size:     1289.95 MB
clip_model_load: metadata size:  0.18 MB

For the record, a CLIP run on a JPEG image takes 32 seconds on the CPU. With the 2B VL model, processing is almost instantaneous on an AM5 system.
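
A minimal way to reproduce that timing and confirm which backend the CLIP encoder actually picked is to wrap the exact same CLI call and scan its log output (a sketch; the binary and model paths are the ones from the command above and must exist locally):

import os
import subprocess
import time

cmd = [
    "build/bin/llama-qwen2vl-cli",
    "-m", "models/vl7b_instruct_q5_k_l.gguf",
    "--mmproj", "models/mmproj-7b_16f.gguf",
    "-p", "Perform optical character recognition (OCR).",
    "--image", os.path.expanduser("~/myimage.png"),
    "-ngl", "29",
]

start = time.perf_counter()
result = subprocess.run(cmd, capture_output=True, text=True)
print(f"wall time: {time.perf_counter() - start:.1f} s")

# llama.cpp logging usually goes to stderr; scan both streams to be safe.
for line in (result.stdout + result.stderr).splitlines():
    if "CLIP using" in line or "offloaded" in line or "model buffer size" in line:
        print(line)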

peltax avatar Feb 17 '25 18:02 peltax

The current version of llama.cpp has disabled running CLIP on the GPU. However, I don't know why it still runs on the CPU even when I install an older version.
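
It may help to separate the two questions: whether the installed wheel was built with CUDA at all can be checked as below (a sketch; these low-level bindings exist in recent llama-cpp-python releases, but availability can shift between versions), while the CLIP-on-GPU behaviour is decided inside llama.cpp's clip.cpp regardless of that.

import llama_cpp

print("llama-cpp-python:", llama_cpp.__version__)

# True only if the bundled llama.cpp was compiled with a GPU backend (e.g. CUDA).
print("GPU offload supported:", llama_cpp.llama_supports_gpu_offload())

# Lists the backends/features the shared library was compiled with.
print(llama_cpp.llama_print_system_info().decode("utf-8", errors="replace"))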

2U1 avatar Feb 21 '25 00:02 2U1