How to run CLIP model on GPU?
I built the package with CUDA, so llama runs on the GPU, but the CLIP part still runs on the CPU. How can I fix this? Thanks.
clip_model_load: loaded meta data with 19 key-value pairs and 455 tensors from .../models/minicpm/mmproj-model-f16.gguf
clip_model_load: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
clip_model_load: - kv 0: general.architecture str = clip
clip_model_load: - kv 1: clip.has_text_encoder bool = false
clip_model_load: - kv 2: clip.has_vision_encoder bool = true
clip_model_load: - kv 3: clip.has_minicpmv_projector bool = true
clip_model_load: - kv 4: general.file_type u32 = 1
clip_model_load: - kv 5: general.description str = image encoder for MiniCPM-V
clip_model_load: - kv 6: clip.projector_type str = resampler
clip_model_load: - kv 7: clip.minicpmv_version i32 = 3
clip_model_load: - kv 8: clip.vision.image_size u32 = 448
clip_model_load: - kv 9: clip.vision.patch_size u32 = 14
clip_model_load: - kv 10: clip.vision.embedding_length u32 = 1152
clip_model_load: - kv 11: clip.vision.feed_forward_length u32 = 4304
clip_model_load: - kv 12: clip.vision.projection_dim u32 = 0
clip_model_load: - kv 13: clip.vision.attention.head_count u32 = 16
clip_model_load: - kv 14: clip.vision.attention.layer_norm_epsilon f32 = 0.000001
clip_model_load: - kv 15: clip.vision.block_count u32 = 26
clip_model_load: - kv 16: clip.vision.image_mean arr[f32,3] = [0.500000, 0.500000, 0.500000]
clip_model_load: - kv 17: clip.vision.image_std arr[f32,3] = [0.500000, 0.500000, 0.500000]
clip_model_load: - kv 18: clip.use_gelu bool = true
clip_model_load: - type f32: 285 tensors
clip_model_load: - type f16: 170 tensors
clip_model_load: CLIP using CPU backend
key clip.use_silu not found in file
clip_model_load: params backend buffer size = 996.02 MB (455 tensors)
key clip.vision.image_grid_pinpoints not found in file
key clip.vision.mm_patch_merge_type not found in file
key clip.vision.image_crop_resolution not found in file
clip_image_build_graph: 448 448
clip_model_load: compute allocated memory: 102.80 MB
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA RTX A6000, compute capability 8.6, VMM: yes
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA RTX A6000) - 47995 MiB free
llama_model_loader: loaded meta data with 22 key-value pairs and 339 tensors from /home/paperspace/Documents/AddrEngine/Pro/models/minicpm/ggml-model-Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = qwen2
llama_model_loader: - kv 1: general.name str = model
llama_model_loader: - kv 2: qwen2.block_count u32 = 28
llama_model_loader: - kv 3: qwen2.context_length u32 = 32768
llama_model_loader: - kv 4: qwen2.embedding_length u32 = 3584
llama_model_loader: - kv 5: qwen2.feed_forward_length u32 = 18944
llama_model_loader: - kv 6: qwen2.attention.head_count u32 = 28
llama_model_loader: - kv 7: qwen2.attention.head_count_kv u32 = 4
llama_model_loader: - kv 8: qwen2.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 9: qwen2.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 10: general.file_type u32 = 15
llama_model_loader: - kv 11: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 12: tokenizer.ggml.pre str = qwen2
llama_model_loader: - kv 13: tokenizer.ggml.tokens arr[str,151666] = ["!", """, "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 14: tokenizer.ggml.token_type arr[i32,151666] = [3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 15: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t", ...
llama_model_loader: - kv 16: tokenizer.ggml.bos_token_id u32 = 151644
llama_model_loader: - kv 17: tokenizer.ggml.eos_token_id u32 = 151645
llama_model_loader: - kv 18: tokenizer.ggml.unknown_token_id u32 = 128244
llama_model_loader: - kv 19: tokenizer.ggml.padding_token_id u32 = 0
llama_model_loader: - kv 20: tokenizer.chat_template str = {% for message in messages %}{% if lo...
llama_model_loader: - kv 21: general.quantization_version u32 = 2
llama_model_loader: - type f32: 141 tensors
llama_model_loader: - type q4_K: 169 tensors
llama_model_loader: - type q6_K: 29 tensors
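From what I can tell, the backend for the CLIP/mmproj graph is chosen inside clip.cpp separately from the main llama model, so building with CUDA alone does not automatically move it to the GPU. Below is a rough sketch of that selection logic. It is paraphrased, not the actual clip.cpp source; the exact code differs between llama.cpp versions, and the function and log names here are illustrative.

// Rough sketch (not the actual clip.cpp source) of how the CLIP loader picks
// its ggml backend. The CUDA branch only exists if this translation unit was
// compiled with the GGML_USE_CUDA macro defined by the build system; setting
// it as a shell environment variable has no effect. When the branch is
// compiled out or disabled, the loader falls back to the CPU backend, which
// is what the "CLIP using CPU backend" log line above indicates.
#include <cstdio>
#include "ggml-backend.h"   // ggml_backend_t, ggml_backend_cpu_init, ggml_backend_free
#ifdef GGML_USE_CUDA
#include "ggml-cuda.h"      // ggml_backend_cuda_init
#endif

static ggml_backend_t clip_pick_backend(void) {
    ggml_backend_t backend = nullptr;
#ifdef GGML_USE_CUDA
    backend = ggml_backend_cuda_init(0);   // device 0
    if (backend) {
        printf("CLIP using CUDA backend\n");
    }
#endif
    if (!backend) {
        backend = ggml_backend_cpu_init(); // fallback
        printf("CLIP using CPU backend\n");
    }
    return backend;
}

int main(void) {
    ggml_backend_t backend = clip_pick_backend();
    ggml_backend_free(backend);
    return 0;
}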
This still seems to be an issue.
$ GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 cmake -B build -DGGML_CUDA=ON -DGGML_CUDA_F16=ON -DGGML_USE_CUDA=1
$ GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 GGML_USE_CUDA=1 cmake --build build --config Release -j 8
$ GGML_USE_CUDA=1 build/bin/llama-qwen2vl-cli -m models/vl7b_instruct_q5_k_l.gguf --mmproj models/mmproj-7b_16f.gguf -p 'Perform optical character recognition (OCR).' --image ~/myimage.png -ngl 29
-------------------------------------------------
* stripped unnecessary stuff *
load_tensors: offloading 28 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 29/29 layers to GPU
load_tensors: CUDA0 model buffer size = 4955.47 MiB
load_tensors: CPU_Mapped model buffer size = 552.23 MiB
clip_model_load: model name: Qwen2-VL-7B-Instruct
clip_model_load: description: image encoder for Qwen2VL
clip_model_load: GGUF version: 3
clip_model_load: alignment: 32
clip_model_load: n_tensors: 521
clip_model_load: n_kv: 20
clip_model_load: ftype: f16
clip_model_load: loaded meta data with 20 key-value pairs and 521 tensors from models/mmproj-7b.gguf
clip_model_load: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
clip_model_load: - kv 0: general.architecture str = clip
clip_model_load: - kv 1: general.description str = image encoder for Qwen2VL
clip_model_load: - kv 2: general.file_type u32 = 1
clip_model_load: - kv 3: clip.has_text_encoder bool = false
clip_model_load: - kv 4: clip.has_vision_encoder bool = true
clip_model_load: - kv 5: clip.has_qwen2vl_merger bool = true
clip_model_load: - kv 6: clip.projector_type str = qwen2vl_merger
clip_model_load: - kv 7: clip.use_silu bool = false
clip_model_load: - kv 8: clip.use_gelu bool = false
clip_model_load: - kv 9: clip.vision.patch_size u32 = 14
clip_model_load: - kv 10: clip.vision.image_size u32 = 560
clip_model_load: - kv 11: clip.vision.embedding_length u32 = 1280
clip_model_load: - kv 12: clip.vision.projection_dim u32 = 3584
clip_model_load: - kv 13: clip.vision.attention.head_count u32 = 16
clip_model_load: - kv 14: clip.vision.attention.layer_norm_epsilon f32 = 0.000001
clip_model_load: - kv 15: clip.vision.block_count u32 = 32
clip_model_load: - kv 16: clip.vision.feed_forward_length u32 = 0
clip_model_load: - kv 17: general.name str = Qwen2-VL-7B-Instruct
clip_model_load: - kv 18: clip.vision.image_mean arr[f32,3] = [0.481455, 0.457828, 0.408211]
clip_model_load: - kv 19: clip.vision.image_std arr[f32,3] = [0.268630, 0.261303, 0.275777]
clip_model_load: - type f32: 325 tensors
clip_model_load: - type f16: 196 tensors
clip_model_load: CLIP using CPU backend
clip_model_load: text_encoder: 0
clip_model_load: vision_encoder: 1
clip_model_load: llava_projector: 0
clip_model_load: minicpmv_projector: 0
clip_model_load: model size: 1289.95 MB
clip_model_load: metadata size: 0.18 MB
For the record, a CLIP run on a JPEG image takes 32 seconds on the CPU. With the 2B VL model, the processing is almost instantaneous on an AM5 system.
The current version of llama.cpp has disabled running CLIP on the GPU. However, I don't know why it still runs on the CPU even when I install an older version.
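To rule out a build problem first, newer ggml revisions expose the compiled-in backends through a device registry, so you can check whether the binary even sees a CUDA device before looking at clip.cpp. A small diagnostic sketch, assuming that newer registry API (function names may differ between ggml versions):

// Diagnostic sketch: list the backend devices this ggml build knows about.
// If only a CPU device shows up, the binary was built without CUDA support
// and no runtime environment variable will move CLIP onto the GPU.
#include <cstdio>
#include "ggml-backend.h"   // backend device registry (newer ggml revisions)

int main(void) {
    const size_t n = ggml_backend_dev_count();
    printf("ggml backend devices: %zu\n", n);
    for (size_t i = 0; i < n; ++i) {
        ggml_backend_dev_t dev = ggml_backend_dev_get(i);
        printf("  %zu: %s - %s\n", i,
               ggml_backend_dev_name(dev),          // e.g. "CUDA0" or "CPU"
               ggml_backend_dev_description(dev));  // e.g. "NVIDIA RTX A6000"
    }
    return 0;
}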