CPU buffer size issue
I used the microsoft/bitnet-b1.58-2B-4T-gguf model.
Set up the environment with:
python setup_env.py -md ./models/BitNet-b1.58-2B-4T/ -q i2_s
Ran inference with:
python run_inference.py -m ./models/BitNet-b1.58-2B-4T/ggml-model-i2_s.gguf -p "You are a helpful assistant" -cnv -n 1024
and got this output:
llm_load_print_meta: max token length = 256
llm_load_tensors: ggml ctx size = 0.15 MiB
llm_load_tensors: CPU buffer size = 1124.81 MiB
...............................
llama_new_context_with_model: n_batch is less than GGML_KQ_MASK_PAD - increasing to 32
llama_new_context_with_model: n_ctx = 2048
llama_new_context_with_model: n_batch = 32
llama_new_context_with_model: n_ubatch = 32
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CPU KV buffer size = 150.00 MiB
llama_new_context_with_model: KV self size = 150.00 MiB, K (f16): 75.00 MiB, V (f16): 75.00 MiB
llama_new_context_with_model: CPU output buffer size = 0.49 MiB
llama_new_context_with_model: CPU compute buffer size = 15.97 MiB
llama_new_context_with_model: graph nodes = 1116
llama_new_context_with_model: graph splits = 1
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 2
The CPU buffer size is 1124.81 MiB. Why is memory usage significantly higher than the ~0.4 GB I'd expect from a 2B-parameter model at 1.58 bits per weight? Is it an issue with my cmd args?
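For reference, here is the back-of-envelope arithmetic behind my 0.4 GB expectation, plus a guess at where the extra memory might come from. This is only a sketch: the parameter count, vocab/hidden sizes, and the assumption that i2_s packs ternary weights into 2 bits while the embedding stays at higher precision are mine, not read from the GGUF metadata.

```python
# Rough memory estimate for the i2_s GGUF.
# All numbers below are assumptions for illustration only.

GIB = 1024 ** 3

# Naive expectation: every weight stored at 1.58 bits.
total_params = 2.4e9  # assumed: the "2B" model is roughly 2.4B params
naive_bytes = total_params * 1.58 / 8
print(f"naive 1.58-bit estimate: {naive_bytes / GIB:.2f} GiB")  # ~0.44 GiB

# More realistic guess: i2_s packs ternary weights into 2 bits, and the
# embedding matrix is likely kept at f16 rather than ternary.
embed_params = 128_256 * 2560  # assumed vocab x hidden size
ternary_params = total_params - embed_params
est_bytes = ternary_params * 2 / 8 + embed_params * 2  # 2-bit pack + f16 embed
print(f"2-bit + f16-embedding estimate: {est_bytes / GIB:.2f} GiB")  # ~1.1 GiB
```

Under those assumptions the second estimate lands close to the 1124.81 MiB the log reports, which makes me suspect the gap between 0.4 GB and 1.1 GB is expected for this quantization rather than a problem with my args, but I'd appreciate confirmation.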