llama.cpp
Misc. bug: Quantization process 100 times slower on Windows (dockerized)
Name and Version
llama-quantize, build = 4691 (369be559)
Operating systems
Windows
Which llama.cpp modules do you know to be affected?
llama-quantize
Command line
llama_cpp/build/bin/llama-quantize quantized_models/Meta-Llama-3.1-8B-Instruct-abliterated-GGUF/meta-llama-3.1-8b-instruct-abliterated.bf16.gguf quantized_models/Meta-Llama-3.1-8B-Instruct-abliterated-GGUF/meta-llama-3.1-8b-instruct-abliterated-Q2_K.gguf Q2_K
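If it helps with diagnosis: llama-quantize also accepts an optional thread count as a trailing positional argument, and pinning it explicitly rules out bad thread-count detection inside the container. A minimal sketch of the same invocation (the count of 8 is only an example):

# Same command, but with an explicit thread count as the last argument
llama_cpp/build/bin/llama-quantize \
  quantized_models/Meta-Llama-3.1-8B-Instruct-abliterated-GGUF/meta-llama-3.1-8b-instruct-abliterated.bf16.gguf \
  quantized_models/Meta-Llama-3.1-8B-Instruct-abliterated-GGUF/meta-llama-3.1-8b-instruct-abliterated-Q2_K.gguf \
  Q2_K 8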
Problem description & steps to reproduce
Issue Summary: The quantization process using llama.cpp takes 100 times longer on Windows (dockerized) compared to Linux, with the container using only 1% of the CPU despite being capable of using all available cores.
Steps to Reproduce:
- Run llama-quantize on Windows and Linux with the same model (I am doing this in SpongeQuant, with the CPU-only option).
- Observe the CPU usage during the quantization process (using htop on Linux and Task Manager or docker stats on Windows; see the command sketch below this list).
- Compare the time taken for the same operation between Windows and Linux.
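For reference, the observation on the Windows side boils down to roughly the following commands (the container name spongequant is an example, not necessarily the real one):

# Windows host: live CPU usage of the running container
docker stats spongequant

# Number of CPUs visible inside the container
docker exec spongequant nproc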
Expected Behavior: The quantization process should perform similarly on both Windows and Linux, utilizing the available CPU resources efficiently.
Actual Behavior: On Windows:
- The process is extremely slow, taking about 100 times longer than on Linux.
- The process uses only about 1% of the CPU throughout the run (apart from a brief spike to around 80% CPU when it starts).
On Linux, the process runs as expected, with full CPU utilization and normal speed.
Additional Information:
- The issue persists regardless of the Docker container configuration or resource limits.
- The behavior looks like CPU throttling, but no specific throttling setting has been identified on Windows.
- I have tested with the stress package, and the container is capable of using 100% of the CPU, yet quantization remains limited (see the stress sketch below this list).
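The stress check mentioned above was along these lines (a minimal sketch run inside the container; the worker count and timeout are examples):

# Install the stress tool inside the container and saturate every visible core
apt-get update && apt-get install -y stress
stress --cpu "$(nproc)" --timeout 60
# While this runs, `docker stats` on the host reports roughly 100% per core,
# so the container itself is not CPU-limited.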
Container Setup: The issue occurs in a container based on Ubuntu 22.04, with the Python dependencies installed and llama.cpp compiled from source. On Windows, Docker runs through WSL2.
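Because Docker Desktop on Windows runs containers inside the WSL2 VM, I also looked at what the VM itself exposes; a minimal sketch of that check (the .wslconfig values shown are examples, not my actual limits):

# From a WSL2 shell (or inside the container): CPUs the VM exposes
nproc
grep -c ^processor /proc/cpuinfo

# %UserProfile%\.wslconfig can cap the VM; example of explicit limits:
#   [wsl2]
#   processors=16
#   memory=32GB
# After editing it, restart the VM from Windows with: wsl --shutdown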
# -------------------------------------------------------------------------------
# Use a plain Ubuntu image for CPU-only mode.
# -------------------------------------------------------------------------------
FROM ubuntu:22.04
# -------------------------------------------------------------------------------
# Disable interactive prompts.
# -------------------------------------------------------------------------------
ENV DEBIAN_FRONTEND=noninteractive
# -------------------------------------------------------------------------------
# Install required system dependencies.
# -------------------------------------------------------------------------------
RUN apt-get update && apt-get install -y \
build-essential \
cmake \
git \
curl \
wget \
ninja-build \
python3 \
python3-pip \
libssl-dev \
libffi-dev \
&& rm -rf /var/lib/apt/lists/*
# -------------------------------------------------------------------------------
# Set the working directory.
# -------------------------------------------------------------------------------
WORKDIR /app
# -------------------------------------------------------------------------------
# Create a cache directory and set environment variables.
# -------------------------------------------------------------------------------
RUN mkdir -p /app/.cache && chmod -R 777 /app/.cache
ENV HF_HOME=/app/.cache
ENV HOME=/app
# -------------------------------------------------------------------------------
# Copy the requirements file.
# -------------------------------------------------------------------------------
COPY ./app/requirements.cpu.txt /app/
# -------------------------------------------------------------------------------
# Upgrade pip.
# -------------------------------------------------------------------------------
RUN python3 -m pip install --upgrade pip==25.0
# -------------------------------------------------------------------------------
# Force-install torch first so that auto-gptq’s metadata generation finds it.
# -------------------------------------------------------------------------------
RUN python3 -m pip install torch==2.6.0
# -------------------------------------------------------------------------------
# Install the rest of the Python dependencies.
# -------------------------------------------------------------------------------
RUN python3 -m pip install -r requirements.cpu.txt
# -------------------------------------------------------------------------------
# Clone and build llama_cpp (for GGUF quantization).
# -------------------------------------------------------------------------------
RUN git clone https://github.com/ggerganov/llama.cpp.git /app/llama_cpp
WORKDIR /app/llama_cpp
RUN mkdir build && cd build && \
cmake -DCMAKE_BUILD_TYPE=Release \
-G Ninja .. && \
ninja -j$(nproc)
# -------------------------------------------------------------------------------
# Copy the rest of your application files.
# -------------------------------------------------------------------------------
COPY ./app /app
WORKDIR /app
# -------------------------------------------------------------------------------
# Expose the port (for Gradio UI, for example) and set the entrypoint.
# -------------------------------------------------------------------------------
EXPOSE 7860
CMD ["python3", "app.py"]
First Bad Commit
No response
Relevant log output
=== Starting SpongeQuant Quantization Process ===
=== Processing model: mlabonne/Meta-Llama-3.1-8B-Instruct-abliterated ===
=== Downloading Model ===
[INFO] Model ID: mlabonne/Meta-Llama-3.1-8B-Instruct-abliterated
[INFO] Target directory: models/Meta-Llama-3.1-8B-Instruct-abliterated
[INFO] Model mlabonne/Meta-Llama-3.1-8B-Instruct-abliterated is already fully downloaded at models/Meta-Llama-3.1-8B-Instruct-abliterated. Skipping download.
[INFO] Running GGUF quantization...
=== GGUF Quantization for Meta-Llama-3.1-8B-Instruct-abliterated ===
[INFO] Expected output file: quantized_models/Meta-Llama-3.1-8B-Instruct-abliterated-GGUF/meta-llama-3.1-8b-instruct-abliterated.bf16.gguf
[INFO] File quantized_models/Meta-Llama-3.1-8B-Instruct-abliterated-GGUF/meta-llama-3.1-8b-instruct-abliterated.bf16.gguf already exists. Skipping conversion.
[WARN] Skipping IQ2_XXS quantization because imatrix is not enabled.
[WARN] Skipping IQ2_XS quantization because imatrix is not enabled.
[WARN] Skipping IQ2_S quantization because imatrix is not enabled.
[WARN] Skipping IQ2_M quantization because imatrix is not enabled.
[WARN] Skipping IQ3_XXS quantization because imatrix is not enabled.
[WARN] Skipping IQ3_S quantization because imatrix is not enabled.
[WARN] Skipping IQ3_M quantization because imatrix is not enabled.
[WARN] Skipping IQ3_XS quantization because imatrix is not enabled.
[WARN] Skipping IQ4_XS quantization because imatrix is not enabled.
[WARN] Skipping IQ4_NL quantization because imatrix is not enabled.
[INFO] Quantizing with method 'Q2_K':
"llama_cpp/build/bin/llama-quantize" quantized_models/Meta-Llama-3.1-8B-Instruct-abliterated-GGUF/meta-llama-3.1-8b-instruct-abliterated.bf16.gguf quantized_models/Meta-Llama-3.1-8B-Instruct-abliterated-GGUF/meta-llama-3.1-8b-instruct-abliterated-Q2_K.gguf Q2_K
[DEBUG] Executing command: "llama_cpp/build/bin/llama-quantize" quantized_models/Meta-Llama-3.1-8B-Instruct-abliterated-GGUF/meta-llama-3.1-8b-instruct-abliterated.bf16.gguf quantized_models/Meta-Llama-3.1-8B-Instruct-abliterated-GGUF/meta-llama-3.1-8b-instruct-abliterated-Q2_K.gguf Q2_K
main: build = 4691 (369be559)
main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: quantizing 'quantized_models/Meta-Llama-3.1-8B-Instruct-abliterated-GGUF/meta-llama-3.1-8b-instruct-abliterated.bf16.gguf' to 'quantized_models/Meta-Llama-3.1-8B-Instruct-abliterated-GGUF/meta-llama-3.1-8b-instruct-abliterated-Q2_K.gguf' as Q2_K
llama_model_loader: loaded meta data with 34 key-value pairs and 291 tensors from quantized_models/Meta-Llama-3.1-8B-Instruct-abliterated-GGUF/meta-llama-3.1-8b-instruct-abliterated.bf16.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Meta Llama 3.1 8B Instruct Abliterated
llama_model_loader: - kv 3: general.finetune str = Instruct-abliterated
llama_model_loader: - kv 4: general.basename str = Meta-Llama-3.1
llama_model_loader: - kv 5: general.size_label str = 8B
llama_model_loader: - kv 6: general.license str = llama3.1
llama_model_loader: - kv 7: general.base_model.count u32 = 1
llama_model_loader: - kv 8: general.base_model.0.name str = Meta Llama 3.1 8B Instruct
llama_model_loader: - kv 9: general.base_model.0.organization str = Meta Llama
llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/meta-llama/Met...
llama_model_loader: - kv 11: general.tags arr[str,2] = ["abliterated", "uncensored"]
llama_model_loader: - kv 12: llama.block_count u32 = 32
llama_model_loader: - kv 13: llama.context_length u32 = 131072
llama_model_loader: - kv 14: llama.embedding_length u32 = 4096
llama_model_loader: - kv 15: llama.feed_forward_length u32 = 14336
llama_model_loader: - kv 16: llama.attention.head_count u32 = 32
llama_model_loader: - kv 17: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 18: llama.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv 19: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 20: general.file_type u32 = 32
llama_model_loader: - kv 21: llama.vocab_size u32 = 128256
llama_model_loader: - kv 22: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 23: llama.rope.scaling.type str = linear
llama_model_loader: - kv 24: llama.rope.scaling.factor f32 = 8.000000
llama_model_loader: - kv 25: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 26: tokenizer.ggml.pre str = llama-bpe
llama_model_loader: - kv 27: tokenizer.ggml.tokens arr[str,128256] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 28: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 29: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 30: tokenizer.ggml.bos_token_id u32 = 128000
llama_model_loader: - kv 31: tokenizer.ggml.eos_token_id u32 = 128009
llama_model_loader: - kv 32: tokenizer.chat_template str = {% set loop_messages = messages %}{% ...
llama_model_loader: - kv 33: general.quantization_version u32 = 2
llama_model_loader: - type f32: 65 tensors
llama_model_loader: - type bf16: 226 tensors
[ 1/ 291] output.weight - [ 4096, 128256, 1, 1], type = bf16, converting to q6_K .. size = 1002.00 MiB -> 410.98 MiB
[ 2/ 291] output_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 3/ 291] token_embd.weight - [ 4096, 128256, 1, 1], type = bf16, converting to q2_K .. size = 1002.00 MiB -> 164.39 MiB
[ 4/ 291] blk.0.attn_k.weight - [ 4096, 1024, 1, 1], type = bf16, converting to q2_K .. size = 8.00 MiB -> 1.31 MiB
[ 5/ 291] blk.0.attn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 6/ 291] blk.0.attn_output.weight - [ 4096, 4096, 1, 1], type = bf16, converting to q3_K .. size = 32.00 MiB -> 6.88 MiB
[ 7/ 291] blk.0.attn_q.weight - [ 4096, 4096, 1, 1], type = bf16, converting to q2_K .. size = 32.00 MiB -> 5.25 MiB
[ 8/ 291] blk.0.attn_v.weight - [ 4096, 1024, 1, 1], type = bf16, converting to q4_K .. size = 8.00 MiB -> 2.25 MiB
[ 9/ 291] blk.0.ffn_down.weight - [14336, 4096, 1, 1], type = bf16, converting to q3_K .. size = 112.00 MiB -> 24.06 MiB
[ 10/ 291] blk.0.ffn_gate.weight - [ 4096, 14336, 1, 1], type = bf16, converting to q2_K .. size = 112.00 MiB -> 18.38 MiB
[ 11/ 291] blk.0.ffn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 12/ 291] blk.0.ffn_up.weight - [ 4096, 14336, 1, 1], type = bf16, converting to q2_K .. size = 112.00 MiB -> 18.38 MiB
[ 13/ 291] blk.1.attn_k.weight - [ 4096, 1024, 1, 1], type = bf16, converting to q2_K .. size = 8.00 MiB -> 1.31 MiB
[ 14/ 291] blk.1.attn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 15/ 291] blk.1.attn_output.weight - [ 4096, 4096, 1, 1], type = bf16, converting to q3_K .. size = 32.00 MiB -> 6.88 MiB
[ 16/ 291] blk.1.attn_q.weight - [ 4096, 4096, 1, 1], type = bf16, converting to q2_K .. size = 32.00 MiB -> 5.25 MiB
[ 17/ 291] blk.1.attn_v.weight - [ 4096, 1024, 1, 1], type = bf16, converting to q4_K .. size = 8.00 MiB -> 2.25 MiB
[ 18/ 291] blk.1.ffn_down.weight - [14336, 4096, 1, 1], type = bf16, converting to q3_K .. size = 112.00 MiB -> 24.06 MiB
[ 19/ 291] blk.1.ffn_gate.weight - [ 4096, 14336, 1, 1], type = bf16, converting to q2_K .. size = 112.00 MiB -> 18.38 MiB
[ 20/ 291] blk.1.ffn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 21/ 291] blk.1.ffn_up.weight - [ 4096, 14336, 1, 1], type = bf16, converting to q2_K .. size = 112.00 MiB -> 18.38 MiB
[ 22/ 291] blk.2.attn_k.weight - [ 4096, 1024, 1, 1], type = bf16, converting to q2_K .. size = 8.00 MiB -> 1.31 MiB
[ 23/ 291] blk.2.attn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 24/ 291] blk.2.attn_output.weight - [ 4096, 4096, 1, 1], type = bf16, converting to q3_K .. size = 32.00 MiB -> 6.88 MiB
[ 25/ 291] blk.2.attn_q.weight - [ 4096, 4096, 1, 1], type = bf16, converting to q2_K .. size = 32.00 MiB -> 5.25 MiB
[ 26/ 291] blk.2.attn_v.weight - [ 4096, 1024, 1, 1], type = bf16, converting to q4_K .. size = 8.00 MiB -> 2.25 MiB
[ 27/ 291] blk.2.ffn_down.weight - [14336, 4096, 1, 1], type = bf16, converting to q3_K .. size = 112.00 MiB -> 24.06 MiB
[ 28/ 291] blk.2.ffn_gate.weight - [ 4096, 14336, 1, 1], type = bf16, converting to q2_K .. size = 112.00 MiB -> 18.38 MiB
[ 29/ 291] blk.2.ffn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 30/ 291] blk.2.ffn_up.weight - [ 4096, 14336, 1, 1], type = bf16, converting to q2_K .. size = 112.00 MiB -> 18.38 MiB
[ 31/ 291] blk.3.attn_k.weight - [ 4096, 1024, 1, 1], type = bf16, converting to q2_K .. size = 8.00 MiB -> 1.31 MiB
[ 32/ 291] blk.3.attn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 33/ 291] blk.3.attn_output.weight - [ 4096, 4096, 1, 1], type = bf16, converting to q3_K .. size = 32.00 MiB -> 6.88 MiB
[ 34/ 291] blk.3.attn_q.weight - [ 4096, 4096, 1, 1], type = bf16, converting to q2_K .. size = 32.00 MiB -> 5.25 MiB
[ 35/ 291] blk.3.attn_v.weight - [ 4096, 1024, 1, 1], type = bf16, converting to q4_K .. size = 8.00 MiB -> 2.25 MiB
[ 36/ 291] blk.3.ffn_down.weight - [14336, 4096, 1, 1], type = bf16, converting to q3_K .. size = 112.00 MiB -> 24.06 MiB
[ 37/ 291] blk.3.ffn_gate.weight - [ 4096, 14336, 1, 1], type = bf16, converting to q2_K .. size = 112.00 MiB -> 18.38 MiB
[ 38/ 291] blk.3.ffn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 39/ 291] blk.3.ffn_up.weight - [ 4096, 14336, 1, 1], type = bf16, converting to q2_K .. size = 112.00 MiB -> 18.38 MiB
[ 40/ 291] blk.4.attn_k.weight - [ 4096, 1024, 1, 1], type = bf16, converting to q2_K .. size = 8.00 MiB -> 1.31 MiB
[ 41/ 291] blk.4.attn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 42/ 291] blk.4.attn_output.weight - [ 4096, 4096, 1, 1], type = bf16, converting to q3_K .. size = 32.00 MiB -> 6.88 MiB
[ 43/ 291] blk.4.attn_q.weight - [ 4096, 4096, 1, 1], type = bf16, converting to q2_K .. size = 32.00 MiB -> 5.25 MiB
[ 44/ 291] blk.4.attn_v.weight - [ 4096, 1024, 1, 1], type = bf16, converting to q4_K .. size = 8.00 MiB -> 2.25 MiB
[ 45/ 291] blk.4.ffn_down.weight - [14336, 4096, 1, 1], type = bf16, converting to q3_K .. size = 112.00 MiB -> 24.06 MiB
[ 46/ 291] blk.4.ffn_gate.weight - [ 4096, 14336, 1, 1], type = bf16, converting to q2_K .. size = 112.00 MiB -> 18.38 MiB
[ 47/ 291] blk.4.ffn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 48/ 291] blk.4.ffn_up.weight - [ 4096, 14336, 1, 1], type = bf16, converting to q2_K .. size = 112.00 MiB -> 18.38 MiB
[ 49/ 291] blk.5.attn_k.weight - [ 4096, 1024, 1, 1], type = bf16, converting to q2_K .. size = 8.00 MiB -> 1.31 MiB
[ 50/ 291] blk.5.attn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 51/ 291] blk.5.attn_output.weight - [ 4096, 4096, 1, 1], type = bf16, converting to q3_K .. size = 32.00 MiB -> 6.88 MiB
[ 52/ 291] blk.5.attn_q.weight - [ 4096, 4096, 1, 1], type = bf16, converting to q2_K .. size = 32.00 MiB -> 5.25 MiB
[ 53/ 291] blk.5.attn_v.weight - [ 4096, 1024, 1, 1], type = bf16, converting to q4_K .. size = 8.00 MiB -> 2.25 MiB
[ 54/ 291] blk.5.ffn_down.weight - [14336, 4096, 1, 1], type = bf16, converting to q3_K .. size = 112.00 MiB -> 24.06 MiB
[ 55/ 291] blk.5.ffn_gate.weight - [ 4096, 14336, 1, 1], type = bf16, converting to q2_K .. size = 112.00 MiB -> 18.38 MiB
[ 56/ 291] blk.5.ffn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 57/ 291] blk.5.ffn_up.weight - [ 4096, 14336, 1, 1], type = bf16, converting to q2_K .. size = 112.00 MiB -> 18.38 MiB
[ 58/ 291] blk.6.attn_k.weight - [ 4096, 1024, 1, 1], type = bf16, converting to q2_K .. size = 8.00 MiB -> 1.31 MiB
[ 59/ 291] blk.6.attn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 60/ 291] blk.6.attn_output.weight - [ 4096, 4096, 1, 1], type = bf16, converting to q3_K .. size = 32.00 MiB -> 6.88 MiB
[ 61/ 291] blk.6.attn_q.weight - [ 4096, 4096, 1, 1], type = bf16, converting to q2_K .. size = 32.00 MiB -> 5.25 MiB
[ 62/ 291] blk.6.attn_v.weight - [ 4096, 1024, 1, 1], type = bf16, converting to q4_K .. size = 8.00 MiB -> 2.25 MiB
[ 63/ 291] blk.6.ffn_down.weight - [14336, 4096, 1, 1], type = bf16, converting to q3_K .. size = 112.00 MiB -> 24.06 MiB
[ 64/ 291] blk.6.ffn_gate.weight - [ 4096, 14336, 1, 1], type = bf16, converting to q2_K .. size = 112.00 MiB -> 18.38 MiB
[ 65/ 291] blk.6.ffn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 66/ 291] blk.6.ffn_up.weight - [ 4096, 14336, 1, 1], type = bf16, converting to q2_K .. size = 112.00 MiB -> 18.38 MiB
[ 67/ 291] blk.7.attn_k.weight - [ 4096, 1024, 1, 1], type = bf16, converting to q2_K .. size = 8.00 MiB -> 1.31 MiB
[ 68/ 291] blk.7.attn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 69/ 291] blk.7.attn_output.weight - [ 4096, 4096, 1, 1], type = bf16, converting to q3_K .. size = 32.00 MiB -> 6.88 MiB
[ 70/ 291] blk.7.attn_q.weight - [ 4096, 4096, 1, 1], type = bf16, converting to q2_K .. size = 32.00 MiB -> 5.25 MiB
[ 71/ 291] blk.7.attn_v.weight - [ 4096, 1024, 1, 1], type = bf16, converting to q4_K .. size = 8.00 MiB -> 2.25 MiB
[ 72/ 291] blk.7.ffn_down.weight - [14336, 4096, 1, 1], type = bf16, converting to q3_K .. size = 112.00 MiB -> 24.06 MiB
[ 73/ 291] blk.7.ffn_gate.weight - [ 4096, 14336, 1, 1], type = bf16, converting to q2_K .. size = 112.00 MiB -> 18.38 MiB
[ 74/ 291] blk.7.ffn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 75/ 291] blk.7.ffn_up.weight - [ 4096, 14336, 1, 1], type = bf16, converting to q2_K .. size = 112.00 MiB -> 18.38 MiB
[ 76/ 291] blk.8.attn_k.weight - [ 4096, 1024, 1, 1], type = bf16, converting to q2_K .. size = 8.00 MiB -> 1.31 MiB
[ 77/ 291] blk.8.attn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 78/ 291] blk.8.attn_output.weight - [ 4096, 4096, 1, 1], type = bf16, converting to q3_K .. size = 32.00 MiB -> 6.88 MiB
[ 79/ 291] blk.8.attn_q.weight - [ 4096, 4096, 1, 1], type = bf16, converting to q2_K .. size = 32.00 MiB -> 5.25 MiB
[ 80/ 291] blk.8.attn_v.weight - [ 4096, 1024, 1, 1], type = bf16, converting to q4_K .. size = 8.00 MiB -> 2.25 MiB
[ 81/ 291] blk.8.ffn_down.weight - [14336, 4096, 1, 1], type = bf16, converting to q3_K .. size = 112.00 MiB -> 24.06 MiB
[ 82/ 291] blk.8.ffn_gate.weight - [ 4096, 14336, 1, 1], type = bf16, converting to q2_K .. size = 112.00 MiB -> 18.38 MiB
[ 83/ 291] blk.8.ffn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 84/ 291] blk.8.ffn_up.weight - [ 4096, 14336, 1, 1], type = bf16, converting to q2_K .. size = 112.00 MiB -> 18.38 MiB
[ 85/ 291] blk.9.attn_k.weight - [ 4096, 1024, 1, 1], type = bf16, converting to q2_K .. size = 8.00 MiB -> 1.31 MiB
[ 86/ 291] blk.9.attn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 87/ 291] blk.9.attn_output.weight - [ 4096, 4096, 1, 1], type = bf16, converting to q3_K .. size = 32.00 MiB -> 6.88 MiB
[ 88/ 291] blk.9.attn_q.weight - [ 4096, 4096, 1, 1], type = bf16, converting to q2_K .. size = 32.00 MiB -> 5.25 MiB
[ 89/ 291] blk.9.attn_v.weight - [ 4096, 1024, 1, 1], type = bf16, converting to q4_K .. size = 8.00 MiB -> 2.25 MiB
[ 90/ 291] blk.9.ffn_down.weight - [14336, 4096, 1, 1], type = bf16, converting to q3_K .. size = 112.00 MiB -> 24.06 MiB
[ 91/ 291] blk.9.ffn_gate.weight - [ 4096, 14336, 1, 1], type = bf16, converting to q2_K .. size = 112.00 MiB -> 18.38 MiB
[ 92/ 291] blk.9.ffn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 93/ 291] blk.9.ffn_up.weight - [ 4096, 14336, 1, 1], type = bf16, converting to q2_K .. size = 112.00 MiB -> 18.38 MiB
[ 94/ 291] blk.10.attn_k.weight - [ 4096, 1024, 1, 1], type = bf16, converting to q2_K .. size = 8.00 MiB -> 1.31 MiB
[ 95/ 291] blk.10.attn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 96/ 291] blk.10.attn_output.weight - [ 4096, 4096, 1, 1], type = bf16, converting to q3_K .. size = 32.00 MiB -> 6.88 MiB
[ 97/ 291] blk.10.attn_q.weight - [ 4096, 4096, 1, 1], type = bf16, converting to q2_K .. size = 32.00 MiB -> 5.25 MiB
[ 98/ 291] blk.10.attn_v.weight - [ 4096, 1024, 1, 1], type = bf16, converting to q4_K .. size = 8.00 MiB -> 2.25 MiB
[ 99/ 291] blk.10.ffn_down.weight - [14336, 4096, 1, 1], type = bf16, converting to q3_K .. size = 112.00 MiB -> 24.06 MiB
[ 100/ 291] blk.10.ffn_gate.weight - [ 4096, 14336, 1, 1], type = bf16, converting to q2_K .. size = 112.00 MiB -> 18.38 MiB
[ 101/ 291] blk.10.ffn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 102/ 291] blk.10.ffn_up.weight - [ 4096, 14336, 1, 1], type = bf16, converting to q2_K .. size = 112.00 MiB -> 18.38 MiB
[ 103/ 291] blk.11.attn_k.weight - [ 4096, 1024, 1, 1], type = bf16, converting to q2_K .. size = 8.00 MiB -> 1.31 MiB
[ 104/ 291] blk.11.attn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 105/ 291] blk.11.attn_output.weight - [ 4096, 4096, 1, 1], type = bf16, converting to q3_K .. size = 32.00 MiB -> 6.88 MiB
[ 106/ 291] blk.11.attn_q.weight - [ 4096, 4096, 1, 1], type = bf16, converting to q2_K .. size = 32.00 MiB -> 5.25 MiB
[ 107/ 291] blk.11.attn_v.weight - [ 4096, 1024, 1, 1], type = bf16, converting to q4_K .. size = 8.00 MiB -> 2.25 MiB
[ 108/ 291] blk.11.ffn_down.weight - [14336, 4096, 1, 1], type = bf16, converting to q3_K .. size = 112.00 MiB -> 24.06 MiB
[ 109/ 291] blk.11.ffn_gate.weight - [ 4096, 14336, 1, 1], type = bf16, converting to q2_K .. size = 112.00 MiB -> 18.38 MiB
[ 110/ 291] blk.11.ffn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 111/ 291] blk.11.ffn_up.weight - [ 4096, 14336, 1, 1], type = bf16, converting to q2_K .. size = 112.00 MiB -> 18.38 MiB
[ 112/ 291] blk.12.attn_k.weight - [ 4096, 1024, 1, 1], type = bf16, converting to q2_K .. size = 8.00 MiB -> 1.31 MiB
[ 113/ 291] blk.12.attn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 114/ 291] blk.12.attn_output.weight - [ 4096, 4096, 1, 1], type = bf16, converting to q3_K .. size = 32.00 MiB -> 6.88 MiB
[ 115/ 291] blk.12.attn_q.weight - [ 4096, 4096, 1, 1], type = bf16, converting to q2_K .. size = 32.00 MiB -> 5.25 MiB
[ 116/ 291] blk.12.attn_v.weight - [ 4096, 1024, 1, 1], type = bf16, converting to q4_K .. size = 8.00 MiB -> 2.25 MiB