llama.cpp
Misc. bug: Quantization process 100 times slower on Windows (dockerized)
Name and Version
llama-quantize, build = 4691 (369be559)
Operating systems
Windows
Which llama.cpp modules do you know to be affected?
llama-quantize
Command line
llama_cpp/build/bin/llama-quantize quantized_models/Meta-Llama-3.1-8B-Instruct-abliterated-GGUF/meta-llama-3.1-8b-instruct-abliterated.bf16.gguf quantized_models/Meta-Llama-3.1-8B-Instruct-abliterated-GGUF/meta-llama-3.1-8b-instruct-abliterated-Q2_K.gguf Q2_K
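If it helps with diagnosis: llama-quantize also accepts an optional thread count as a trailing positional argument, and pinning it explicitly rules out bad thread-count detection inside the container. A minimal sketch of the same invocation (the count of 8 is only an example):

# Same command, but with an explicit thread count as the last argument
llama_cpp/build/bin/llama-quantize \
  quantized_models/Meta-Llama-3.1-8B-Instruct-abliterated-GGUF/meta-llama-3.1-8b-instruct-abliterated.bf16.gguf \
  quantized_models/Meta-Llama-3.1-8B-Instruct-abliterated-GGUF/meta-llama-3.1-8b-instruct-abliterated-Q2_K.gguf \
  Q2_K 8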
Problem description & steps to reproduce
Issue Summary: The quantization process using llama.cpp takes 100 times longer on Windows (dockerized) compared to Linux, with the container using only 1% of the CPU despite being capable of using all available cores.
Steps to Reproduce:
- Run llama-quantize on Windows and Linux with the same model (I am doing this in SpongeQuant, with the CPU-only option).
- Observe the CPU usage during the quantization process (using htop on Linux and Task Manager or docker stats on Windows; see the command sketch below this list).
- Compare the time taken for the same operation between Windows and Linux.
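For reference, the observation on the Windows side boils down to roughly the following commands (the container name spongequant is an example, not necessarily the real one):

# Windows host: live CPU usage of the running container
docker stats spongequant

# Number of CPUs visible inside the container
docker exec spongequant nproc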
Expected Behavior: The quantization process should perform similarly on both Windows and Linux, utilizing the available CPU resources efficiently.
Actual Behavior: On Windows:
- The process is extremely slow, taking about 100 times longer than on Linux.
- The process uses only about 1% of the CPU throughout the run (apart from a brief spike to around 80% CPU when it starts).
On Linux, the process runs as expected, with full CPU utilization and normal speed.
Additional Information:
- The issue persists regardless of the Docker container configuration or resource limits.
- The behavior looks like CPU throttling, but no specific throttling setting has been identified on Windows.
- I have tested with the stress package, and the container is capable of using 100% of the CPU, yet quantization remains limited (see the stress sketch below this list).
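The stress check mentioned above was along these lines (a minimal sketch run inside the container; the worker count and timeout are examples):

# Install the stress tool inside the container and saturate every visible core
apt-get update && apt-get install -y stress
stress --cpu "$(nproc)" --timeout 60
# While this runs, `docker stats` on the host reports roughly 100% per core,
# so the container itself is not CPU-limited.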
Container Setup: The issue occurs in a container based on Ubuntu 22.04, with the Python dependencies installed and llama.cpp compiled from source. On Windows, Docker runs through WSL2.
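Because Docker Desktop on Windows runs containers inside the WSL2 VM, I also looked at what the VM itself exposes; a minimal sketch of that check (the .wslconfig values shown are examples, not my actual limits):

# From a WSL2 shell (or inside the container): CPUs the VM exposes
nproc
grep -c ^processor /proc/cpuinfo

# %UserProfile%\.wslconfig can cap the VM; example of explicit limits:
#   [wsl2]
#   processors=16
#   memory=32GB
# After editing it, restart the VM from Windows with: wsl --shutdown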
# -------------------------------------------------------------------------------
# Use a plain Ubuntu image for CPU-only mode.
# -------------------------------------------------------------------------------
FROM ubuntu:22.04
# -------------------------------------------------------------------------------
# Disable interactive prompts.
# -------------------------------------------------------------------------------
ENV DEBIAN_FRONTEND=noninteractive
# -------------------------------------------------------------------------------
# Install required system dependencies.
# -------------------------------------------------------------------------------
RUN apt-get update && apt-get install -y \
build-essential \
cmake \
git \
curl \
wget \
ninja-build \
python3 \
python3-pip \
libssl-dev \
libffi-dev \
&& rm -rf /var/lib/apt/lists/*
# -------------------------------------------------------------------------------
# Set the working directory.
# -------------------------------------------------------------------------------
WORKDIR /app
# -------------------------------------------------------------------------------
# Create a cache directory and set environment variables.
# -------------------------------------------------------------------------------
RUN mkdir -p /app/.cache && chmod -R 777 /app/.cache
ENV HF_HOME=/app/.cache
ENV HOME=/app
# -------------------------------------------------------------------------------
# Copy the requirements file.
# -------------------------------------------------------------------------------
COPY ./app/requirements.cpu.txt /app/
# -------------------------------------------------------------------------------
# Upgrade pip.
# -------------------------------------------------------------------------------
RUN python3 -m pip install --upgrade pip==25.0
# -------------------------------------------------------------------------------
# Force-install torch first so that auto-gptq’s metadata generation finds it.
# -------------------------------------------------------------------------------
RUN python3 -m pip install torch==2.6.0
# -------------------------------------------------------------------------------
# Install the rest of the Python dependencies.
# -------------------------------------------------------------------------------
RUN python3 -m pip install -r requirements.cpu.txt
# -------------------------------------------------------------------------------
# Clone and build llama_cpp (for GGUF quantization).
# -------------------------------------------------------------------------------
RUN git clone https://github.com/ggerganov/llama.cpp.git /app/llama_cpp
WORKDIR /app/llama_cpp
RUN mkdir build && cd build && \
cmake -DCMAKE_BUILD_TYPE=Release \
-G Ninja .. && \
ninja -j$(nproc)
# -------------------------------------------------------------------------------
# Copy the rest of your application files.
# -------------------------------------------------------------------------------
COPY ./app /app
WORKDIR /app
# -------------------------------------------------------------------------------
# Expose the port (for Gradio UI, for example) and set the entrypoint.
# -------------------------------------------------------------------------------
EXPOSE 7860
CMD ["python3", "app.py"]
First Bad Commit
No response
Relevant log output
=== Starting SpongeQuant Quantization Process ===
=== Processing model: mlabonne/Meta-Llama-3.1-8B-Instruct-abliterated ===
=== Downloading Model ===
[INFO] Model ID: mlabonne/Meta-Llama-3.1-8B-Instruct-abliterated
[INFO] Target directory: models/Meta-Llama-3.1-8B-Instruct-abliterated
[INFO] Model mlabonne/Meta-Llama-3.1-8B-Instruct-abliterated is already fully downloaded at models/Meta-Llama-3.1-8B-Instruct-abliterated. Skipping download.
[INFO] Running GGUF quantization...
=== GGUF Quantization for Meta-Llama-3.1-8B-Instruct-abliterated ===
[INFO] Expected output file: quantized_models/Meta-Llama-3.1-8B-Instruct-abliterated-GGUF/meta-llama-3.1-8b-instruct-abliterated.bf16.gguf
[INFO] File quantized_models/Meta-Llama-3.1-8B-Instruct-abliterated-GGUF/meta-llama-3.1-8b-instruct-abliterated.bf16.gguf already exists. Skipping conversion.
[WARN] Skipping IQ2_XXS quantization because imatrix is not enabled.
[WARN] Skipping IQ2_XS quantization because imatrix is not enabled.
[WARN] Skipping IQ2_S quantization because imatrix is not enabled.
[WARN] Skipping IQ2_M quantization because imatrix is not enabled.
[WARN] Skipping IQ3_XXS quantization because imatrix is not enabled.
[WARN] Skipping IQ3_S quantization because imatrix is not enabled.
[WARN] Skipping IQ3_M quantization because imatrix is not enabled.
[WARN] Skipping IQ3_XS quantization because imatrix is not enabled.
[WARN] Skipping IQ4_XS quantization because imatrix is not enabled.
[WARN] Skipping IQ4_NL quantization because imatrix is not enabled.
[INFO] Quantizing with method 'Q2_K':
"llama_cpp/build/bin/llama-quantize" quantized_models/Meta-Llama-3.1-8B-Instruct-abliterated-GGUF/meta-llama-3.1-8b-instruct-abliterated.bf16.gguf quantized_models/Meta-Llama-3.1-8B-Instruct-abliterated-GGUF/meta-llama-3.1-8b-instruct-abliterated-Q2_K.gguf Q2_K
[DEBUG] Executing command: "llama_cpp/build/bin/llama-quantize" quantized_models/Meta-Llama-3.1-8B-Instruct-abliterated-GGUF/meta-llama-3.1-8b-instruct-abliterated.bf16.gguf quantized_models/Meta-Llama-3.1-8B-Instruct-abliterated-GGUF/meta-llama-3.1-8b-instruct-abliterated-Q2_K.gguf Q2_K
main: build = 4691 (369be559)
main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: quantizing 'quantized_models/Meta-Llama-3.1-8B-Instruct-abliterated-GGUF/meta-llama-3.1-8b-instruct-abliterated.bf16.gguf' to 'quantized_models/Meta-Llama-3.1-8B-Instruct-abliterated-GGUF/meta-llama-3.1-8b-instruct-abliterated-Q2_K.gguf' as Q2_K
llama_model_loader: loaded meta data with 34 key-value pairs and 291 tensors from quantized_models/Meta-Llama-3.1-8B-Instruct-abliterated-GGUF/meta-llama-3.1-8b-instruct-abliterated.bf16.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Meta Llama 3.1 8B Instruct Abliterated
llama_model_loader: - kv 3: general.finetune str = Instruct-abliterated
llama_model_loader: - kv 4: general.basename str = Meta-Llama-3.1
llama_model_loader: - kv 5: general.size_label str = 8B
llama_model_loader: - kv 6: general.license str = llama3.1
llama_model_loader: - kv 7: general.base_model.count u32 = 1
llama_model_loader: - kv 8: general.base_model.0.name str = Meta Llama 3.1 8B Instruct
llama_model_loader: - kv 9: general.base_model.0.organization str = Meta Llama
llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/meta-llama/Met...
llama_model_loader: - kv 11: general.tags arr[str,2] = ["abliterated", "uncensored"]
llama_model_loader: - kv 12: llama.block_count u32 = 32
llama_model_loader: - kv 13: llama.context_length u32 = 131072
llama_model_loader: - kv 14: llama.embedding_length u32 = 4096
llama_model_loader: - kv 15: llama.feed_forward_length u32 = 14336
llama_model_loader: - kv 16: llama.attention.head_count u32 = 32
llama_model_loader: - kv 17: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 18: llama.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv 19: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 20: general.file_type u32 = 32
llama_model_loader: - kv 21: llama.vocab_size u32 = 128256
llama_model_loader: - kv 22: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 23: llama.rope.scaling.type str = linear
llama_model_loader: - kv 24: llama.rope.scaling.factor f32 = 8.000000
llama_model_loader: - kv 25: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 26: tokenizer.ggml.pre str = llama-bpe
llama_model_loader: - kv 27: tokenizer.ggml.tokens arr[str,128256] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 28: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 29: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 30: tokenizer.ggml.bos_token_id u32 = 128000
llama_model_loader: - kv 31: tokenizer.ggml.eos_token_id u32 = 128009
llama_model_loader: - kv 32: tokenizer.chat_template str = {% set loop_messages = messages %}{% ...
llama_model_loader: - kv 33: general.quantization_version u32 = 2
llama_model_loader: - type f32: 65 tensors
llama_model_loader: - type bf16: 226 tensors
[ 1/ 291] output.weight - [ 4096, 128256, 1, 1], type = bf16, converting to q6_K .. size = 1002.00 MiB -> 410.98 MiB
[ 2/ 291] output_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 3/ 291] token_embd.weight - [ 4096, 128256, 1, 1], type = bf16, converting to q2_K .. size = 1002.00 MiB -> 164.39 MiB
[ 4/ 291] blk.0.attn_k.weight - [ 4096, 1024, 1, 1], type = bf16, converting to q2_K .. size = 8.00 MiB -> 1.31 MiB
[ 5/ 291] blk.0.attn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 6/ 291] blk.0.attn_output.weight - [ 4096, 4096, 1, 1], type = bf16, converting to q3_K .. size = 32.00 MiB -> 6.88 MiB
[ 7/ 291] blk.0.attn_q.weight - [ 4096, 4096, 1, 1], type = bf16, converting to q2_K .. size = 32.00 MiB -> 5.25 MiB
[ 8/ 291] blk.0.attn_v.weight - [ 4096, 1024, 1, 1], type = bf16, converting to q4_K .. size = 8.00 MiB -> 2.25 MiB
[ 9/ 291] blk.0.ffn_down.weight - [14336, 4096, 1, 1], type = bf16, converting to q3_K .. size = 112.00 MiB -> 24.06 MiB
[ 10/ 291] blk.0.ffn_gate.weight - [ 4096, 14336, 1, 1], type = bf16, converting to q2_K .. size = 112.00 MiB -> 18.38 MiB
[ 11/ 291] blk.0.ffn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 12/ 291] blk.0.ffn_up.weight - [ 4096, 14336, 1, 1], type = bf16, converting to q2_K .. size = 112.00 MiB -> 18.38 MiB
[ 13/ 291] blk.1.attn_k.weight - [ 4096, 1024, 1, 1], type = bf16, converting to q2_K .. size = 8.00 MiB -> 1.31 MiB
[ 14/ 291] blk.1.attn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 15/ 291] blk.1.attn_output.weight - [ 4096, 4096, 1, 1], type = bf16, converting to q3_K .. size = 32.00 MiB -> 6.88 MiB
[ 16/ 291] blk.1.attn_q.weight - [ 4096, 4096, 1, 1], type = bf16, converting to q2_K .. size = 32.00 MiB -> 5.25 MiB
[ 17/ 291] blk.1.attn_v.weight - [ 4096, 1024, 1, 1], type = bf16, converting to q4_K .. size = 8.00 MiB -> 2.25 MiB
[ 18/ 291] blk.1.ffn_down.weight - [14336, 4096, 1, 1], type = bf16, converting to q3_K .. size = 112.00 MiB -> 24.06 MiB
[ 19/ 291] blk.1.ffn_gate.weight - [ 4096, 14336, 1, 1], type = bf16, converting to q2_K .. size = 112.00 MiB -> 18.38 MiB
[ 20/ 291] blk.1.ffn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 21/ 291] blk.1.ffn_up.weight - [ 4096, 14336, 1, 1], type = bf16, converting to q2_K .. size = 112.00 MiB -> 18.38 MiB
[ 22/ 291] blk.2.attn_k.weight - [ 4096, 1024, 1, 1], type = bf16, converting to q2_K .. size = 8.00 MiB -> 1.31 MiB
[ 23/ 291] blk.2.attn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 24/ 291] blk.2.attn_output.weight - [ 4096, 4096, 1, 1], type = bf16, converting to q3_K .. size = 32.00 MiB -> 6.88 MiB
[ 25/ 291] blk.2.attn_q.weight - [ 4096, 4096, 1, 1], type = bf16, converting to q2_K .. size = 32.00 MiB -> 5.25 MiB
[ 26/ 291] blk.2.attn_v.weight - [ 4096, 1024, 1, 1], type = bf16, converting to q4_K .. size = 8.00 MiB -> 2.25 MiB
[ 27/ 291] blk.2.ffn_down.weight - [14336, 4096, 1, 1], type = bf16, converting to q3_K .. size = 112.00 MiB -> 24.06 MiB
[ 28/ 291] blk.2.ffn_gate.weight - [ 4096, 14336, 1, 1], type = bf16, converting to q2_K .. size = 112.00 MiB -> 18.38 MiB
[ 29/ 291] blk.2.ffn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 30/ 291] blk.2.ffn_up.weight - [ 4096, 14336, 1, 1], type = bf16, converting to q2_K .. size = 112.00 MiB -> 18.38 MiB
[ 31/ 291] blk.3.attn_k.weight - [ 4096, 1024, 1, 1], type = bf16, converting to q2_K .. size = 8.00 MiB -> 1.31 MiB
[ 32/ 291] blk.3.attn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 33/ 291] blk.3.attn_output.weight - [ 4096, 4096, 1, 1], type = bf16, converting to q3_K .. size = 32.00 MiB -> 6.88 MiB
[ 34/ 291] blk.3.attn_q.weight - [ 4096, 4096, 1, 1], type = bf16, converting to q2_K .. size = 32.00 MiB -> 5.25 MiB
[ 35/ 291] blk.3.attn_v.weight - [ 4096, 1024, 1, 1], type = bf16, converting to q4_K .. size = 8.00 MiB -> 2.25 MiB
[ 36/ 291] blk.3.ffn_down.weight - [14336, 4096, 1, 1], type = bf16, converting to q3_K .. size = 112.00 MiB -> 24.06 MiB
[ 37/ 291] blk.3.ffn_gate.weight - [ 4096, 14336, 1, 1], type = bf16, converting to q2_K .. size = 112.00 MiB -> 18.38 MiB
[ 38/ 291] blk.3.ffn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 39/ 291] blk.3.ffn_up.weight - [ 4096, 14336, 1, 1], type = bf16, converting to q2_K .. size = 112.00 MiB -> 18.38 MiB
[ 40/ 291] blk.4.attn_k.weight - [ 4096, 1024, 1, 1], type = bf16, converting to q2_K .. size = 8.00 MiB -> 1.31 MiB
[ 41/ 291] blk.4.attn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 42/ 291] blk.4.attn_output.weight - [ 4096, 4096, 1, 1], type = bf16, converting to q3_K .. size = 32.00 MiB -> 6.88 MiB
[ 43/ 291] blk.4.attn_q.weight - [ 4096, 4096, 1, 1], type = bf16, converting to q2_K .. size = 32.00 MiB -> 5.25 MiB
[ 44/ 291] blk.4.attn_v.weight - [ 4096, 1024, 1, 1], type = bf16, converting to q4_K .. size = 8.00 MiB -> 2.25 MiB
[ 45/ 291] blk.4.ffn_down.weight - [14336, 4096, 1, 1], type = bf16, converting to q3_K .. size = 112.00 MiB -> 24.06 MiB
[ 46/ 291] blk.4.ffn_gate.weight - [ 4096, 14336, 1, 1], type = bf16, converting to q2_K .. size = 112.00 MiB -> 18.38 MiB
[ 47/ 291] blk.4.ffn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 48/ 291] blk.4.ffn_up.weight - [ 4096, 14336, 1, 1], type = bf16, converting to q2_K .. size = 112.00 MiB -> 18.38 MiB
[ 49/ 291] blk.5.attn_k.weight - [ 4096, 1024, 1, 1], type = bf16, converting to q2_K .. size = 8.00 MiB -> 1.31 MiB
[ 50/ 291] blk.5.attn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 51/ 291] blk.5.attn_output.weight - [ 4096, 4096, 1, 1], type = bf16, converting to q3_K .. size = 32.00 MiB -> 6.88 MiB
[ 52/ 291] blk.5.attn_q.weight - [ 4096, 4096, 1, 1], type = bf16, converting to q2_K .. size = 32.00 MiB -> 5.25 MiB
[ 53/ 291] blk.5.attn_v.weight - [ 4096, 1024, 1, 1], type = bf16, converting to q4_K .. size = 8.00 MiB -> 2.25 MiB
[ 54/ 291] blk.5.ffn_down.weight - [14336, 4096, 1, 1], type = bf16, converting to q3_K .. size = 112.00 MiB -> 24.06 MiB
[ 55/ 291] blk.5.ffn_gate.weight - [ 4096, 14336, 1, 1], type = bf16, converting to q2_K .. size = 112.00 MiB -> 18.38 MiB
[ 56/ 291] blk.5.ffn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 57/ 291] blk.5.ffn_up.weight - [ 4096, 14336, 1, 1], type = bf16, converting to q2_K .. size = 112.00 MiB -> 18.38 MiB
[ 58/ 291] blk.6.attn_k.weight - [ 4096, 1024, 1, 1], type = bf16, converting to q2_K .. size = 8.00 MiB -> 1.31 MiB
[ 59/ 291] blk.6.attn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 60/ 291] blk.6.attn_output.weight - [ 4096, 4096, 1, 1], type = bf16, converting to q3_K .. size = 32.00 MiB -> 6.88 MiB
[ 61/ 291] blk.6.attn_q.weight - [ 4096, 4096, 1, 1], type = bf16, converting to q2_K .. size = 32.00 MiB -> 5.25 MiB
[ 62/ 291] blk.6.attn_v.weight - [ 4096, 1024, 1, 1], type = bf16, converting to q4_K .. size = 8.00 MiB -> 2.25 MiB
[ 63/ 291] blk.6.ffn_down.weight - [14336, 4096, 1, 1], type = bf16, converting to q3_K .. size = 112.00 MiB -> 24.06 MiB
[ 64/ 291] blk.6.ffn_gate.weight - [ 4096, 14336, 1, 1], type = bf16, converting to q2_K .. size = 112.00 MiB -> 18.38 MiB
[ 65/ 291] blk.6.ffn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 66/ 291] blk.6.ffn_up.weight - [ 4096, 14336, 1, 1], type = bf16, converting to q2_K .. size = 112.00 MiB -> 18.38 MiB
[ 67/ 291] blk.7.attn_k.weight - [ 4096, 1024, 1, 1], type = bf16, converting to q2_K .. size = 8.00 MiB -> 1.31 MiB
[ 68/ 291] blk.7.attn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 69/ 291] blk.7.attn_output.weight - [ 4096, 4096, 1, 1], type = bf16, converting to q3_K .. size = 32.00 MiB -> 6.88 MiB
[ 70/ 291] blk.7.attn_q.weight - [ 4096, 4096, 1, 1], type = bf16, converting to q2_K .. size = 32.00 MiB -> 5.25 MiB
[ 71/ 291] blk.7.attn_v.weight - [ 4096, 1024, 1, 1], type = bf16, converting to q4_K .. size = 8.00 MiB -> 2.25 MiB
[ 72/ 291] blk.7.ffn_down.weight - [14336, 4096, 1, 1], type = bf16, converting to q3_K .. size = 112.00 MiB -> 24.06 MiB
[ 73/ 291] blk.7.ffn_gate.weight - [ 4096, 14336, 1, 1], type = bf16, converting to q2_K .. size = 112.00 MiB -> 18.38 MiB
[ 74/ 291] blk.7.ffn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 75/ 291] blk.7.ffn_up.weight - [ 4096, 14336, 1, 1], type = bf16, converting to q2_K .. size = 112.00 MiB -> 18.38 MiB
[ 76/ 291] blk.8.attn_k.weight - [ 4096, 1024, 1, 1], type = bf16, converting to q2_K .. size = 8.00 MiB -> 1.31 MiB
[ 77/ 291] blk.8.attn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 78/ 291] blk.8.attn_output.weight - [ 4096, 4096, 1, 1], type = bf16, converting to q3_K .. size = 32.00 MiB -> 6.88 MiB
[ 79/ 291] blk.8.attn_q.weight - [ 4096, 4096, 1, 1], type = bf16, converting to q2_K .. size = 32.00 MiB -> 5.25 MiB
[ 80/ 291] blk.8.attn_v.weight - [ 4096, 1024, 1, 1], type = bf16, converting to q4_K .. size = 8.00 MiB -> 2.25 MiB
[ 81/ 291] blk.8.ffn_down.weight - [14336, 4096, 1, 1], type = bf16, converting to q3_K .. size = 112.00 MiB -> 24.06 MiB
[ 82/ 291] blk.8.ffn_gate.weight - [ 4096, 14336, 1, 1], type = bf16, converting to q2_K .. size = 112.00 MiB -> 18.38 MiB
[ 83/ 291] blk.8.ffn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 84/ 291] blk.8.ffn_up.weight - [ 4096, 14336, 1, 1], type = bf16, converting to q2_K .. size = 112.00 MiB -> 18.38 MiB
[ 85/ 291] blk.9.attn_k.weight - [ 4096, 1024, 1, 1], type = bf16, converting to q2_K .. size = 8.00 MiB -> 1.31 MiB
[ 86/ 291] blk.9.attn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 87/ 291] blk.9.attn_output.weight - [ 4096, 4096, 1, 1], type = bf16, converting to q3_K .. size = 32.00 MiB -> 6.88 MiB
[ 88/ 291] blk.9.attn_q.weight - [ 4096, 4096, 1, 1], type = bf16, converting to q2_K .. size = 32.00 MiB -> 5.25 MiB
[ 89/ 291] blk.9.attn_v.weight - [ 4096, 1024, 1, 1], type = bf16, converting to q4_K .. size = 8.00 MiB -> 2.25 MiB
[ 90/ 291] blk.9.ffn_down.weight - [14336, 4096, 1, 1], type = bf16, converting to q3_K .. size = 112.00 MiB -> 24.06 MiB
[ 91/ 291] blk.9.ffn_gate.weight - [ 4096, 14336, 1, 1], type = bf16, converting to q2_K .. size = 112.00 MiB -> 18.38 MiB
[ 92/ 291] blk.9.ffn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 93/ 291] blk.9.ffn_up.weight - [ 4096, 14336, 1, 1], type = bf16, converting to q2_K .. size = 112.00 MiB -> 18.38 MiB
[ 94/ 291] blk.10.attn_k.weight - [ 4096, 1024, 1, 1], type = bf16, converting to q2_K .. size = 8.00 MiB -> 1.31 MiB
[ 95/ 291] blk.10.attn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 96/ 291] blk.10.attn_output.weight - [ 4096, 4096, 1, 1], type = bf16, converting to q3_K .. size = 32.00 MiB -> 6.88 MiB
[ 97/ 291] blk.10.attn_q.weight - [ 4096, 4096, 1, 1], type = bf16, converting to q2_K .. size = 32.00 MiB -> 5.25 MiB
[ 98/ 291] blk.10.attn_v.weight - [ 4096, 1024, 1, 1], type = bf16, converting to q4_K .. size = 8.00 MiB -> 2.25 MiB
[ 99/ 291] blk.10.ffn_down.weight - [14336, 4096, 1, 1], type = bf16, converting to q3_K .. size = 112.00 MiB -> 24.06 MiB
[ 100/ 291] blk.10.ffn_gate.weight - [ 4096, 14336, 1, 1], type = bf16, converting to q2_K .. size = 112.00 MiB -> 18.38 MiB
[ 101/ 291] blk.10.ffn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 102/ 291] blk.10.ffn_up.weight - [ 4096, 14336, 1, 1], type = bf16, converting to q2_K .. size = 112.00 MiB -> 18.38 MiB
[ 103/ 291] blk.11.attn_k.weight - [ 4096, 1024, 1, 1], type = bf16, converting to q2_K .. size = 8.00 MiB -> 1.31 MiB
[ 104/ 291] blk.11.attn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 105/ 291] blk.11.attn_output.weight - [ 4096, 4096, 1, 1], type = bf16, converting to q3_K .. size = 32.00 MiB -> 6.88 MiB
[ 106/ 291] blk.11.attn_q.weight - [ 4096, 4096, 1, 1], type = bf16, converting to q2_K .. size = 32.00 MiB -> 5.25 MiB
[ 107/ 291] blk.11.attn_v.weight - [ 4096, 1024, 1, 1], type = bf16, converting to q4_K .. size = 8.00 MiB -> 2.25 MiB
[ 108/ 291] blk.11.ffn_down.weight - [14336, 4096, 1, 1], type = bf16, converting to q3_K .. size = 112.00 MiB -> 24.06 MiB
[ 109/ 291] blk.11.ffn_gate.weight - [ 4096, 14336, 1, 1], type = bf16, converting to q2_K .. size = 112.00 MiB -> 18.38 MiB
[ 110/ 291] blk.11.ffn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 111/ 291] blk.11.ffn_up.weight - [ 4096, 14336, 1, 1], type = bf16, converting to q2_K .. size = 112.00 MiB -> 18.38 MiB
[ 112/ 291] blk.12.attn_k.weight - [ 4096, 1024, 1, 1], type = bf16, converting to q2_K .. size = 8.00 MiB -> 1.31 MiB
[ 113/ 291] blk.12.attn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 114/ 291] blk.12.attn_output.weight - [ 4096, 4096, 1, 1], type = bf16, converting to q3_K .. size = 32.00 MiB -> 6.88 MiB
[ 115/ 291] blk.12.attn_q.weight - [ 4096, 4096, 1, 1], type = bf16, converting to q2_K .. size = 32.00 MiB -> 5.25 MiB
[ 116/ 291] blk.12.attn_v.weight - [ 4096, 1024, 1, 1], type = bf16, converting to q4_K .. size = 8.00 MiB -> 2.25 MiB