Convert Hugging Face model to ggml?

Open jzju opened this issue 1 year ago • 4 comments

Is it possible to have the convert script support the Hugging Face format, like the one here: https://huggingface.co/openai/whisper-medium/tree/main ? The use case is to run fine-tuned models with whisper.cpp.

jzju avatar Nov 18 '22 21:11 jzju

I think this is similar to what I did for the GPT-J model:

https://github.com/ggerganov/ggml/tree/master/examples/gpt-j

See the convert script there. If somebody wants to take a shot, go ahead; otherwise I'll add support at some point in the future.

ggerganov avatar Nov 19 '22 11:11 ggerganov

I tried to get it to work, but it doesn't print anything:

$ ./main -m ../ggml-model.bin -f file.wav -l sv
whisper_model_load: loading model from '../ggml-model.bin'
whisper_model_load: n_vocab       = 51865
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 512
whisper_model_load: n_audio_head  = 8
whisper_model_load: n_audio_layer = 6
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 512
whisper_model_load: n_text_head   = 8
whisper_model_load: n_text_layer  = 6
whisper_model_load: n_mels        = 80
whisper_model_load: f16           = 1
whisper_model_load: type          = 2
whisper_model_load: mem_required  = 506.00 MB
whisper_model_load: adding 1608 extra tokens
whisper_model_load: ggml ctx size = 140.60 MB
whisper_model_load: memory size =    22.83 MB
whisper_model_load: model size  =   140.54 MB

system_info: n_threads = 4 / 16 | AVX2 = 1 | AVX512 = 0 | NEON = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | 

main: processing '../file.wav' (1837163 samples, 114.8 sec), 4 threads, 1 processors, lang = sv, task = transcribe, timestamps = 1 ...



whisper_print_timings:     load time =   158.53 ms
whisper_print_timings:      mel time =   623.78 ms
whisper_print_timings:   sample time =   241.47 ms
whisper_print_timings:   encode time =  4833.27 ms / 805.55 ms per layer
whisper_print_timings:   decode time =  3598.64 ms / 599.77 ms per layer
whisper_print_timings:    total time =  9461.59 ms

Code:

conv_map = {'self_attn_layer_norm': 'attn_ln',
 'encoder_attn.k_proj': 'attn.key',
 'self_attn.out_proj': 'attn.out',
 'encoder_attn.out_proj': 'cross_attn.out',
 'self_attn.q_proj': 'attn.query',
 'encoder_attn.q_proj': 'cross_attn.query',
 'self_attn.v_proj': 'attn.value',
 'encoder_attn.v_proj': 'cross_attn.value',
 'encoder_attn_layer_norm': 'cross_attn_ln',
 'fc1': 'mlp.0',
 'fc2': 'mlp.2',
 'final_layer_norm': 'mlp_ln',
 'encoder.layer_norm.bias': 'encoder.ln_post.bias',
 'encoder.layer_norm.weight': 'encoder.ln_post.weight',
 'encoder.embed_positions.weight': 'encoder.positional_embedding',
 'decoder.layer_norm.bias': 'decoder.ln.bias',
 'decoder.layer_norm.weight': 'decoder.ln.weight',
 'decoder.embed_positions.weight': 'decoder.positional_embedding',
 'decoder.embed_tokens.weight': 'decoder.token_embedding.weight',
}

import json
import os
import struct

import numpy as np
import torch
from transformers import WhisperForConditionalGeneration
dir_model = "whisper-base"
with open(dir_model + "/vocab.json", "r") as f:
    encoder = json.load(f)
with open(dir_model + "/added_tokens.json", "r") as f:
    encoder_added = json.load(f)
with open(dir_model + "/config.json", "r") as f:
    hparams = json.load(f)

model = WhisperForConditionalGeneration.from_pretrained(dir_model)
list_vars = model.state_dict()

dir_whisper = "whisper"
dir_out = "."

n_mels = hparams["num_mel_bins"]
with np.load(os.path.join(dir_whisper, "whisper/assets", "mel_filters.npz")) as f:
    filters = torch.from_numpy(f[f"mel_{n_mels}"])

multilingual = hparams["vocab_size"] == 51865
dir_tokenizer = os.path.join(dir_whisper, "whisper/assets", multilingual and "multilingual" or "gpt2")

fname_out = dir_out + "/ggml-model.bin"

with open(dir_tokenizer + "/vocab.json", "r", encoding="utf8") as f:
    tokens = json.load(f)

use_f16 = True

fout = open(fname_out, "wb")

fout.write(struct.pack("i", 0x67676d6c)) # magic: ggml in hex
fout.write(struct.pack("i", hparams["vocab_size"]))
fout.write(struct.pack("i", hparams["max_source_positions"]))
fout.write(struct.pack("i", hparams["d_model"]))
fout.write(struct.pack("i", hparams["decoder_attention_heads"]))
fout.write(struct.pack("i", hparams["decoder_layers"]))
fout.write(struct.pack("i", hparams["max_length"]))
fout.write(struct.pack("i", hparams["d_model"]))
fout.write(struct.pack("i", hparams["encoder_attention_heads"]))
fout.write(struct.pack("i", hparams["encoder_layers"]))
fout.write(struct.pack("i", hparams["num_mel_bins"]))
fout.write(struct.pack("i", use_f16))

fout.write(struct.pack("i", filters.shape[0]))
fout.write(struct.pack("i", filters.shape[1]))
for i in range(filters.shape[0]):
    for j in range(filters.shape[1]):
        fout.write(struct.pack("f", filters[i][j]))

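# bytes_to_unicode() is the GPT-2 style byte-to-unicode table helper; it is not defined in this snippet
byte_encoder = bytes_to_unicode()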
byte_decoder = {v:k for k, v in byte_encoder.items()}

fout.write(struct.pack("i", len(tokens)))

for key in tokens:
    text = bytearray([byte_decoder[c] for c in key])
    fout.write(struct.pack("i", len(text)))
    fout.write(text)


for name in list_vars.keys():
    if name == "proj_out.weight":
        continue
    data = list_vars[name].squeeze().numpy()
    data = data.astype(np.float16)
    nn = name
    nn = nn.split(".")[1:]
    if nn[1] == "layers":
        nn[1] = "blocks"
        if ".".join(nn[3:-1]) == "self_attn.k_proj":
            mapped = "attn.key" if nn[0] == "encoder" else "cross_attn.key"
        else:
            mapped = conv_map[".".join(nn[3:-1])]
        name = ".".join(nn[:3] + [mapped] + nn[-1:])
    else:
        name = ".".join(nn)
        name = conv_map[name] if name in conv_map else name
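
The snippet above calls bytes_to_unicode() without defining it. A minimal sketch of that helper, following the byte-to-unicode table used by the GPT-2 tokenizer (an assumption, since the post does not show its definition), might look like this. token_to_bytes is a hypothetical name for the per-token step in the vocabulary loop.

```python
def bytes_to_unicode():
    """Map every byte value 0..255 to a printable unicode character (GPT-2 style table)."""
    # Bytes that are already printable keep their own code point.
    bs = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    cs = bs[:]
    n = 0
    # Every remaining byte is shifted into the range starting at code point 256.
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, (chr(c) for c in cs)))


byte_encoder = bytes_to_unicode()
byte_decoder = {v: k for k, v in byte_encoder.items()}


def token_to_bytes(token):
    """Turn a vocab.json key back into the raw bytes it stands for."""
    return bytes(byte_decoder[c] for c in token)
```

With this table, a vocabulary key such as "Ġhello" decodes to the bytes of " hello", which is what the loop writes to the output file.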

jzju avatar Nov 19 '22 16:11 jzju

I added a conversion script:

https://github.com/ggerganov/whisper.cpp/blob/master/models/convert-h5-to-ggml.py

Use like this:

git clone https://github.com/openai/whisper
git clone https://github.com/ggerganov/whisper.cpp
git clone https://huggingface.co/openai/whisper-medium

python3 ./whisper.cpp/models/convert-h5-to-ggml.py ./whisper-medium/ ./whisper .

However, just as you noticed, it does not produce any output. The decoded tokens are invalid, but I haven't traced where the computation breaks down.

The proj_out.weight tensor from this model is currently ignored, but maybe it has to be used somehow? It's not present in the original OpenAI model, but it seems to be used by the "transformers" implementation. Not sure.

ggerganov avatar Nov 23 '22 15:11 ggerganov

The printout of the Whisper model is below. It ends with (proj_out): Linear(in_features=512, out_features=51865, bias=False)

WhisperForConditionalGeneration(
  (model): WhisperModel(
    (encoder): WhisperEncoder(
      (conv1): Conv1d(80, 512, kernel_size=(3,), stride=(1,), padding=(1,))
      (conv2): Conv1d(512, 512, kernel_size=(3,), stride=(2,), padding=(1,))
      (embed_positions): Embedding(1500, 512)
      (layers): ModuleList(
        (0): WhisperEncoderLayer(
          (self_attn): WhisperAttention(
            (k_proj): Linear(in_features=512, out_features=512, bias=False)
            (v_proj): Linear(in_features=512, out_features=512, bias=True)
            (q_proj): Linear(in_features=512, out_features=512, bias=True)
            (out_proj): Linear(in_features=512, out_features=512, bias=True)
          )
          (self_attn_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
          (activation_fn): GELUActivation()
          (fc1): Linear(in_features=512, out_features=2048, bias=True)
          (fc2): Linear(in_features=2048, out_features=512, bias=True)
          (final_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        )
        (1): WhisperEncoderLayer(
          (self_attn): WhisperAttention(
            (k_proj): Linear(in_features=512, out_features=512, bias=False)
            (v_proj): Linear(in_features=512, out_features=512, bias=True)
            (q_proj): Linear(in_features=512, out_features=512, bias=True)
            (out_proj): Linear(in_features=512, out_features=512, bias=True)
          )
          (self_attn_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
          (activation_fn): GELUActivation()
          (fc1): Linear(in_features=512, out_features=2048, bias=True)
          (fc2): Linear(in_features=2048, out_features=512, bias=True)
          (final_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        )
        (2): WhisperEncoderLayer(
          (self_attn): WhisperAttention(
            (k_proj): Linear(in_features=512, out_features=512, bias=False)
            (v_proj): Linear(in_features=512, out_features=512, bias=True)
            (q_proj): Linear(in_features=512, out_features=512, bias=True)
            (out_proj): Linear(in_features=512, out_features=512, bias=True)
          )
          (self_attn_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
          (activation_fn): GELUActivation()
          (fc1): Linear(in_features=512, out_features=2048, bias=True)
          (fc2): Linear(in_features=2048, out_features=512, bias=True)
          (final_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        )
        (3): WhisperEncoderLayer(
          (self_attn): WhisperAttention(
            (k_proj): Linear(in_features=512, out_features=512, bias=False)
            (v_proj): Linear(in_features=512, out_features=512, bias=True)
            (q_proj): Linear(in_features=512, out_features=512, bias=True)
            (out_proj): Linear(in_features=512, out_features=512, bias=True)
          )
          (self_attn_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
          (activation_fn): GELUActivation()
          (fc1): Linear(in_features=512, out_features=2048, bias=True)
          (fc2): Linear(in_features=2048, out_features=512, bias=True)
          (final_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        )
        (4): WhisperEncoderLayer(
          (self_attn): WhisperAttention(
            (k_proj): Linear(in_features=512, out_features=512, bias=False)
            (v_proj): Linear(in_features=512, out_features=512, bias=True)
            (q_proj): Linear(in_features=512, out_features=512, bias=True)
            (out_proj): Linear(in_features=512, out_features=512, bias=True)
          )
          (self_attn_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
          (activation_fn): GELUActivation()
          (fc1): Linear(in_features=512, out_features=2048, bias=True)
          (fc2): Linear(in_features=2048, out_features=512, bias=True)
          (final_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        )
        (5): WhisperEncoderLayer(
          (self_attn): WhisperAttention(
            (k_proj): Linear(in_features=512, out_features=512, bias=False)
            (v_proj): Linear(in_features=512, out_features=512, bias=True)
            (q_proj): Linear(in_features=512, out_features=512, bias=True)
            (out_proj): Linear(in_features=512, out_features=512, bias=True)
          )
          (self_attn_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
          (activation_fn): GELUActivation()
          (fc1): Linear(in_features=512, out_features=2048, bias=True)
          (fc2): Linear(in_features=2048, out_features=512, bias=True)
          (final_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        )
      )
      (layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
    )
    (decoder): WhisperDecoder(
      (embed_tokens): Embedding(51865, 512, padding_idx=50257)
      (embed_positions): WhisperPositionalEmbedding(448, 512)
      (layers): ModuleList(
        (0): WhisperDecoderLayer(
          (self_attn): WhisperAttention(
            (k_proj): Linear(in_features=512, out_features=512, bias=False)
            (v_proj): Linear(in_features=512, out_features=512, bias=True)
            (q_proj): Linear(in_features=512, out_features=512, bias=True)
            (out_proj): Linear(in_features=512, out_features=512, bias=True)
          )
          (activation_fn): GELUActivation()
          (self_attn_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
          (encoder_attn): WhisperAttention(
            (k_proj): Linear(in_features=512, out_features=512, bias=False)
            (v_proj): Linear(in_features=512, out_features=512, bias=True)
            (q_proj): Linear(in_features=512, out_features=512, bias=True)
            (out_proj): Linear(in_features=512, out_features=512, bias=True)
          )
          (encoder_attn_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
          (fc1): Linear(in_features=512, out_features=2048, bias=True)
          (fc2): Linear(in_features=2048, out_features=512, bias=True)
          (final_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        )
        (1): WhisperDecoderLayer(
          (self_attn): WhisperAttention(
            (k_proj): Linear(in_features=512, out_features=512, bias=False)
            (v_proj): Linear(in_features=512, out_features=512, bias=True)
            (q_proj): Linear(in_features=512, out_features=512, bias=True)
            (out_proj): Linear(in_features=512, out_features=512, bias=True)
          )
          (activation_fn): GELUActivation()
          (self_attn_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
          (encoder_attn): WhisperAttention(
            (k_proj): Linear(in_features=512, out_features=512, bias=False)
            (v_proj): Linear(in_features=512, out_features=512, bias=True)
            (q_proj): Linear(in_features=512, out_features=512, bias=True)
            (out_proj): Linear(in_features=512, out_features=512, bias=True)
          )
          (encoder_attn_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
          (fc1): Linear(in_features=512, out_features=2048, bias=True)
          (fc2): Linear(in_features=2048, out_features=512, bias=True)
          (final_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        )
        (2): WhisperDecoderLayer(
          (self_attn): WhisperAttention(
            (k_proj): Linear(in_features=512, out_features=512, bias=False)
            (v_proj): Linear(in_features=512, out_features=512, bias=True)
            (q_proj): Linear(in_features=512, out_features=512, bias=True)
            (out_proj): Linear(in_features=512, out_features=512, bias=True)
          )
          (activation_fn): GELUActivation()
          (self_attn_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
          (encoder_attn): WhisperAttention(
            (k_proj): Linear(in_features=512, out_features=512, bias=False)
            (v_proj): Linear(in_features=512, out_features=512, bias=True)
            (q_proj): Linear(in_features=512, out_features=512, bias=True)
            (out_proj): Linear(in_features=512, out_features=512, bias=True)
          )
          (encoder_attn_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
          (fc1): Linear(in_features=512, out_features=2048, bias=True)
          (fc2): Linear(in_features=2048, out_features=512, bias=True)
          (final_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        )
        (3): WhisperDecoderLayer(
          (self_attn): WhisperAttention(
            (k_proj): Linear(in_features=512, out_features=512, bias=False)
            (v_proj): Linear(in_features=512, out_features=512, bias=True)
            (q_proj): Linear(in_features=512, out_features=512, bias=True)
            (out_proj): Linear(in_features=512, out_features=512, bias=True)
          )
          (activation_fn): GELUActivation()
          (self_attn_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
          (encoder_attn): WhisperAttention(
            (k_proj): Linear(in_features=512, out_features=512, bias=False)
            (v_proj): Linear(in_features=512, out_features=512, bias=True)
            (q_proj): Linear(in_features=512, out_features=512, bias=True)
            (out_proj): Linear(in_features=512, out_features=512, bias=True)
          )
          (encoder_attn_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
          (fc1): Linear(in_features=512, out_features=2048, bias=True)
          (fc2): Linear(in_features=2048, out_features=512, bias=True)
          (final_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        )
        (4): WhisperDecoderLayer(
          (self_attn): WhisperAttention(
            (k_proj): Linear(in_features=512, out_features=512, bias=False)
            (v_proj): Linear(in_features=512, out_features=512, bias=True)
            (q_proj): Linear(in_features=512, out_features=512, bias=True)
            (out_proj): Linear(in_features=512, out_features=512, bias=True)
          )
          (activation_fn): GELUActivation()
          (self_attn_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
          (encoder_attn): WhisperAttention(
            (k_proj): Linear(in_features=512, out_features=512, bias=False)
            (v_proj): Linear(in_features=512, out_features=512, bias=True)
            (q_proj): Linear(in_features=512, out_features=512, bias=True)
            (out_proj): Linear(in_features=512, out_features=512, bias=True)
          )
          (encoder_attn_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
          (fc1): Linear(in_features=512, out_features=2048, bias=True)
          (fc2): Linear(in_features=2048, out_features=512, bias=True)
          (final_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        )
        (5): WhisperDecoderLayer(
          (self_attn): WhisperAttention(
            (k_proj): Linear(in_features=512, out_features=512, bias=False)
            (v_proj): Linear(in_features=512, out_features=512, bias=True)
            (q_proj): Linear(in_features=512, out_features=512, bias=True)
            (out_proj): Linear(in_features=512, out_features=512, bias=True)
          )
          (activation_fn): GELUActivation()
          (self_attn_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
          (encoder_attn): WhisperAttention(
            (k_proj): Linear(in_features=512, out_features=512, bias=False)
            (v_proj): Linear(in_features=512, out_features=512, bias=True)
            (q_proj): Linear(in_features=512, out_features=512, bias=True)
            (out_proj): Linear(in_features=512, out_features=512, bias=True)
          )
          (encoder_attn_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
          (fc1): Linear(in_features=512, out_features=2048, bias=True)
          (fc2): Linear(in_features=2048, out_features=512, bias=True)
          (final_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        )
      )
      (layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
    )
  )
  (proj_out): Linear(in_features=512, out_features=51865, bias=False)
)

jzju avatar Nov 23 '22 17:11 jzju

So looking a bit more into this, I think that proj_out.weight is not actually used:

https://github.com/huggingface/transformers/blob/9a5b84a0076a04fe9596da72e8668069d4f09ea0/src/transformers/models/whisper/modeling_whisper.py#L1099-L1106

Looking at this discussion, it seems like people are still struggling to make the fine-tuned models run with the original code from OpenAI. @luigisaetta claims to have successfully run it (https://github.com/openai/whisper/discussions/64#discussioncomment-4217106), but they don't clarify what extra processing they did in addition to converting the model tensors (or at least, I don't see it).

Hopefully, they provide some more information on how to convert between the two models and if it is even possible to use the original code base to run HF models.
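
For context, the original OpenAI code computes the output logits by multiplying the final decoder state with the transposed token embedding matrix, so it stores no separate projection tensor. The toy sketch below illustrates that idea with plain Python lists. The function names and the tiny matrices are made up for illustration, and it assumes that proj_out.weight in the HF checkpoint is only a tied copy of model.decoder.embed_tokens.weight, which should be verified on a real checkpoint.

```python
def logits_from_embedding(hidden, token_embedding):
    """Compute one logit per vocabulary row: hidden . row (i.e. hidden @ E^T)."""
    return [sum(h * w for h, w in zip(hidden, row)) for row in token_embedding]


def is_tied_copy(state, name="proj_out.weight",
                 source="model.decoder.embed_tokens.weight"):
    """Return True when `name` holds the same values as `source` in the state dict."""
    return name in state and source in state and state[name] == state[source]


# Toy data: a vocabulary of 3 tokens with 2-dimensional embeddings (hypothetical values).
embed = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
state = {"model.decoder.embed_tokens.weight": embed, "proj_out.weight": embed}

hidden = [2.0, 3.0]
logits = logits_from_embedding(hidden, embed)
skip_proj_out = is_tied_copy(state)
```

If such a check holds for a real checkpoint, skipping proj_out.weight during conversion loses no information.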

ggerganov avatar Nov 23 '22 20:11 ggerganov

I made a mistake and mixed up encoder_attn and self_attn. After comparing with https://github.com/luigisaetta/whisper-app/blob/main/match_layers.py and applying the change below, it works.

I'm also unsure which is the decoder and which is the encoder here (it's the same value): https://github.com/ggerganov/whisper.cpp/blob/388e9f79ad4c03801d3b2e2d14fb26c4faa938a6/models/convert-h5-to-ggml.py#L98-L103

@@ -9,26 +9,27 @@ import numpy as np

 from transformers import WhisperForConditionalGeneration

 conv_map = {
- 'encoder_attn.k_proj': 'attn.key',
+ 'self_attn.k_proj': 'attn.key',

@@ -139,7 +141,7 @@ for name in list_vars.keys():

     if nn[1] == "layers":
         nn[1] = "blocks"
-        if ".".join(nn[3:-1]) == "self_attn.k_proj":
+        if ".".join(nn[3:-1]) == "encoder_attn.k_proj":
             mapped = "attn.key" if nn[0] == "encoder" else "cross_attn.key"
         else:
             mapped = conv_map[".".join(nn[3:-1])]
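
With that change applied, the renaming logic can be sketched as one small function that works on tensor names only, so it can be checked without loading a model. The name rename_tensor and the split into two tables are illustrative choices, not the layout of the actual script. The name pairs come from the conv_map posted earlier in this thread, with the self-attention and cross-attention key projections listed explicitly instead of handled by a special case.

```python
# Per-layer names: Hugging Face suffix -> original Whisper suffix.
LAYER_MAP = {
    "self_attn.k_proj": "attn.key",
    "self_attn.q_proj": "attn.query",
    "self_attn.v_proj": "attn.value",
    "self_attn.out_proj": "attn.out",
    "self_attn_layer_norm": "attn_ln",
    "encoder_attn.k_proj": "cross_attn.key",
    "encoder_attn.q_proj": "cross_attn.query",
    "encoder_attn.v_proj": "cross_attn.value",
    "encoder_attn.out_proj": "cross_attn.out",
    "encoder_attn_layer_norm": "cross_attn_ln",
    "fc1": "mlp.0",
    "fc2": "mlp.2",
    "final_layer_norm": "mlp_ln",
}

# Names outside the layer stacks.
TOP_LEVEL_MAP = {
    "encoder.layer_norm.bias": "encoder.ln_post.bias",
    "encoder.layer_norm.weight": "encoder.ln_post.weight",
    "encoder.embed_positions.weight": "encoder.positional_embedding",
    "decoder.layer_norm.bias": "decoder.ln.bias",
    "decoder.layer_norm.weight": "decoder.ln.weight",
    "decoder.embed_positions.weight": "decoder.positional_embedding",
    "decoder.embed_tokens.weight": "decoder.token_embedding.weight",
}


def rename_tensor(hf_name):
    """Translate a Hugging Face state_dict key into the original Whisper name."""
    parts = hf_name.split(".")[1:]  # drop the leading "model."
    if len(parts) > 1 and parts[1] == "layers":
        parts[1] = "blocks"
        inner = ".".join(parts[3:-1])  # e.g. "self_attn.k_proj"
        return ".".join(parts[:3] + [LAYER_MAP[inner]] + parts[-1:])
    name = ".".join(parts)
    return TOP_LEVEL_MAP.get(name, name)
```

For example, rename_tensor("model.decoder.layers.0.encoder_attn.k_proj.weight") yields "decoder.blocks.0.cross_attn.key.weight", while the self-attention key of the same layer maps to "decoder.blocks.0.attn.key.weight".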

jzju avatar Nov 23 '22 21:11 jzju

Great! I confirm that it works now. Tried both models:

  • https://huggingface.co/openai/whisper-medium
  • https://huggingface.co/openai/whisper-base.en

Unsure about which is decode and encode here too (it's the same value)

The encoder comes first, then the decoder. I fixed the hparams order in the last commit.
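
A sketch of that header order, assuming the field sequence the loader prints in the log earlier in this thread (n_vocab, the four audio fields, the four text fields, then n_mels and f16). The hparams keys are the ones used in the posted script, the values are the whisper-base numbers from that log, and pack_header / unpack_header are hypothetical helper names.

```python
import struct

GGML_MAGIC = 0x67676D6C  # "ggml" in hex

FIELD_NAMES = (
    "magic", "n_vocab",
    "n_audio_ctx", "n_audio_state", "n_audio_head", "n_audio_layer",
    "n_text_ctx", "n_text_state", "n_text_head", "n_text_layer",
    "n_mels", "f16",
)
HEADER_FORMAT = "%di" % len(FIELD_NAMES)  # twelve native 32-bit ints


def pack_header(hparams, use_f16=True):
    """Pack the header with the encoder (audio) fields before the decoder (text) fields."""
    return struct.pack(
        HEADER_FORMAT,
        GGML_MAGIC,
        hparams["vocab_size"],
        hparams["max_source_positions"],     # n_audio_ctx
        hparams["d_model"],                  # n_audio_state
        hparams["encoder_attention_heads"],  # n_audio_head
        hparams["encoder_layers"],           # n_audio_layer
        hparams["max_length"],               # n_text_ctx
        hparams["d_model"],                  # n_text_state
        hparams["decoder_attention_heads"],  # n_text_head
        hparams["decoder_layers"],           # n_text_layer
        hparams["num_mel_bins"],
        int(use_f16),
    )


def unpack_header(blob):
    """Read the header back into a dict, e.g. to sanity-check a converted file."""
    size = struct.calcsize(HEADER_FORMAT)
    return dict(zip(FIELD_NAMES, struct.unpack(HEADER_FORMAT, blob[:size])))


# whisper-base values as shown in the loader log above.
base_hparams = {
    "vocab_size": 51865, "max_source_positions": 1500, "d_model": 512,
    "encoder_attention_heads": 8, "encoder_layers": 6, "max_length": 448,
    "decoder_attention_heads": 8, "decoder_layers": 6, "num_mel_bins": 80,
}
header = unpack_header(pack_header(base_hparams))
```

For models where the encoder and decoder have the same number of heads and layers the swapped order goes unnoticed, which is why it was easy to miss.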

ggerganov avatar Nov 23 '22 21:11 ggerganov

Is it possible to have the convert script support the Hugging Face format, like the one here: https://huggingface.co/openai/whisper-medium/tree/main ? The use case is to run fine-tuned models with whisper.cpp.

I don't understand how the model you mentioned differs from the model that comes with whisper.cpp (models/ggml-medium.bin). Is it different? Or do you have other models from Hugging Face in mind? If so, which ones?

abelbabel avatar Nov 24 '22 12:11 abelbabel

So looking a bit more into this, I think that proj_out.weight is not actually used:

https://github.com/huggingface/transformers/blob/9a5b84a0076a04fe9596da72e8668069d4f09ea0/src/transformers/models/whisper/modeling_whisper.py#L1099-L1106

Looking at this discussion, it seems like people are still struggling to make the fine-tuned models run with the original code from OpenAI. @luigisaetta claims to have successfully run it (openai/whisper#64 (reply in thread)), but they don't clarify what extra processing they did in addition to converting the model tensors (or at least, I don't see it).

Hopefully, they provide some more information on how to convert between the two models and if it is even possible to use the original code base to run HF models.

Hi @ggerganov, I have created a utility, based on an idea from @larsh0103, that first creates a dictionary mapping the layer names (the keys in the state_dict). You can see how to use it and how to load a custom fine-tuned model here:
https://github.com/luigisaetta/whisper-app/blob/main/match_layers.py

luigisaetta avatar Nov 24 '22 13:11 luigisaetta

@luigisaetta Hi, thanks for the help! We figured it out yesterday together with @jzju and we can now use HF models with whisper.cpp.

@abelbabel Yes, that model is the same. But it is possible to fine-tune your own model, and it will be saved in the same format, which you can now import and use in whisper.cpp.

ggerganov avatar Nov 24 '22 15:11 ggerganov