whisper.cpp
Convert Hugging Face model to ggml?
Is it possible to have the convert script support the Hugging Face format, like the one here https://huggingface.co/openai/whisper-medium/tree/main ? The use case is to run fine-tuned models with whisper.cpp.
I think this is a similar task to the one I did for the GPT-J model:
https://github.com/ggerganov/ggml/tree/master/examples/gpt-j
See the convert script there. If somebody wants to take a shot, go ahead; otherwise I'll add support at some point in the future.
I tried to get it to work, but it doesn't print anything:
$ ./main -m ../ggml-model.bin -f file.wav -l sv
whisper_model_load: loading model from '../ggml-model.bin'
whisper_model_load: n_vocab = 51865
whisper_model_load: n_audio_ctx = 1500
whisper_model_load: n_audio_state = 512
whisper_model_load: n_audio_head = 8
whisper_model_load: n_audio_layer = 6
whisper_model_load: n_text_ctx = 448
whisper_model_load: n_text_state = 512
whisper_model_load: n_text_head = 8
whisper_model_load: n_text_layer = 6
whisper_model_load: n_mels = 80
whisper_model_load: f16 = 1
whisper_model_load: type = 2
whisper_model_load: mem_required = 506.00 MB
whisper_model_load: adding 1608 extra tokens
whisper_model_load: ggml ctx size = 140.60 MB
whisper_model_load: memory size = 22.83 MB
whisper_model_load: model size = 140.54 MB
system_info: n_threads = 4 / 16 | AVX2 = 1 | AVX512 = 0 | NEON = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 |
main: processing '../file.wav' (1837163 samples, 114.8 sec), 4 threads, 1 processors, lang = sv, task = transcribe, timestamps = 1 ...
whisper_print_timings: load time = 158.53 ms
whisper_print_timings: mel time = 623.78 ms
whisper_print_timings: sample time = 241.47 ms
whisper_print_timings: encode time = 4833.27 ms / 805.55 ms per layer
whisper_print_timings: decode time = 3598.64 ms / 599.77 ms per layer
whisper_print_timings: total time = 9461.59 ms
Code:
conv_map = {
    'self_attn_layer_norm': 'attn_ln',
    'encoder_attn.k_proj': 'attn.key',
    'self_attn.out_proj': 'attn.out',
    'encoder_attn.out_proj': 'cross_attn.out',
    'self_attn.q_proj': 'attn.query',
    'encoder_attn.q_proj': 'cross_attn.query',
    'self_attn.v_proj': 'attn.value',
    'encoder_attn.v_proj': 'cross_attn.value',
    'encoder_attn_layer_norm': 'cross_attn_ln',
    'fc1': 'mlp.0',
    'fc2': 'mlp.2',
    'final_layer_norm': 'mlp_ln',
    'encoder.layer_norm.bias': 'encoder.ln_post.bias',
    'encoder.layer_norm.weight': 'encoder.ln_post.weight',
    'encoder.embed_positions.weight': 'encoder.positional_embedding',
    'decoder.layer_norm.bias': 'decoder.ln.bias',
    'decoder.layer_norm.weight': 'decoder.ln.weight',
    'decoder.embed_positions.weight': 'decoder.positional_embedding',
    'decoder.embed_tokens.weight': 'decoder.token_embedding.weight',
}
import json
import os
import struct

import numpy as np
import torch
from transformers import WhisperForConditionalGeneration

# Directory holding the Hugging Face checkpoint (vocab.json, config.json, weights)
dir_model = "whisper-base"
with open(dir_model + "/vocab.json", "r") as f:
    encoder = json.load(f)
with open(dir_model + "/added_tokens.json", "r") as f:
    encoder_added = json.load(f)
with open(dir_model + "/config.json", "r") as f:
    hparams = json.load(f)
model = WhisperForConditionalGeneration.from_pretrained(dir_model)
list_vars = model.state_dict()
dir_whisper = "whisper"
dir_out = "."
n_mels = hparams["num_mel_bins"]
with np.load(os.path.join(dir_whisper, "whisper/assets", "mel_filters.npz")) as f:
    filters = torch.from_numpy(f[f"mel_{n_mels}"])
multilingual = hparams["vocab_size"] == 51865
dir_tokenizer = os.path.join(dir_whisper, "whisper/assets", multilingual and "multilingual" or "gpt2")
fname_out = dir_out + "/ggml-model.bin"
with open(dir_tokenizer + "/vocab.json", "r", encoding="utf8") as f:
    tokens = json.load(f)
use_f16 = True
fout = open(fname_out, "wb")
fout.write(struct.pack("i", 0x67676d6c)) # magic: ggml in hex
fout.write(struct.pack("i", hparams["vocab_size"]))
fout.write(struct.pack("i", hparams["max_source_positions"]))
fout.write(struct.pack("i", hparams["d_model"]))
fout.write(struct.pack("i", hparams["decoder_attention_heads"]))
fout.write(struct.pack("i", hparams["decoder_layers"]))
fout.write(struct.pack("i", hparams["max_length"]))
fout.write(struct.pack("i", hparams["d_model"]))
fout.write(struct.pack("i", hparams["encoder_attention_heads"]))
fout.write(struct.pack("i", hparams["encoder_layers"]))
fout.write(struct.pack("i", hparams["num_mel_bins"]))
fout.write(struct.pack("i", use_f16))
fout.write(struct.pack("i", filters.shape[0]))
fout.write(struct.pack("i", filters.shape[1]))
for i in range(filters.shape[0]):
    for j in range(filters.shape[1]):
        fout.write(struct.pack("f", filters[i][j]))
byte_encoder = bytes_to_unicode()
byte_decoder = {v:k for k, v in byte_encoder.items()}
fout.write(struct.pack("i", len(tokens)))
for key in tokens:
    text = bytearray([byte_decoder[c] for c in key])
    fout.write(struct.pack("i", len(text)))
    fout.write(text)
for name in list_vars.keys():
    if name == "proj_out.weight":
        continue
    data = list_vars[name].squeeze().numpy()
    data = data.astype(np.float16)
    nn = name
    nn = nn.split(".")[1:]
    if nn[1] == "layers":
        nn[1] = "blocks"
        if ".".join(nn[3:-1]) == "self_attn.k_proj":
            mapped = "attn.key" if nn[0] == "encoder" else "cross_attn.key"
        else:
            mapped = conv_map[".".join(nn[3:-1])]
        name = ".".join(nn[:3] + [mapped] + nn[-1:])
    else:
        name = ".".join(nn)
        name = conv_map[name] if name in conv_map else name
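The snippet above calls bytes_to_unicode() without defining it. A minimal sketch of that helper follows, assuming it is the byte-to-unicode table used by the GPT-2 byte-level BPE tokenizer. The name token_to_bytes is hypothetical and only used here for illustration.

```python
def bytes_to_unicode():
    """Map every byte value (0-255) to a printable unicode character.

    Assumption: this mirrors the GPT-2 byte-level BPE table. Bytes that are
    already printable map to themselves; the rest are shifted past U+00FF.
    """
    bs = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    cs = bs[:]
    n = 0
    for b in range(2**8):
        if b not in bs:
            bs.append(b)
            cs.append(2**8 + n)
            n += 1
    return dict(zip(bs, (chr(c) for c in cs)))


def token_to_bytes(token, byte_decoder):
    # Hypothetical helper: turn a vocab.json key back into its raw bytes.
    return bytes(byte_decoder[c] for c in token)


byte_encoder = bytes_to_unicode()
byte_decoder = {v: k for k, v in byte_encoder.items()}
```

With this table, a vocab.json key is converted back to raw bytes before its length and contents are written to the output file, which is what the token loop in the snippet does.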
I added a conversion script:
https://github.com/ggerganov/whisper.cpp/blob/master/models/convert-h5-to-ggml.py
Use like this:
git clone https://github.com/openai/whisper
git clone https://github.com/ggerganov/whisper.cpp
git clone https://huggingface.co/openai/whisper-medium
python3 ./whisper.cpp/models/convert-h5-to-ggml.py ./whisper-medium/ ./whisper .
However, just as you noticed, it does not produce any output. The decoded tokens are invalid, but I haven't traced where the computation breaks down.
The proj_out.weight tensor from this model is currently ignored, but maybe it has to be used somehow? It's not present in the original OpenAI model, but it seems to be used by the "transformers" implementation. Not sure.
The printout of the whisper model is this:
Ending with (proj_out): Linear(in_features=512, out_features=51865, bias=False)
WhisperForConditionalGeneration(
(model): WhisperModel(
(encoder): WhisperEncoder(
(conv1): Conv1d(80, 512, kernel_size=(3,), stride=(1,), padding=(1,))
(conv2): Conv1d(512, 512, kernel_size=(3,), stride=(2,), padding=(1,))
(embed_positions): Embedding(1500, 512)
(layers): ModuleList(
(0): WhisperEncoderLayer(
(self_attn): WhisperAttention(
(k_proj): Linear(in_features=512, out_features=512, bias=False)
(v_proj): Linear(in_features=512, out_features=512, bias=True)
(q_proj): Linear(in_features=512, out_features=512, bias=True)
(out_proj): Linear(in_features=512, out_features=512, bias=True)
)
(self_attn_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
(activation_fn): GELUActivation()
(fc1): Linear(in_features=512, out_features=2048, bias=True)
(fc2): Linear(in_features=2048, out_features=512, bias=True)
(final_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
)
(1): WhisperEncoderLayer(
(self_attn): WhisperAttention(
(k_proj): Linear(in_features=512, out_features=512, bias=False)
(v_proj): Linear(in_features=512, out_features=512, bias=True)
(q_proj): Linear(in_features=512, out_features=512, bias=True)
(out_proj): Linear(in_features=512, out_features=512, bias=True)
)
(self_attn_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
(activation_fn): GELUActivation()
(fc1): Linear(in_features=512, out_features=2048, bias=True)
(fc2): Linear(in_features=2048, out_features=512, bias=True)
(final_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
)
(2): WhisperEncoderLayer(
(self_attn): WhisperAttention(
(k_proj): Linear(in_features=512, out_features=512, bias=False)
(v_proj): Linear(in_features=512, out_features=512, bias=True)
(q_proj): Linear(in_features=512, out_features=512, bias=True)
(out_proj): Linear(in_features=512, out_features=512, bias=True)
)
(self_attn_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
(activation_fn): GELUActivation()
(fc1): Linear(in_features=512, out_features=2048, bias=True)
(fc2): Linear(in_features=2048, out_features=512, bias=True)
(final_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
)
(3): WhisperEncoderLayer(
(self_attn): WhisperAttention(
(k_proj): Linear(in_features=512, out_features=512, bias=False)
(v_proj): Linear(in_features=512, out_features=512, bias=True)
(q_proj): Linear(in_features=512, out_features=512, bias=True)
(out_proj): Linear(in_features=512, out_features=512, bias=True)
)
(self_attn_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
(activation_fn): GELUActivation()
(fc1): Linear(in_features=512, out_features=2048, bias=True)
(fc2): Linear(in_features=2048, out_features=512, bias=True)
(final_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
)
(4): WhisperEncoderLayer(
(self_attn): WhisperAttention(
(k_proj): Linear(in_features=512, out_features=512, bias=False)
(v_proj): Linear(in_features=512, out_features=512, bias=True)
(q_proj): Linear(in_features=512, out_features=512, bias=True)
(out_proj): Linear(in_features=512, out_features=512, bias=True)
)
(self_attn_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
(activation_fn): GELUActivation()
(fc1): Linear(in_features=512, out_features=2048, bias=True)
(fc2): Linear(in_features=2048, out_features=512, bias=True)
(final_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
)
(5): WhisperEncoderLayer(
(self_attn): WhisperAttention(
(k_proj): Linear(in_features=512, out_features=512, bias=False)
(v_proj): Linear(in_features=512, out_features=512, bias=True)
(q_proj): Linear(in_features=512, out_features=512, bias=True)
(out_proj): Linear(in_features=512, out_features=512, bias=True)
)
(self_attn_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
(activation_fn): GELUActivation()
(fc1): Linear(in_features=512, out_features=2048, bias=True)
(fc2): Linear(in_features=2048, out_features=512, bias=True)
(final_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
)
)
(layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
)
(decoder): WhisperDecoder(
(embed_tokens): Embedding(51865, 512, padding_idx=50257)
(embed_positions): WhisperPositionalEmbedding(448, 512)
(layers): ModuleList(
(0): WhisperDecoderLayer(
(self_attn): WhisperAttention(
(k_proj): Linear(in_features=512, out_features=512, bias=False)
(v_proj): Linear(in_features=512, out_features=512, bias=True)
(q_proj): Linear(in_features=512, out_features=512, bias=True)
(out_proj): Linear(in_features=512, out_features=512, bias=True)
)
(activation_fn): GELUActivation()
(self_attn_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
(encoder_attn): WhisperAttention(
(k_proj): Linear(in_features=512, out_features=512, bias=False)
(v_proj): Linear(in_features=512, out_features=512, bias=True)
(q_proj): Linear(in_features=512, out_features=512, bias=True)
(out_proj): Linear(in_features=512, out_features=512, bias=True)
)
(encoder_attn_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
(fc1): Linear(in_features=512, out_features=2048, bias=True)
(fc2): Linear(in_features=2048, out_features=512, bias=True)
(final_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
)
(1): WhisperDecoderLayer(
(self_attn): WhisperAttention(
(k_proj): Linear(in_features=512, out_features=512, bias=False)
(v_proj): Linear(in_features=512, out_features=512, bias=True)
(q_proj): Linear(in_features=512, out_features=512, bias=True)
(out_proj): Linear(in_features=512, out_features=512, bias=True)
)
(activation_fn): GELUActivation()
(self_attn_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
(encoder_attn): WhisperAttention(
(k_proj): Linear(in_features=512, out_features=512, bias=False)
(v_proj): Linear(in_features=512, out_features=512, bias=True)
(q_proj): Linear(in_features=512, out_features=512, bias=True)
(out_proj): Linear(in_features=512, out_features=512, bias=True)
)
(encoder_attn_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
(fc1): Linear(in_features=512, out_features=2048, bias=True)
(fc2): Linear(in_features=2048, out_features=512, bias=True)
(final_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
)
(2): WhisperDecoderLayer(
(self_attn): WhisperAttention(
(k_proj): Linear(in_features=512, out_features=512, bias=False)
(v_proj): Linear(in_features=512, out_features=512, bias=True)
(q_proj): Linear(in_features=512, out_features=512, bias=True)
(out_proj): Linear(in_features=512, out_features=512, bias=True)
)
(activation_fn): GELUActivation()
(self_attn_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
(encoder_attn): WhisperAttention(
(k_proj): Linear(in_features=512, out_features=512, bias=False)
(v_proj): Linear(in_features=512, out_features=512, bias=True)
(q_proj): Linear(in_features=512, out_features=512, bias=True)
(out_proj): Linear(in_features=512, out_features=512, bias=True)
)
(encoder_attn_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
(fc1): Linear(in_features=512, out_features=2048, bias=True)
(fc2): Linear(in_features=2048, out_features=512, bias=True)
(final_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
)
(3): WhisperDecoderLayer(
(self_attn): WhisperAttention(
(k_proj): Linear(in_features=512, out_features=512, bias=False)
(v_proj): Linear(in_features=512, out_features=512, bias=True)
(q_proj): Linear(in_features=512, out_features=512, bias=True)
(out_proj): Linear(in_features=512, out_features=512, bias=True)
)
(activation_fn): GELUActivation()
(self_attn_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
(encoder_attn): WhisperAttention(
(k_proj): Linear(in_features=512, out_features=512, bias=False)
(v_proj): Linear(in_features=512, out_features=512, bias=True)
(q_proj): Linear(in_features=512, out_features=512, bias=True)
(out_proj): Linear(in_features=512, out_features=512, bias=True)
)
(encoder_attn_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
(fc1): Linear(in_features=512, out_features=2048, bias=True)
(fc2): Linear(in_features=2048, out_features=512, bias=True)
(final_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
)
(4): WhisperDecoderLayer(
(self_attn): WhisperAttention(
(k_proj): Linear(in_features=512, out_features=512, bias=False)
(v_proj): Linear(in_features=512, out_features=512, bias=True)
(q_proj): Linear(in_features=512, out_features=512, bias=True)
(out_proj): Linear(in_features=512, out_features=512, bias=True)
)
(activation_fn): GELUActivation()
(self_attn_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
(encoder_attn): WhisperAttention(
(k_proj): Linear(in_features=512, out_features=512, bias=False)
(v_proj): Linear(in_features=512, out_features=512, bias=True)
(q_proj): Linear(in_features=512, out_features=512, bias=True)
(out_proj): Linear(in_features=512, out_features=512, bias=True)
)
(encoder_attn_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
(fc1): Linear(in_features=512, out_features=2048, bias=True)
(fc2): Linear(in_features=2048, out_features=512, bias=True)
(final_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
)
(5): WhisperDecoderLayer(
(self_attn): WhisperAttention(
(k_proj): Linear(in_features=512, out_features=512, bias=False)
(v_proj): Linear(in_features=512, out_features=512, bias=True)
(q_proj): Linear(in_features=512, out_features=512, bias=True)
(out_proj): Linear(in_features=512, out_features=512, bias=True)
)
(activation_fn): GELUActivation()
(self_attn_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
(encoder_attn): WhisperAttention(
(k_proj): Linear(in_features=512, out_features=512, bias=False)
(v_proj): Linear(in_features=512, out_features=512, bias=True)
(q_proj): Linear(in_features=512, out_features=512, bias=True)
(out_proj): Linear(in_features=512, out_features=512, bias=True)
)
(encoder_attn_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
(fc1): Linear(in_features=512, out_features=2048, bias=True)
(fc2): Linear(in_features=2048, out_features=512, bias=True)
(final_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
)
)
(layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
)
)
(proj_out): Linear(in_features=512, out_features=51865, bias=False)
)
So looking a bit more into this, I think that proj_out.weight is not actually used:
https://github.com/huggingface/transformers/blob/9a5b84a0076a04fe9596da72e8668069d4f09ea0/src/transformers/models/whisper/modeling_whisper.py#L1099-L1106
Looking at this discussion, it seems like people are still struggling to make the fine-tuned models run with the original code from OpenAI. @luigisaetta claims to have successfully run it (https://github.com/openai/whisper/discussions/64#discussioncomment-4217106), but they don't clarify what extra processing they did in addition to converting the model tensors (or at least, I don't see it).
Hopefully, they provide some more information on how to convert between the two models and if it is even possible to use the original code base to run HF models.
I made a mistake and mixed up encoder_attn and self_attn. After comparing to https://github.com/luigisaetta/whisper-app/blob/main/match_layers.py and making the change below, it worked.
I'm also unsure which is decoder and which is encoder here (it's the same value): https://github.com/ggerganov/whisper.cpp/blob/388e9f79ad4c03801d3b2e2d14fb26c4faa938a6/models/convert-h5-to-ggml.py#L98-L103
@@ -9,26 +9,27 @@ import numpy as np
 from transformers import WhisperForConditionalGeneration
 conv_map = {
-    'encoder_attn.k_proj': 'attn.key',
+    'self_attn.k_proj': 'attn.key',
@@ -139,7 +141,7 @@ for name in list_vars.keys():
     if nn[1] == "layers":
         nn[1] = "blocks"
-        if ".".join(nn[3:-1]) == "self_attn.k_proj":
+        if ".".join(nn[3:-1]) == "encoder_attn.k_proj":
             mapped = "attn.key" if nn[0] == "encoder" else "cross_attn.key"
         else:
             mapped = conv_map[".".join(nn[3:-1])]
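To make the effect of this fix easier to check, here is a small standalone sketch of the corrected name mapping. It is not the actual convert-h5-to-ggml.py code: the names LAYER_MAP, TOP_MAP and map_name are hypothetical, and it assumes (as the model printout suggests) that encoder layers have no encoder_attn module, so encoder_attn.k_proj can map straight to cross_attn.key.

```python
# Per-layer tensors: Hugging Face suffix -> whisper.cpp suffix (corrected).
LAYER_MAP = {
    "self_attn.k_proj": "attn.key",
    "self_attn.q_proj": "attn.query",
    "self_attn.v_proj": "attn.value",
    "self_attn.out_proj": "attn.out",
    "self_attn_layer_norm": "attn_ln",
    "encoder_attn.k_proj": "cross_attn.key",
    "encoder_attn.q_proj": "cross_attn.query",
    "encoder_attn.v_proj": "cross_attn.value",
    "encoder_attn.out_proj": "cross_attn.out",
    "encoder_attn_layer_norm": "cross_attn_ln",
    "fc1": "mlp.0",
    "fc2": "mlp.2",
    "final_layer_norm": "mlp_ln",
}

# Tensors outside the layer stacks that are renamed as a whole.
TOP_MAP = {
    "encoder.layer_norm.bias": "encoder.ln_post.bias",
    "encoder.layer_norm.weight": "encoder.ln_post.weight",
    "encoder.embed_positions.weight": "encoder.positional_embedding",
    "decoder.layer_norm.bias": "decoder.ln.bias",
    "decoder.layer_norm.weight": "decoder.ln.weight",
    "decoder.embed_positions.weight": "decoder.positional_embedding",
    "decoder.embed_tokens.weight": "decoder.token_embedding.weight",
}


def map_name(hf_name):
    """Translate one Hugging Face state_dict key to the whisper.cpp name."""
    parts = hf_name.split(".")[1:]  # drop the leading "model."
    if len(parts) > 1 and parts[1] == "layers":
        parts[1] = "blocks"
        mapped = LAYER_MAP[".".join(parts[3:-1])]
        return ".".join(parts[:3] + [mapped] + parts[-1:])
    name = ".".join(parts)
    return TOP_MAP.get(name, name)


print(map_name("model.decoder.layers.3.self_attn.k_proj.weight"))
print(map_name("model.decoder.layers.3.encoder_attn.k_proj.weight"))
```

Keeping the mapping in a pure function like this makes it possible to spot a swapped self_attn / encoder_attn entry by checking a handful of names, without loading a model.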
Great! I confirm that it works now. Tried both models:
- https://huggingface.co/openai/whisper-medium
- https://huggingface.co/openai/whisper-base.en
> Unsure about which is decode and encode here too (it's the same value)
First is Encoder, then the Decoder.
I fixed the hparams order in the last commit.
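For reference, a sketch of the header with the encoder fields written before the decoder fields. The field order here is an assumption taken from the order in which whisper_model_load logs the values (n_vocab, n_audio_*, n_text_*, n_mels, f16); pack_header and the example values are hypothetical and not part of the actual script.

```python
import struct


def pack_header(hparams, use_f16=True):
    """Pack the magic number and hyperparameters as 32-bit ints.

    Assumption: the loader reads the fields in the order it logs them,
    i.e. encoder (audio) values first, then decoder (text) values.
    """
    fields = [
        0x67676D6C,                          # magic: "ggml" in hex
        hparams["vocab_size"],               # n_vocab
        hparams["max_source_positions"],     # n_audio_ctx
        hparams["d_model"],                  # n_audio_state
        hparams["encoder_attention_heads"],  # n_audio_head
        hparams["encoder_layers"],           # n_audio_layer
        hparams["max_length"],               # n_text_ctx
        hparams["d_model"],                  # n_text_state
        hparams["decoder_attention_heads"],  # n_text_head
        hparams["decoder_layers"],           # n_text_layer
        hparams["num_mel_bins"],             # n_mels
        int(use_f16),                        # f16 flag
    ]
    return b"".join(struct.pack("i", v) for v in fields)


# Made-up values with different encoder/decoder sizes, so a swapped
# order would be visible when the header is unpacked again.
example = {
    "vocab_size": 51865,
    "max_source_positions": 1500,
    "d_model": 512,
    "encoder_attention_heads": 8,
    "encoder_layers": 6,
    "max_length": 448,
    "decoder_attention_heads": 4,
    "decoder_layers": 2,
    "num_mel_bins": 80,
}
header = pack_header(example)
print(struct.unpack("12i", header))
```

For the stock checkpoints the encoder and decoder values are identical, which is why the earlier ordering did not visibly break anything.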
> Is it possible to have the convert script support the Hugging Face format, like the one here https://huggingface.co/openai/whisper-medium/tree/main ? The use case is to run fine-tuned models with whisper.cpp.
I don't understand in which way the model you mentioned is different from the model that comes with whisper.cpp (models/ggml-medium.bin). Is it different? Or do you have other models from Hugging Face in mind? If yes: which?
> So looking a bit more into this, I think that proj_out.weight is not actually used: https://github.com/huggingface/transformers/blob/9a5b84a0076a04fe9596da72e8668069d4f09ea0/src/transformers/models/whisper/modeling_whisper.py#L1099-L1106
> Looking at this discussion, it seems like people are still struggling to make the fine-tuned models run with the original code from OpenAI. @luigisaetta claims to have successfully run it (openai/whisper#64 (reply in thread)), but they don't clarify what extra processing they did in addition to converting the model tensors (or at least, I don't see it).
> Hopefully, they provide some more information on how to convert between the two models and if it is even possible to use the original code base to run HF models.
Hi @ggerganov, I have created a utility, based on an idea from @larsh0103, that first creates a dictionary mapping the layer names (the keys in the state_dict). You can see how to use it and how to load a custom-tuned model here:
https://github.com/luigisaetta/whisper-app/blob/main/match_layers.py
@luigisaetta
Hi, thanks for the help! We figured it out yesterday together with @jzju and we can now use HF models with whisper.cpp.
@abelbabel
Yes, that model is the same. But it is possible to fine-tune your own model, and it will be saved in the same format, which you can now import and use in whisper.cpp.