Support for whisper.cpp
Any chance of adding support for whisper.cpp? I know whisper.cpp is still stuck with the GGML format instead of GGUF, but it would be great to have portable whisper binaries that just work.
I agree, Whisper is awesome. I used to use it at the command line (which was very slow). I now use this project, and the inference times are 5-10x faster: https://github.com/jhj0517/Whisper-WebUI
I would also love speech input support for this, not only because it would be a really cool feature, but also because I sometimes get a bit of RSI so anything to help reduce the amount of typing needed is very helpful.
Yeah, this would be nice to have in the llama server. llamafile was the only way I figured out how to run these things I've been hearing about for a year!
Speech input is a big feature in my use case. I do it now with GPT-4 on iPhone, but doing the same with llamafile's server would be fantastic. What are the main blockers?
Please, very interested in this use-case!
Devil's advocate: it's not very difficult to run whisper separately and pipe any recognised sentences into Llamafile? I'm literally doing that right now, for example. It's also relatively easy to do in the browser.
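Roughly something like this, for example (just a sketch; the model files are placeholders, and I'm assuming whisper.cpp's main example plus a stock llamafile):
$ ./main -m ggml-base.en.bin -f question.wav -nt -otxt   # writes question.wav.txt
$ ./llamafile -m mistral-7b.Q4_K_M.gguf -p "$(cat question.wav.txt)"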
What would be the benefit? Would this integration allow the LLM to start processing detected words earlier?
It would have the same benefit that llamafile does. You wouldn't have to compile the software yourself.
Hi.
Is there any update regarding this request?
I was able to build whisper.cpp using cosmocc with very few modifications.
diff --git a/Makefile b/Makefile
index 93c89cd..b3a89d7 100644
--- a/Makefile
+++ b/Makefile
@@ -39,7 +39,7 @@ endif
#
CFLAGS = -I. -O3 -DNDEBUG -std=c11 -fPIC
-CXXFLAGS = -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC
+CXXFLAGS = -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -fexceptions
LDFLAGS =
ifdef MACOSX_DEPLOYMENT_TARGET
@@ -134,38 +134,38 @@ ifeq ($(UNAME_M),$(filter $(UNAME_M),x86_64 i686 amd64))
ifdef CPUINFO_CMD
AVX_M := $(shell $(CPUINFO_CMD) | grep -iwE 'AVX|AVX1.0')
ifneq (,$(AVX_M))
- CFLAGS += -mavx
- CXXFLAGS += -mavx
+ CFLAGS += -Xx86_64-mavx
+ CXXFLAGS += -Xx86_64-mavx
endif
AVX2_M := $(shell $(CPUINFO_CMD) | grep -iw 'AVX2')
ifneq (,$(AVX2_M))
- CFLAGS += -mavx2
- CXXFLAGS += -mavx2
+ CFLAGS += -Xx86_64-mavx2
+ CXXFLAGS += -Xx86_64-mavx2
endif
FMA_M := $(shell $(CPUINFO_CMD) | grep -iw 'FMA')
ifneq (,$(FMA_M))
- CFLAGS += -mfma
- CXXFLAGS += -mfma
+ CFLAGS += -Xx86_64-mfma
+ CXXFLAGS += -Xx86_64-mfma
endif
F16C_M := $(shell $(CPUINFO_CMD) | grep -iw 'F16C')
ifneq (,$(F16C_M))
- CFLAGS += -mf16c
- CXXFLAGS += -mf16c
+ CFLAGS += -Xx86_64-mf16c
+ CXXFLAGS += -Xx86_64-mf16c
endif
SSE3_M := $(shell $(CPUINFO_CMD) | grep -iwE 'PNI|SSE3')
ifneq (,$(SSE3_M))
- CFLAGS += -msse3
- CXXFLAGS += -msse3
+ CFLAGS += -Xx86_64-msse3
+ CXXFLAGS += -Xx86_64-msse3
endif
SSSE3_M := $(shell $(CPUINFO_CMD) | grep -iw 'SSSE3')
ifneq (,$(SSSE3_M))
- CFLAGS += -mssse3
- CXXFLAGS += -mssse3
+ CFLAGS += -Xx86_64-mssse3
+ CXXFLAGS += -Xx86_64-mssse3
endif
endif
endif
diff --git a/ggml.c b/ggml.c
index 4ee2c5e..521eafe 100644
--- a/ggml.c
+++ b/ggml.c
@@ -24,7 +24,7 @@
#include <stdarg.h>
#include <signal.h>
#if defined(__gnu_linux__)
-#include <syscall.h>
+#include <sys/syscall.h>
#endif
#ifdef GGML_USE_METAL
@@ -2069,6 +2069,8 @@ void ggml_numa_init(enum ggml_numa_strategy numa_flag) {
int getcpu_ret = 0;
#if __GLIBC__ > 2 || (__GLIBC__ == 2 && __GLIBC_MINOR__ > 28)
getcpu_ret = getcpu(&current_cpu, &g_state.numa.current_node);
+#elif defined(__COSMOPOLITAN__)
+ current_cpu = sched_getcpu(), getcpu_ret = 0;
#else
// old glibc doesn't have a wrapper for this call. Fall back on direct syscall
getcpu_ret = syscall(SYS_getcpu, &current_cpu, &g_state.numa.current_node);
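With that patch applied, it builds with a plain make invocation pointed at the cosmocc toolchain, e.g. (adjust the path to wherever you unpacked the toolchain):
$ make CC=cosmocc/bin/cosmocc CXX=cosmocc/bin/cosmoc++ main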
I made a couple of changes to cosmopolitan upstream that'll be incorporated in the next release to make this easier to build. More work would need to be done to package it as well as llamafile packages llama.cpp. But until then, you have this:
Wow, thanks @jart! That's amazing! Just confirming that it works like a charm :D
$ whisperfile -m ggml-model-q5_0.bin samples/jfk.wav
whisper_init_from_file_with_params_no_state: loading model from 'ggml-model-q5_0.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab = 51866
whisper_model_load: n_audio_ctx = 1500
whisper_model_load: n_audio_state = 1280
whisper_model_load: n_audio_head = 20
whisper_model_load: n_audio_layer = 32
whisper_model_load: n_text_ctx = 448
whisper_model_load: n_text_state = 1280
whisper_model_load: n_text_head = 20
whisper_model_load: n_text_layer = 32
whisper_model_load: n_mels = 128
whisper_model_load: ftype = 8
whisper_model_load: qntvr = 2
whisper_model_load: type = 5 (large v3)
whisper_model_load: adding 1609 extra tokens
whisper_model_load: n_langs = 100
whisper_model_load: CPU total size = 1080.47 MB
whisper_model_load: model size = 1080.47 MB
whisper_init_state: kv self size = 220.20 MB
whisper_init_state: kv cross size = 245.76 MB
whisper_init_state: compute buffer (conv) = 36.26 MB
whisper_init_state: compute buffer (encode) = 926.66 MB
whisper_init_state: compute buffer (cross) = 9.38 MB
whisper_init_state: compute buffer (decode) = 209.26 MB
system_info: n_threads = 4 / 12 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | METAL = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | CUDA = 0 | COREML = 0 | OPENVINO = 0 |
main: processing 'samples/jfk.wav' (176000 samples, 11.0 sec), 4 threads, 1 processors, 5 beams + best of 5, lang = en, task = transcribe, timestamps = 1 ...
[00:00:00.060 --> 00:00:07.500] And so, my dear Americans, do not ask what your country can do for you.
[00:00:07.500 --> 00:00:11.000] Ask what you can do for your country.
whisper_print_timings: load time = 1281.10 ms
whisper_print_timings: fallbacks = 0 p / 0 h
whisper_print_timings: mel time = 41.95 ms
whisper_print_timings: sample time = 102.85 ms / 159 runs ( 0.65 ms per run)
whisper_print_timings: encode time = 29479.98 ms / 1 runs (29479.98 ms per run)
whisper_print_timings: decode time = 38.76 ms / 1 runs ( 38.76 ms per run)
whisper_print_timings: batchd time = 3710.61 ms / 156 runs ( 23.79 ms per run)
whisper_print_timings: prompt time = 0.00 ms / 1 runs ( 0.00 ms per run)
whisper_print_timings: total time = 34662.24 ms
Any instructions on how to package it together with a GGML model? Thanks again!
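(I'm guessing the packaging is the same trick llamafile uses for GGUF weights, i.e. storing the model uncompressed inside the executable's zip archive, something along the lines of the commands below, but I haven't tried it yet.)
$ cp whisperfile whisper-large-v3.whisperfile
$ zip -0 -j whisper-large-v3.whisperfile ggml-model-q5_0.bin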
I just tried to compile it myself, to see if I could also get the stream binary to work. But after I apply the patch, the make command errors out with "exponent has no digits":
whisper.cpp:2575:27: error: exponent has no digits
2575 | double theta = (2*M_PI*i)/SIN_COS_N_COUNT;
| ^~~~
whisper.cpp:2672:42: error: exponent has no digits
2672 | output[i] = 0.5*(1.0 - cosf((2.0*M_PI*i)/(length + offset)));
| ^~~~
I run it like this:
$ make CC=bin/cosmocc CXX=bin/cosmoc++ stream
@versae in your cosmocc toolchain just change include/libc/math.h to use the non-hex constants instead:
#define M_E 2.7182818284590452354 /* 𝑒 */
#define M_LOG2E 1.4426950408889634074 /* log₂𝑒 */
#define M_LOG10E 0.43429448190325182765 /* log₁₀𝑒 */
#define M_LN2 0.69314718055994530942 /* logₑ2 */
#define M_LN10 2.30258509299404568402 /* logₑ10 */
#define M_PI 3.14159265358979323846 /* pi */
#define M_PI_2 1.57079632679489661923 /* pi/2 */
#define M_PI_4 0.78539816339744830962 /* pi/4 */
#define M_1_PI 0.31830988618379067154 /* 1/pi */
#define M_2_PI 0.63661977236758134308 /* 2/pi */
#define M_2_SQRTPI 1.12837916709551257390 /* 2/sqrt(pi) */
#define M_SQRT2 1.41421356237309504880 /* sqrt(2) */
#define M_SQRT1_2 0.70710678118654752440 /* 1/sqrt(2) */
This will ship in the next cosmocc release.
After some tweaking, I was able to compile my own cosmocc and then use it to compile main, quantize, and even server 🎉. However, for stream there seems to be some issue with the SDL2 library.
/usr/include/SDL2/SDL_config.h:4:10: fatal error: SDL2/_real_SDL_config.h: No such file or directory
4 | #include <SDL2/_real_SDL_config.h>
| ^~~~~~~~~~~~~~~~~~~~~~~~~
compilation terminated.
make: *** [Makefile:402: stream] Error 1
I'll keep investigating as this could easily be some rookie mistake on my side.
Would it make sense to create a whisperllama repo for this?
I would suggest calling it Whisperfile :)
So I have been using a more generic method to get voice in/out, and it works flawlessly. My issue has been getting the model to load at a decent speed. I have many variations of this code, and I don't know if this particular one is broken, but I know the voice in/out works flawlessly. Edit: it's also in Python, so not sure if it helps, but even if it doesn't, sometimes simplicity is best, and text-to-speech can go a long way.
import os
import threading

import pyttsx3
import simpleaudio
import speech_recognition as sr
from llama_cpp import Llama
from pydub import AudioSegment

# Load a GGUF model with llama-cpp (the path below is a placeholder; the
# original snippet referenced the sonu2023/Mistral-7B-Vatax-v1-q8_0-GUFF
# weights, which need to be downloaded as a local .gguf file for Llama())
llm = Llama(model_path="./mistral-7b-q8_0.gguf")

recognizer = sr.Recognizer()
chatbot_busy = False
engine = pyttsx3.init()

def play_activation_sound():
    # Replace './computer.wav' with the path to your activation sound in WAV format
    activation_sound = AudioSegment.from_file('./computer.wav')
    simpleaudio.play_buffer(activation_sound.raw_data,
                            num_channels=activation_sound.channels,
                            bytes_per_sample=activation_sound.sample_width,
                            sample_rate=activation_sound.frame_rate)

def chatbot_response(user_input):
    global chatbot_busy
    if user_input and not chatbot_busy:
        chatbot_busy = True
        print("User:", user_input)
        # Generate a response using llama-cpp's chat completion API
        result = llm.create_chat_completion(
            messages=[{"role": "user", "content": user_input}])
        response = result["choices"][0]["message"]["content"]
        print("Chatbot:", response)
        chatbot_busy = False
        # Text-to-speech with pyttsx3
        text_to_speech(response)

def text_to_speech(text):
    # Save the synthesized speech to a temporary WAV file
    engine.save_to_file(text, 'output.wav')
    engine.runAndWait()
    # Play the temporary WAV file
    synthesized_sound = AudioSegment.from_file('output.wav')
    simpleaudio.play_buffer(synthesized_sound.raw_data,
                            num_channels=synthesized_sound.channels,
                            bytes_per_sample=synthesized_sound.sample_width,
                            sample_rate=synthesized_sound.frame_rate)
    # Remove the temporary WAV file
    os.remove('output.wav')

def listen_for_input():
    global chatbot_busy
    with sr.Microphone() as source:
        recognizer.adjust_for_ambient_noise(source)
        while True:
            try:
                print("Listening...")
                audio_data = recognizer.listen(source)
                user_input = recognizer.recognize_google(audio_data).lower()
                print("User:", user_input)
                if 'computer' in user_input:
                    print("Chatbot activated. Speak now.")
                    play_activation_sound()
                    audio_data = recognizer.listen(source)
                    print("Listening...")
                    user_input = recognizer.recognize_google(audio_data).lower()
                    # Generate and respond using llama-cpp
                    chatbot_response(user_input)
            except sr.UnknownValueError:
                print("Could not understand audio. Please try again.")
            except Exception as e:
                print(f"An error occurred: {e}")

# Start listening for input
input_thread = threading.Thread(target=listen_for_input)
input_thread.start()
@jart I am working on getting a version of whisper.cpp built with llamafile, specifically the server example.
The executable itself is working and seems to be compiling properly for CUDA. However, I would love some help with the file loading from within the zipaligned archive. If you could provide some guidance on what needs to be done to implement this portion, that would be great.
I have replaced the std::ifstream opening with llamafile_open_gguf, but I am running into errors with this. I recognize this function may need modification in order to load the whisper models, which are not .gguf files. Currently I get the warning "warning: not a pkzip archive", and it seems like it is trying to load the file from the local directory as opposed to from the zipaligned version. I'm not sure if I need to manipulate the filepath in some way or if this is handled by some utility function.
I am currently using the files llama.cpp and server.cpp as reference for what I should be doing, but would love any help if you know the implementation off the top of your head.
If you've already discovered llamafile/llamafile.c then I'm not sure what other high level guidance I can offer you.
Thanks @jart, that was all I needed. My C skills are a bit rusty, so it was great to know I wasn't missing anything obvious; I was just forgetting some C basics.
For the time being I've forked the llamafile into: https://github.com/cjpais/whisperfile
If it makes sense to integrate this directly into llamafile, I am happy to clean up the code and submit a PR. If so, just let me know how you would like the dirs to be structured.
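In the meantime, if you want to poke at it, the fork keeps whisper.cpp's stock server interface, so a request along these lines should work (assuming the default port and the upstream /inference endpoint):
$ curl 127.0.0.1:8080/inference -F file=@samples/jfk.wav -F response_format=text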
Why not just try to integrate/cosmopolitan-ize talk-llama into llamafile? Didn't he already do all the heavy lifting around this?
That can also be done, and probably fairly easily in the whisperfile repo. I needed the server for a project I am doing, so that was my primary focus. If there is enough interest I can port over talk-llama; happy to accept PRs as well.