Support for whisper.cpp
Any chance of adding support for whisper.cpp? I know whisper.cpp is still stuck with the GGML format instead of GGUF, but it would be great to have portable whisper binaries that just work.
I agree, Whisper is awesome. I used to use it at the command line (which was very slow). I now use this project, and the inference times are 5-10x faster: https://github.com/jhj0517/Whisper-WebUI
I would also love speech input support for this, not only because it would be a really cool feature, but also because I sometimes get a bit of RSI so anything to help reduce the amount of typing needed is very helpful.
Yeah, this would be nice to have in the llama server. llamafile was the only way I figured out how to run these things I've been hearing about for a year!
Speech input is a big feature in my use case. I do it now with GPT-4 on iPhone, but doing the same with llamafile's server would be fantastic. What are the main blockers?
Please, very interested in this use-case!
Devil's advocate: it's not very difficult to run whisper separately and pipe any recognised sentences into Llamafile? I'm literally doing that right now, for example. It's also relatively easy to do in the browser.
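Roughly something like this, for example (just a sketch; the model files are placeholders, and I'm assuming whisper.cpp's main example plus a stock llamafile):
$ ./main -m ggml-base.en.bin -f question.wav -nt -otxt   # writes question.wav.txt
$ ./llamafile -m mistral-7b.Q4_K_M.gguf -p "$(cat question.wav.txt)"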
What would be the benefit? Would this integration allow the LLM to start processing detected words earlier?
It would have the same benefit that llamafile does. You wouldn't have to compile the software yourself.
Hi.
Is there any update regarding this request?
I was able to build whisper.cpp using cosmocc with very few modifications.
diff --git a/Makefile b/Makefile
index 93c89cd..b3a89d7 100644
--- a/Makefile
+++ b/Makefile
@@ -39,7 +39,7 @@ endif
#
CFLAGS = -I. -O3 -DNDEBUG -std=c11 -fPIC
-CXXFLAGS = -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC
+CXXFLAGS = -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -fexceptions
LDFLAGS =
ifdef MACOSX_DEPLOYMENT_TARGET
@@ -134,38 +134,38 @@ ifeq ($(UNAME_M),$(filter $(UNAME_M),x86_64 i686 amd64))
ifdef CPUINFO_CMD
AVX_M := $(shell $(CPUINFO_CMD) | grep -iwE 'AVX|AVX1.0')
ifneq (,$(AVX_M))
- CFLAGS += -mavx
- CXXFLAGS += -mavx
+ CFLAGS += -Xx86_64-mavx
+ CXXFLAGS += -Xx86_64-mavx
endif
AVX2_M := $(shell $(CPUINFO_CMD) | grep -iw 'AVX2')
ifneq (,$(AVX2_M))
- CFLAGS += -mavx2
- CXXFLAGS += -mavx2
+ CFLAGS += -Xx86_64-mavx2
+ CXXFLAGS += -Xx86_64-mavx2
endif
FMA_M := $(shell $(CPUINFO_CMD) | grep -iw 'FMA')
ifneq (,$(FMA_M))
- CFLAGS += -mfma
- CXXFLAGS += -mfma
+ CFLAGS += -Xx86_64-mfma
+ CXXFLAGS += -Xx86_64-mfma
endif
F16C_M := $(shell $(CPUINFO_CMD) | grep -iw 'F16C')
ifneq (,$(F16C_M))
- CFLAGS += -mf16c
- CXXFLAGS += -mf16c
+ CFLAGS += -Xx86_64-mf16c
+ CXXFLAGS += -Xx86_64-mf16c
endif
SSE3_M := $(shell $(CPUINFO_CMD) | grep -iwE 'PNI|SSE3')
ifneq (,$(SSE3_M))
- CFLAGS += -msse3
- CXXFLAGS += -msse3
+ CFLAGS += -Xx86_64-msse3
+ CXXFLAGS += -Xx86_64-msse3
endif
SSSE3_M := $(shell $(CPUINFO_CMD) | grep -iw 'SSSE3')
ifneq (,$(SSSE3_M))
- CFLAGS += -mssse3
- CXXFLAGS += -mssse3
+ CFLAGS += -Xx86_64-mssse3
+ CXXFLAGS += -Xx86_64-mssse3
endif
endif
endif
diff --git a/ggml.c b/ggml.c
index 4ee2c5e..521eafe 100644
--- a/ggml.c
+++ b/ggml.c
@@ -24,7 +24,7 @@
#include <stdarg.h>
#include <signal.h>
#if defined(__gnu_linux__)
-#include <syscall.h>
+#include <sys/syscall.h>
#endif
#ifdef GGML_USE_METAL
@@ -2069,6 +2069,8 @@ void ggml_numa_init(enum ggml_numa_strategy numa_flag) {
int getcpu_ret = 0;
#if __GLIBC__ > 2 || (__GLIBC__ == 2 && __GLIBC_MINOR__ > 28)
getcpu_ret = getcpu(&current_cpu, &g_state.numa.current_node);
+#elif defined(__COSMOPOLITAN__)
+ current_cpu = sched_getcpu(), getcpu_ret = 0;
#else
// old glibc doesn't have a wrapper for this call. Fall back on direct syscall
getcpu_ret = syscall(SYS_getcpu, &current_cpu, &g_state.numa.current_node);
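With that patch applied, it builds with a plain make invocation pointed at the cosmocc toolchain, e.g. (adjust the path to wherever you unpacked the toolchain):
$ make CC=cosmocc/bin/cosmocc CXX=cosmocc/bin/cosmoc++ main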
I made a couple of changes to cosmopolitan upstream that'll be incorporated in the next release to make this easier to build. More work would need to be done to package it as well as llamafile packages llama.cpp. But until then, you have this:
Wow, thanks @jart! That's amazing! Just confirming that it works like a charm :D
$ whisperfile -m ggml-model-q5_0.bin samples/jfk.wav
whisper_init_from_file_with_params_no_state: loading model from 'ggml-model-q5_0.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab = 51866
whisper_model_load: n_audio_ctx = 1500
whisper_model_load: n_audio_state = 1280
whisper_model_load: n_audio_head = 20
whisper_model_load: n_audio_layer = 32
whisper_model_load: n_text_ctx = 448
whisper_model_load: n_text_state = 1280
whisper_model_load: n_text_head = 20
whisper_model_load: n_text_layer = 32
whisper_model_load: n_mels = 128
whisper_model_load: ftype = 8
whisper_model_load: qntvr = 2
whisper_model_load: type = 5 (large v3)
whisper_model_load: adding 1609 extra tokens
whisper_model_load: n_langs = 100
whisper_model_load: CPU total size = 1080.47 MB
whisper_model_load: model size = 1080.47 MB
whisper_init_state: kv self size = 220.20 MB
whisper_init_state: kv cross size = 245.76 MB
whisper_init_state: compute buffer (conv) = 36.26 MB
whisper_init_state: compute buffer (encode) = 926.66 MB
whisper_init_state: compute buffer (cross) = 9.38 MB
whisper_init_state: compute buffer (decode) = 209.26 MB
system_info: n_threads = 4 / 12 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | METAL = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | CUDA = 0 | COREML = 0 | OPENVINO = 0 |
main: processing 'samples/jfk.wav' (176000 samples, 11.0 sec), 4 threads, 1 processors, 5 beams + best of 5, lang = en, task = transcribe, timestamps = 1 ...
[00:00:00.060 --> 00:00:07.500] And so, my dear Americans, do not ask what your country can do for you.
[00:00:07.500 --> 00:00:11.000] Ask what you can do for your country.
whisper_print_timings: load time = 1281.10 ms
whisper_print_timings: fallbacks = 0 p / 0 h
whisper_print_timings: mel time = 41.95 ms
whisper_print_timings: sample time = 102.85 ms / 159 runs ( 0.65 ms per run)
whisper_print_timings: encode time = 29479.98 ms / 1 runs (29479.98 ms per run)
whisper_print_timings: decode time = 38.76 ms / 1 runs ( 38.76 ms per run)
whisper_print_timings: batchd time = 3710.61 ms / 156 runs ( 23.79 ms per run)
whisper_print_timings: prompt time = 0.00 ms / 1 runs ( 0.00 ms per run)
whisper_print_timings: total time = 34662.24 ms
Any instructions on how to package it together with a GGML model? Thanks again!
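(I'm guessing the packaging is the same trick llamafile uses for GGUF weights, i.e. storing the model uncompressed inside the executable's zip archive, something along the lines of the commands below, but I haven't tried it yet.)
$ cp whisperfile whisper-large-v3.whisperfile
$ zip -0 -j whisper-large-v3.whisperfile ggml-model-q5_0.bin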
I just tried to compile it myself, to see if I could also get the stream binary to work. But after I apply the patch, the make command errors out with "exponent has no digits":
whisper.cpp:2575:27: error: exponent has no digits
2575 | double theta = (2*M_PI*i)/SIN_COS_N_COUNT;
| ^~~~
whisper.cpp:2672:42: error: exponent has no digits
2672 | output[i] = 0.5*(1.0 - cosf((2.0*M_PI*i)/(length + offset)));
| ^~~~
I run it like this:
$ make CC=bin/cosmocc CXX=bin/cosmoc++ stream
@versae in your cosmocc toolchain just change include/libc/math.h to use the non-hex constants instead:
#define M_E 2.7182818284590452354 /* 𝑒 */
#define M_LOG2E 1.4426950408889634074 /* log₂𝑒 */
#define M_LOG10E 0.43429448190325182765 /* log₁₀𝑒 */
#define M_LN2 0.69314718055994530942 /* logₑ2 */
#define M_LN10 2.30258509299404568402 /* logₑ10 */
#define M_PI 3.14159265358979323846 /* pi */
#define M_PI_2 1.57079632679489661923 /* pi/2 */
#define M_PI_4 0.78539816339744830962 /* pi/4 */
#define M_1_PI 0.31830988618379067154 /* 1/pi */
#define M_2_PI 0.63661977236758134308 /* 2/pi */
#define M_2_SQRTPI 1.12837916709551257390 /* 2/sqrt(pi) */
#define M_SQRT2 1.41421356237309504880 /* sqrt(2) */
#define M_SQRT1_2 0.70710678118654752440 /* 1/sqrt(2) */
This will ship in the next cosmocc release.
After some tweaking, I was able to compile my own cosmocc and then use it to compile main, quantize, and even server 🎉. However, for stream there seems to be some issue with the SDL2 library.
/usr/include/SDL2/SDL_config.h:4:10: fatal error: SDL2/_real_SDL_config.h: No such file or directory
4 | #include <SDL2/_real_SDL_config.h>
| ^~~~~~~~~~~~~~~~~~~~~~~~~
compilation terminated.
make: *** [Makefile:402: stream] Error 1
I'll keep investigating as this could easily be some rookie mistake on my side.
Would it make sense to create a whisperllama repo for this?
I would suggest calling it Whisperfile :)
So I have been using a more generic method to get voice in/out, and it works flawlessly. My issue has been getting the model to load at a decent speed. I have many variations of this code, and I don't know if this particular one is broken, but I know the voice in/out works flawlessly. Edit: it's also in Python, so not sure if it helps, but even if it doesn't, sometimes simplicity is best, and text-to-speech can go a long way.
import os
import threading

import pyttsx3
import simpleaudio
import speech_recognition as sr
from llama_cpp import Llama
from pydub import AudioSegment

# Load a GGUF model with llama-cpp (the path below is a placeholder; the
# original snippet referenced the sonu2023/Mistral-7B-Vatax-v1-q8_0-GUFF
# weights, which need to be downloaded as a local .gguf file for Llama())
llm = Llama(model_path="./mistral-7b-q8_0.gguf")

recognizer = sr.Recognizer()
chatbot_busy = False
engine = pyttsx3.init()

def play_activation_sound():
    # Replace './computer.wav' with the path to your activation sound in WAV format
    activation_sound = AudioSegment.from_file('./computer.wav')
    simpleaudio.play_buffer(activation_sound.raw_data,
                            num_channels=activation_sound.channels,
                            bytes_per_sample=activation_sound.sample_width,
                            sample_rate=activation_sound.frame_rate)

def chatbot_response(user_input):
    global chatbot_busy
    if user_input and not chatbot_busy:
        chatbot_busy = True
        print("User:", user_input)
        # Generate a response using llama-cpp's chat completion API
        result = llm.create_chat_completion(
            messages=[{"role": "user", "content": user_input}])
        response = result["choices"][0]["message"]["content"]
        print("Chatbot:", response)
        chatbot_busy = False
        # Text-to-speech with pyttsx3
        text_to_speech(response)

def text_to_speech(text):
    # Save the synthesized speech to a temporary WAV file
    engine.save_to_file(text, 'output.wav')
    engine.runAndWait()
    # Play the temporary WAV file
    synthesized_sound = AudioSegment.from_file('output.wav')
    simpleaudio.play_buffer(synthesized_sound.raw_data,
                            num_channels=synthesized_sound.channels,
                            bytes_per_sample=synthesized_sound.sample_width,
                            sample_rate=synthesized_sound.frame_rate)
    # Remove the temporary WAV file
    os.remove('output.wav')

def listen_for_input():
    global chatbot_busy
    with sr.Microphone() as source:
        recognizer.adjust_for_ambient_noise(source)
        while True:
            try:
                print("Listening...")
                audio_data = recognizer.listen(source)
                user_input = recognizer.recognize_google(audio_data).lower()
                print("User:", user_input)
                if 'computer' in user_input:
                    print("Chatbot activated. Speak now.")
                    play_activation_sound()
                    audio_data = recognizer.listen(source)
                    print("Listening...")
                    user_input = recognizer.recognize_google(audio_data).lower()
                    # Generate and respond using llama-cpp
                    chatbot_response(user_input)
            except sr.UnknownValueError:
                print("Could not understand audio. Please try again.")
            except Exception as e:
                print(f"An error occurred: {e}")

# Start listening for input
input_thread = threading.Thread(target=listen_for_input)
input_thread.start()
@jart I am working on getting a version of whisper.cpp built with llamafile, specifically the server example.
The executable itself is working and seems to be compiling properly for CUDA. However, I would love some help with the file loading from within the zipaligned archive. If you could provide some guidance on what needs to be done to implement this portion, that would be great.
I have replaced the std::ifstream opening with llamafile_open_gguf, but I am running into errors with this. I recognize this function may need modification in order to load the whisper models, which are not .gguf files. Currently I get the warning "warning: not a pkzip archive", and it seems like it is trying to load the file from the local directory as opposed to from the zipaligned version. I'm not sure if I need to manipulate the filepath in some way or if this is handled by some utility function.
I am currently using the files llama.cpp and server.cpp as reference for what I should be doing, but would love any help if you know the implementation off the top of your head.
If you've already discovered llamafile/llamafile.c then I'm not sure what other high level guidance I can offer you.
Thanks @jart, that was all I needed. My C skills are a bit rusty, so it was great to know I wasn't missing anything obvious; I was just forgetting some C basics.
For the time being I've forked the llamafile into: https://github.com/cjpais/whisperfile
If it makes sense to integrate this directly into llamafile, I am happy to clean up the code and submit a PR. If so, just let me know how you would like the dirs to be structured.
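In the meantime, if you want to poke at it, the fork keeps whisper.cpp's stock server interface, so a request along these lines should work (assuming the default port and the upstream /inference endpoint):
$ curl 127.0.0.1:8080/inference -F file=@samples/jfk.wav -F response_format=text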
Why not just try to integrate/cosmopolitan-ize talk-llama into llamafile? Didn't he already do all the heavy lifting around this?
That can also be done, and probably fairly easily in the whisperfile repo. I needed the server for a project I am doing, so that was my primary focus. If there is enough interest I can port over talk-llama; happy to accept PRs as well.