Python bindings (C-style API)

Open ArtyomZemlyak opened this issue 1 year ago • 63 comments

Good day everyone! I'm thinking about bindings for Python.

So far, I'm interested in 4 functionalities:

  1. Encoder processing
  2. Decoder processing
  3. Transcription of audio (feed audio bytes, get text)
  4. Item 3 plus timestamps for every word (feed audio bytes, get text + the time of each word). Of course, it's too early to think about word timestamps, since even the Python implementation does not handle them well yet.

Perhaps in the near future I will try to take up this task, but I have no experience with Python bindings. So, if there are craftsmen who can do it quickly (if it can be done quickly... 😃), that would be cool!

ArtyomZemlyak avatar Oct 01 '22 07:10 ArtyomZemlyak

A workaround:

Building

main: ggml.o main.o
	g++ -L ggml.o -c -fPIC main.cpp -o main.o
	g++ -L ggml.o -shared -Wl,-soname,main.so -o main.so main.o ggml.o
	g++ -pthread -o main ggml.o main.o
	./main -h

ggml.o: ggml.c ggml.h
	gcc -O3 -mavx -mavx2 -mfma -mf16c -c -fPIC ggml.c -o ggml.o
	gcc -shared -Wl,-soname,ggml.so -o ggml.so ggml.o

main.o: main.cpp ggml.h
	g++ -pthread -O3 -std=c++11 -c main.cpp

Run main

import ctypes
import pathlib


if __name__ == "__main__":
    # Load the shared library into ctypes
    libname = pathlib.Path().absolute() / "main.so"
    whisper = ctypes.CDLL(libname)

    whisper.main.restype = None
    whisper.main.argtypes = ctypes.c_int, ctypes.POINTER(ctypes.c_char_p)

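    # argv[0] is the program name by C convention; real options start at argv[1],
    # so without this first entry "-nt" would be skipped by the argument parser
    args = (ctypes.c_char_p * 10)(
        b"main",
        b"-nt",
        b"--language", b"ru",
        b"-t", b"8",
        b"-m", b"../models/ggml-model-tiny.bin",
        b"-f", b"../audio/cuker1.wav"
    )
    whisper.main(len(args), args)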

And it works!
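
One detail worth keeping in mind when calling a C main through ctypes: by convention argv[0] holds the program name, and argument parsers usually start reading at index 1. A small helper can build the array so that the first real option is not skipped. This is only a sketch; make_argv is a hypothetical name and not part of whisper.cpp.

```python
import ctypes


def make_argv(program, options):
    """Build (argc, argv) for a C-style main(int, char **).

    `program` fills argv[0]; `options` are the real command-line options.
    """
    encoded = [program.encode("utf-8")] + [o.encode("utf-8") for o in options]
    argv = (ctypes.c_char_p * len(encoded))(*encoded)
    return len(encoded), argv


# Illustrative values only; the pair would be passed as main(argc, argv).
argc, argv = make_argv("main", ["-nt", "--language", "ru", "-t", "8"])
print(argc, argv[0], argv[1])
```

Keeping the array in a Python variable for the duration of the call matters, because ctypes does not copy the strings.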

ArtyomZemlyak avatar Oct 01 '22 08:10 ArtyomZemlyak

But exposing specific functions is more difficult:

  • The model needs to be loaded at the C++ level
  • Python needs access to its encode and decode methods
  • The process holding the loaded model has to run in parallel with Python

It might be worth running Python and C++ in different threads/processes and sharing information between them when it's needed.
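
One way to picture the "model stays loaded while Python keeps running" idea is a worker thread that owns the model and serves requests from a queue. The sketch below uses only the standard library. load_model and transcribe are hypothetical stand-ins for calls into the C++ library, and the model path is a placeholder. Since ctypes releases the GIL while a function from a ctypes.CDLL library is running, a thread may be enough and a separate process may not be needed.

```python
import queue
import threading


def load_model(path):
    # Hypothetical stand-in for loading the model on the C++ side.
    return {"path": path}


def transcribe(model, audio):
    # Hypothetical stand-in for the real inference call.
    return f"{len(audio)} samples transcribed with {model['path']}"


class ModelWorker:
    """Owns the model in a background thread and serves jobs from a queue."""

    def __init__(self, model_path):
        self._jobs = queue.Queue()
        self._thread = threading.Thread(
            target=self._run, args=(model_path,), daemon=True
        )
        self._thread.start()

    def _run(self, model_path):
        model = load_model(model_path)  # loaded once, reused for every job
        while True:
            job = self._jobs.get()
            if job is None:  # sentinel: shut down
                break
            audio, reply = job
            reply.put(transcribe(model, audio))

    def submit(self, audio):
        """Queue audio for transcription; returns a queue holding the result."""
        reply = queue.Queue(maxsize=1)
        self._jobs.put((audio, reply))
        return reply

    def close(self):
        self._jobs.put(None)
        self._thread.join()


worker = ModelWorker("models/ggml-tiny.bin")  # placeholder path
pending = worker.submit([0.0] * 16000)  # returns immediately
text = pending.get()                    # blocks until the worker answers
worker.close()
print(text)
```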

ArtyomZemlyak avatar Oct 01 '22 08:10 ArtyomZemlyak

Thank you very much for your interest in the project!

I think we first need a proper C-style wrapper of the model loading / encode and decode functionality / sampling strategies. After that, creating Python and other language bindings will be easy. I've done similar work in my 'ggwave' project.

I agree that the encode and decode functionality should be exposed through the API as you suggested. It would give more flexibility to the users of the library/bindings.

ggerganov avatar Oct 01 '22 10:10 ggerganov

@ArtyomZemlyak First you reinvent the pytorch functions in c, then you want python bindings around them. Isn't the end result the same as what we have in pytorch?

aichr avatar Oct 04 '22 09:10 aichr

The initial API is now available on master:

https://github.com/ggerganov/whisper.cpp/blob/master/whisper.h

The first part allows more fine-grained control over the inference and also allows the user to implement their own sampling strategy using the predicted probabilities for each token.

The second part of the API includes methods for full inference - you simply provide the audio samples and choose the sampling parameters.

Most likely the API will change with time, but this is a good starting point.
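
To illustrate what "implement their own sampling strategy" can mean, here is a minimal sketch in plain Python. It assumes you already have one probability per token as a list of floats; how those values are obtained through whisper.h is not shown, and the function names are illustrative only.

```python
import random


def sample_greedy(probs):
    """Return the index of the most probable token."""
    return max(range(len(probs)), key=probs.__getitem__)


def sample_top_k(probs, k, rng=random):
    """Sample a token index from the k most probable tokens."""
    top = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)[:k]
    weights = [probs[i] for i in top]
    return rng.choices(top, weights=weights, k=1)[0]


# Made-up probabilities for a four-token vocabulary.
probs = [0.05, 0.70, 0.20, 0.05]
print(sample_greedy(probs))
print(sample_top_k(probs, k=2, rng=random.Random(0)))
```

Greedy sampling is deterministic, while top-k trades some of that determinism for variety; which one suits transcription better is a judgment call.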

ggerganov avatar Oct 04 '22 20:10 ggerganov

This is as far as I got trying to get the API working in Python.

It loads the model successfully, but gets a segmentation fault on whisper_full.

Any ideas?

import ctypes
import pathlib

if __name__ == "__main__":
    libname = pathlib.Path().absolute() / "whisper.so"
    whisper = ctypes.CDLL(libname)
    modelpath = b"models/ggml-medium.bin"
    model = whisper.whisper_init(modelpath)
    params = whisper.whisper_full_default_params(b"WHISPER_DECODE_GREEDY")
    w = open('samples/jfk.wav', "rb").read()
    result = whisper.whisper_full(model, params, w, b"16000")
    # Segmentation fault
    

Edit - Got some debugging info from gdb but it didn't help much: 0x00007ffff67916c6 in log_mel_spectrogram(float const*, int, int, int, int, int, int, whisper_filters const&, whisper_mel&)

richardburleigh avatar Oct 09 '22 13:10 richardburleigh

Here is one way to achieve this:

# build shared libwhisper.so
gcc -O3 -std=c11   -pthread -mavx -mavx2 -mfma -mf16c -fPIC -c ggml.c
g++ -O3 -std=c++11 -pthread --shared -fPIC -static-libstdc++ whisper.cpp ggml.o -o libwhisper.so

Use it from Python like this:

import ctypes
import pathlib

# this is needed to read the WAV file properly
from scipy.io import wavfile

libname     = "libwhisper.so"
fname_model = "models/ggml-tiny.en.bin"
fname_wav   = "samples/jfk.wav"

# this needs to match the C struct in whisper.h
class WhisperFullParams(ctypes.Structure):
    _fields_ = [
        ("strategy",             ctypes.c_int),
        ("n_threads",            ctypes.c_int),
        ("offset_ms",            ctypes.c_int),
        ("translate",            ctypes.c_bool),
        ("no_context",           ctypes.c_bool),
        ("print_special_tokens", ctypes.c_bool),
        ("print_progress",       ctypes.c_bool),
        ("print_realtime",       ctypes.c_bool),
        ("print_timestamps",     ctypes.c_bool),
        ("language",             ctypes.c_char_p),
        ("greedy",               ctypes.c_int * 1),
    ]

if __name__ == "__main__":
    # load library and model
    libname = pathlib.Path().absolute() / libname
    whisper = ctypes.CDLL(libname)

    # tell Python the return types of the functions
    whisper.whisper_init.restype                  = ctypes.c_void_p
    whisper.whisper_full_default_params.restype   = WhisperFullParams
    whisper.whisper_full_get_segment_text.restype = ctypes.c_char_p

    # initialize whisper.cpp context
    ctx = whisper.whisper_init(fname_model.encode("utf-8"))

    # get default whisper parameters and adjust as needed
    params = whisper.whisper_full_default_params(0)
    params.print_realtime = True
    params.print_progress = False

    # load WAV file
    samplerate, data = wavfile.read(fname_wav)

    # convert to 32-bit float
    data = data.astype('float32')/32768.0

    # run the inference
    result = whisper.whisper_full(ctypes.c_void_p(ctx), params, data.ctypes.data_as(ctypes.POINTER(ctypes.c_float)), len(data))
    if result != 0:
        print("Error: {}".format(result))
        exit(1)

    # print results from Python
    print("\nResults from Python:\n")
    n_segments = whisper.whisper_full_n_segments(ctypes.c_void_p(ctx))
    for i in range(n_segments):
        t0  = whisper.whisper_full_get_segment_t0(ctypes.c_void_p(ctx), i)
        t1  = whisper.whisper_full_get_segment_t1(ctypes.c_void_p(ctx), i)
        txt = whisper.whisper_full_get_segment_text(ctypes.c_void_p(ctx), i)

        print(f"{t0/1000.0:.3f} - {t1/1000.0:.3f} : {txt.decode('utf-8')}")

    # free the memory
    whisper.whisper_free(ctypes.c_void_p(ctx))
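
If SciPy is not available, the WAV-loading step can be approximated with the standard library. This is a sketch that assumes 16-bit PCM input and keeps only the first channel; load_wav_as_float32 is a hypothetical helper, not part of whisper.cpp.

```python
import array
import ctypes
import io
import struct
import sys
import wave


def load_wav_as_float32(source):
    """Read 16-bit PCM WAV data and return (sample_rate, ctypes float array)."""
    with wave.open(source, "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError("only 16-bit PCM is handled in this sketch")
        rate = wav.getframerate()
        channels = wav.getnchannels()
        raw = wav.readframes(wav.getnframes())

    pcm = array.array("h")
    pcm.frombytes(raw)
    if sys.byteorder == "big":
        pcm.byteswap()  # WAV data is little-endian
    mono = pcm[::channels]  # keep the first channel only

    # Scale to [-1.0, 1.0), the same convention as the script above.
    samples = (ctypes.c_float * len(mono))(*(s / 32768.0 for s in mono))
    return rate, samples


# Self-check with a tiny in-memory WAV file.
buf = io.BytesIO()
with wave.open(buf, "wb") as out:
    out.setnchannels(1)
    out.setsampwidth(2)
    out.setframerate(16000)
    out.writeframes(struct.pack("<3h", 0, 16384, -32768))
buf.seek(0)

rate, samples = load_wav_as_float32(buf)
print(rate, list(samples))
```

The returned array could be passed where the script above passes data.ctypes.data_as(...), with len(samples) as the sample count. A file path string works as the source too.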

ggerganov avatar Oct 09 '22 14:10 ggerganov

Thank you @ggerganov - really appreciate your work!

Still getting a seg fault with your code, but I'll assume it's a me problem:

Thread 1 "python" received signal SIGSEGV, Segmentation fault.
log_mel_spectrogram (samples=<optimized out>, n_samples=<optimized out>, sample_rate=<optimized out>, fft_size=<optimized out>, fft_step=<optimized out>, n_mel=80, n_threads=<optimized out>, filters=..., mel=...) at whisper.cpp:1977
1977	    mel.data.resize(mel.n_mel*mel.n_len);
(gdb) bt
#0  log_mel_spectrogram (samples=<optimized out>, n_samples=<optimized out>, sample_rate=<optimized out>, fft_size=<optimized out>, fft_step=<optimized out>, n_mel=80, n_threads=<optimized out>, filters=..., mel=...) at whisper.cpp:1977
#1  0x00007fffc28d24c7 in whisper_pcm_to_mel (ctx=0x560d7680, samples=0x7fffb3345010, n_samples=176000, n_threads=4) at whisper.cpp:2101
#2  0x00007fffc28d4113 in whisper_full (ctx=0x560d7680, params=..., samples=<optimized out>, n_samples=<optimized out>) at whisper.cpp:2316

richardburleigh avatar Oct 09 '22 14:10 richardburleigh

Got a segfault in the same place on an Intel 12th gen CPU and M1 Macbook with no changes to the above Python script. Anyone else tried it?

Were you using the same codebase as master @ggerganov ?

richardburleigh avatar Oct 10 '22 05:10 richardburleigh

Yeah, the ctx pointer wasn't being passed properly. I've updated the python script above. Give it another try - I think it should work now.
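
A likely reason for this class of crash, based on how ctypes behaves rather than on anything specific to whisper.cpp: without argtypes, a plain Python integer is passed as a C int, so a 64-bit address loses its upper bits. Wrapping the value in ctypes.c_void_p (or declaring argtypes) keeps the full address. The address below is just an example value.

```python
import ctypes

# Example 64-bit address (illustrative value only).
address = 0x7FFFC28D24C7

# What a C `int` parameter would receive: only the low bits survive.
as_int = ctypes.c_int(address).value

# What a `void *` parameter receives on a 64-bit build: the full address.
as_pointer = ctypes.c_void_p(address).value

print(hex(address), hex(as_int & 0xFFFFFFFF), hex(as_pointer))
```

Declaring argtypes for each function is an alternative to wrapping ctx on every call.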

ggerganov avatar Oct 10 '22 07:10 ggerganov

Could you possibly make a binding to the stream program as well? Would be super cool to be able to register a callback once user speech is done and silence/non-speech is detected so the final text can be processed within python. This would allow for some really cool speech assistant like hacks.
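
The callback idea can be sketched without any audio library. The example below uses a simple energy threshold to decide when speech has ended; this is an assumption for illustration and not how the stream program or any particular voice-activity detector works. All names and threshold values are hypothetical.

```python
import math


def rms(frame):
    """Root-mean-square level of one frame of float samples."""
    return math.sqrt(sum(s * s for s in frame) / len(frame)) if frame else 0.0


def watch_for_utterances(frames, on_utterance, threshold=0.02, max_silent_frames=3):
    """Call on_utterance(samples) each time speech is followed by enough silence."""
    speech, silent_run = [], 0
    for frame in frames:
        if rms(frame) >= threshold:
            speech.extend(frame)
            silent_run = 0
        elif speech:
            silent_run += 1
            if silent_run >= max_silent_frames:
                on_utterance(speech)
                speech, silent_run = [], 0
    if speech:  # flush whatever is left when the stream ends
        on_utterance(speech)


# Synthetic frames: two loud frames, a pause, then one more loud frame.
loud, quiet = [0.5] * 160, [0.0] * 160
collected = []
watch_for_utterances([quiet, loud, loud, quiet, quiet, quiet, loud], collected.append)
print([len(u) for u in collected])
```

In a real assistant the callback would hand the collected samples to the transcription call instead of appending them to a list.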

pachacamac avatar Oct 15 '22 13:10 pachacamac

Could you possibly make a binding to the stream program as well? Would be super cool to be able to register a callback once user speech is done and silence/non-speech is detected so the final text can be processed within python. This would allow for some really cool speech assistant like hacks.

You can easily modify this script to use Whisper.cpp instead of DeepSpeech.

richardburleigh avatar Oct 16 '22 02:10 richardburleigh

@pachacamac I made a hacked together fork of Buzz which uses whisper.cpp

It's buggy and thrown together, but works.

Just make sure you build the shared library as libwhisper.so and put it in the project directory. There's no install package, so you'll need to run main.py directly.

Edit: I also made a simple stand-alone script using Whisper.cpp + Auditok (to detect voices)

richardburleigh avatar Oct 16 '22 08:10 richardburleigh

Breaking changes to the C API in the latest commit: e30cf831584a9b96df51849302de8bb35c5709ee

ggerganov avatar Oct 18 '22 15:10 ggerganov

I seem to be having some trouble making a shared lib on Windows (https://github.com/ggerganov/whisper.cpp/issues/9#issuecomment-1272555209 works great on UNIX).

Using:

gcc -O3 -std=c11   -pthread -mavx -mavx2 -mfma -mf16c -fPIC -c ggml.c -o ggml.o
g++ -O3 -std=c++11 -pthread --shared -fPIC -static-libstdc++ -DWHISPER_SHARED -DWHISPER_BUILD whisper.cpp ggml.o -o libwhisper.so

And calling from Python as:

whisper_cpp = ctypes.CDLL("libwhisper.so")

# Calling any one of the functions errors
whisper_cpp.whisper_init('path/to/model.bin'.encode('utf-8'))
whisper_cpp.whisper_lang_id('en'.encode('utf-8'))

I get:

Windows fatal exception: access violation

Current thread 0x00002b30 (most recent call first):
  File "C:\Users\willi\Documents\src\buzz\whispercpp_test.py", line 17 in <module>
Windows fatal exception: access violation

Current thread 0x00002b30 (most recent call first):
  File "C:\Users\willi\Documents\src\buzz\whispercpp_test.py", line 17 in <module>
Windows fatal exception: access violation

Current thread 0x00002b30 (most recent call first):
  File "C:\Users\willi\Documents\src\buzz\whispercpp_test.py", line 17 in <module>
Windows fatal exception: access violation

...

Current thread 0x00002b30 (most recent call first):
  File "C:\Users\willi\Documents\src\buzz\whispercpp_test.py", line 17 in <module>
Windows fatal exception: stack overflow

Current thread 0x00002b30 (most recent call first):
  File "C:\Users\willi\Documents\src\buzz\whispercpp_test.py", line 17 in <module>
Windows fatal exception: access violation

Current thread 0x00002b30 (most recent call first):
  File "C:\Users\willi\Documents\src\buzz\whispercpp_test.py", line 17 in <module>
Windows fatal exception: access violation

Current thread 0x00002b30 (most recent call first):
  File "C:\Users\willi\Documents\src\buzz\whispercpp_test.py", line 17 in <module>

Ref: https://github.com/chidiwilliams/buzz/issues/131

chidiwilliams avatar Oct 29 '22 08:10 chidiwilliams

@ggerganov thanks for all your help so far. I seem to be having an issue with the Python binding (similar to the one you posted, and not on Windows this time).

import ctypes
import pathlib

from scipy.io import wavfile

class WhisperFullParams(ctypes.Structure):
    _fields_ = [
        ("strategy",             ctypes.c_int),
        ("n_threads",            ctypes.c_int),
        ("offset_ms",            ctypes.c_int),
        ("translate",            ctypes.c_bool),
        ("no_context",           ctypes.c_bool),
        ("print_special_tokens", ctypes.c_bool),
        ("print_progress",       ctypes.c_bool),
        ("print_realtime",       ctypes.c_bool),
        ("print_timestamps",     ctypes.c_bool),
        ("language",             ctypes.c_char_p),
        ("greedy",               ctypes.c_int * 1),
    ]


model_path = 'ggml-model-whisper-tiny.bin'
audio_path = './whisper.cpp/samples/jfk.wav'
libname = './whisper.cpp/libwhisper.dylib'

whisper_cpp = ctypes.CDLL(
    str(pathlib.Path().absolute() / libname))

whisper_cpp.whisper_init.restype = ctypes.c_void_p
whisper_cpp.whisper_full_default_params.restype = WhisperFullParams
whisper_cpp.whisper_full_get_segment_text.restype = ctypes.c_char_p

ctx = whisper_cpp.whisper_init(model_path.encode('utf-8'))

params = whisper_cpp.whisper_full_default_params(0)
params.print_realtime = True
params.print_progress = True


samplerate, audio = wavfile.read(audio_path)
audio = audio.astype('float32')/32768.0


result = whisper_cpp.whisper_full(
    ctypes.c_void_p(ctx), params, audio.ctypes.data_as(
        ctypes.POINTER(ctypes.c_float)), len(audio))
if result != 0:
    raise Exception(f'Error from whisper.cpp: {result}')


n_segments = whisper_cpp.whisper_full_n_segments(
    ctypes.c_void_p(ctx))
print(f'n_segments: {n_segments}')

Prints:

whisper_model_load: loading model from 'ggml-model-whisper-tiny.bin'
whisper_model_load: n_vocab       = 51865
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 384
whisper_model_load: n_audio_head  = 6
whisper_model_load: n_audio_layer = 4
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 384
whisper_model_load: n_text_head   = 6
whisper_model_load: n_text_layer  = 4
whisper_model_load: n_mels        = 80
whisper_model_load: f16           = 1
whisper_model_load: type          = 1
whisper_model_load: mem_required  = 476.00 MB
whisper_model_load: adding 1608 extra tokens
whisper_model_load: ggml ctx size =  73.58 MB
whisper_model_load: memory size =    11.41 MB
whisper_model_load: model size  =    73.54 MB
176000, length of samples
log_mel_spectrogram: n_samples = 176000, n_len = 1100
log_mel_spectrogram: recording length: 11.000000 s
length of spectrogram is less than 1s
n_segments: 0

I added an extra log line to show that whisper_full exits because the spectrogram length is reported as less than 1 s. I see the same issue with other audio files, and also when I read the audio samples using whisper.audio.load_audio.

chidiwilliams avatar Nov 03 '22 08:11 chidiwilliams

The WhisperFullParams struct has been updated since I posted, so you have to match the new struct in whisper.h. Ideally, the Python bindings should be generated automatically from the C API to avoid this kind of issue.
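
To see why a stale struct definition misbehaves, compare field offsets between two layouts. The sketch below uses an abbreviated struct, and the extra field is hypothetical; it does not reproduce the actual change made in whisper.h.

```python
import ctypes

COMMON_TAIL = [
    ("translate", ctypes.c_bool),
    ("print_realtime", ctypes.c_bool),
]


class ParamsOld(ctypes.Structure):
    # Abbreviated version of the layout used earlier in this thread.
    _fields_ = [
        ("strategy", ctypes.c_int),
        ("n_threads", ctypes.c_int),
        ("offset_ms", ctypes.c_int),
    ] + COMMON_TAIL


class ParamsNew(ctypes.Structure):
    # Same layout with one HYPOTHETICAL extra field added on the C side.
    _fields_ = [
        ("strategy", ctypes.c_int),
        ("n_threads", ctypes.c_int),
        ("offset_ms", ctypes.c_int),
        ("hypothetical_new_field", ctypes.c_int),
    ] + COMMON_TAIL


# Every field after the inserted one moves to a different byte offset.
for name in ("translate", "print_realtime"):
    old = getattr(ParamsOld, name).offset
    new = getattr(ParamsNew, name).offset
    print(f"{name}: old offset {old}, new offset {new}")
```

With the old definition on the Python side, writing params.print_realtime would touch bytes that the C side interprets as some other field, which may be enough to explain odd behavior such as an unexpectedly short spectrogram.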

ggerganov avatar Nov 03 '22 08:11 ggerganov

Of course. Thanks a lot!

chidiwilliams avatar Nov 04 '22 07:11 chidiwilliams