WhisperSpeech
CPU + MPS Support
Hi! Do you know if CPU and MPS support is on the roadmap? Thanks!
CPU could be supported through whisper.cpp/llama.cpp but we are not working on that right now. MPS should work with minimal tweaks (there may be some hardcoded “cuda” settings).
Nice, thanks. Do you know how much work it would take to get WhisperSpeech working with whisper.cpp?
Adding my vote for MPS support: I'd love to use this on Macs and iOS devices.
Not sure if you can run Python on iOS w/o iSH
@fakerybakery You can try and report back how difficult it is :)
I don't have this on my roadmap right now (I am mostly focused on improving quality and language coverage) but, if someone needs this, a consulting contract is a very effective way to make sure it happens.
Would be great if someone added MPS support. Can't run this on a Mac, and Macs are quite often used with LLMs now.
I might take this one on, but first please see my recent issue about pull requests and whether you're open to source code modifications without me using a Jupyter Notebook (unless someone wants to show me how).
Basically, I'd be considering tackling:
- ensuring AMD GPU acceleration on Linux via ROCm (unfortunately, PyTorch doesn't support AMD GPUs on Windows). This should involve minimal changes, since ROCm uses the "cuda" device within the PyTorch framework, so it would just be a matter of double-checking the code for minor changes.
- ensuring MPS support, which, again, involves minor changes (adding "mps" as a viable device within PyTorch).
- likely adding source-code-wide changes to use "cuda", "mps", or "cpu" as the default compute device depending on a user's system.
Just left a response on https://github.com/collabora/WhisperSpeech/issues/73 as well; it would be great to have MPS support.
@BBC-Esq We are using nbdev. It allows you to edit either the notebooks or the .py files and later synchronize the changes.
I am on holiday next week, but afterwards I am happy to either help you set up nbdev or, if you make a PR, merge your changes back into the notebooks.
Modifying WhisperSpeech to run on the torch MPS backend was not so hard: I just replaced .cuda() with .to("mps"), added map_location='mps' to a couple of torch.load calls, and removed the 'with sdp_kernel' lines. But I hit a problem with the vocoder: MPS doesn't have real x complex GEMMs (some assert fires) and no complex.out is implemented for MPS, so I need a little bit of help here.
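Concretely, the substitutions amount to something like this minimal sketch (the Linear layer and checkpoint filename are placeholders, not actual WhisperSpeech identifiers):

```python
# Sketch of the MPS porting pattern described above; the Linear layer
# and "checkpoint.pt" are illustrative stand-ins, not WhisperSpeech code.
import torch

device = "mps" if torch.backends.mps.is_available() else "cpu"

model = torch.nn.Linear(4, 4)
model = model.to(device)  # before: model = model.cuda()

torch.save(model.state_dict(), "checkpoint.pt")
# before: state = torch.load("checkpoint.pt")
state = torch.load("checkpoint.pt", map_location=device)
model.load_state_dict(state)
```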
Here's the pull request I did as well. Want to work together on this? https://github.com/collabora/WhisperSpeech/pull/77 I'm not that familiar with GitHub, but I think there's a way to collaborate on a pull request?
Did you get it working? I made more changes and still wasn't able to run the inference example notebook. BTW, all those .py files are generated from notebooks, so the notebooks need to be modified as well.
No, the pull request was simply to show an example of choosing between "cuda," "mps," or "cpu" via the get_compute_device function within utils.py. I was hoping to get feedback on that approach in general (a function that dynamically determines the compute device) before modifying the other scripts. Basically, multiple other scripts will need to be modified to set the compute device dynamically if the developer approves this approach.
Also, now we're aware of the issue you raised regarding the vocoder above. I was hoping to get the "go ahead" beforehand, basically. If you want to work on this together, I'm assuming we'd work on the branch I created (the one the pull request came from)? Kind of new to GitHub...
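For reference, the idea is a helper along these lines; this is a sketch of the approach, not the exact code from the pull request:

```python
# Illustrative sketch of a dynamic device chooser like the
# get_compute_device() in the draft PR; the actual PR code may differ.
import torch

def get_compute_device() -> str:
    if torch.cuda.is_available():          # NVIDIA (and ROCm) GPUs
        return "cuda"
    if torch.backends.mps.is_available():  # Apple Silicon
        return "mps"
    return "cpu"
```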
@jpc What did you think of the draft pull request? Am I on the right track, and do you want me to work on modifying the other scripts as well?
Regarding Vocos and MPS, maybe it would be worth raising an issue on their GitHub to see what the author says? I was using this model as-is, so I am unfortunately not familiar with its internals.
If this does not help I can try looking into this next week.
The sdp_kernel calls are kind of important for performance on CUDA, so we'd have to figure out how to make them transparent for MPS. Maybe make a new context manager that wraps the one from PyTorch?
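One possible shape for that wrapper, assuming the flags mirror torch.backends.cuda.sdp_kernel from PyTorch 2.x (a sketch, not tested on MPS):

```python
# Delegate to PyTorch's CUDA-only sdp_kernel context manager on CUDA
# and fall back to a no-op elsewhere, so call sites stay unchanged.
from contextlib import contextmanager, nullcontext
import torch

@contextmanager
def sdp_kernel(device_type: str, **flags):
    if device_type == "cuda":
        ctx = torch.backends.cuda.sdp_kernel(**flags)
    else:
        ctx = nullcontext()  # transparent on MPS/CPU
    with ctx:
        yield
```

Call sites could then keep `with sdp_kernel(device_type, enable_flash=True): ...` unchanged on every backend.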
I'll do what I can on the draft pull request, but others will likely have to help since I don't have a Mac to test on... I can at least get the overall framework there in terms of dynamically choosing the compute device across all scripts...
OK, I got it to work on Mac, but I had to move the vocoder and encoder to the CPU. MPS lacks support for these operators: 'aten::complex.out' is not currently implemented for the MPS device, and 'aten::_fft_r2c' is not currently implemented for the MPS device.
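The split looks roughly like this; the Identity module is a placeholder for the real Vocos vocoder:

```python
# Keep most of the pipeline on MPS but run the FFT-heavy vocoder on the
# CPU, moving tensors across at the boundary. Placeholder modules only.
import torch

device = "mps" if torch.backends.mps.is_available() else "cpu"

vocoder = torch.nn.Identity().to("cpu")  # stand-in for the Vocos vocoder
features = torch.randn(1, 16, device=device)

audio = vocoder(features.to("cpu")).to(device)  # CPU round-trip
```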
Excellent, so we've whittled it down. Can you send a screenshot of trying to put it on MPS anyway? That way I can see what the error says and try to troubleshoot. But with my revised scripts (i.e., the draft pull request), MPS works for everything except the vocoder? Thanks.
I was able to find this: https://qqaatw.dev/pytorch-mps-ops-coverage/. I couldn't find fft_r2c on there, though.
Sorry, I didn't use your pull request, just some hacked-together code (which is quite similar, but in more places). I thought we needed to have something working first. Haven't you tried running on MPS yourself? I posted a couple of requests to https://github.com/pytorch/pytorch/issues/77764.
Unfortunately I don't have an Apple computer, nor Linux for that matter. That's an extreme challenge when trying to write code that works on all three platforms, for sure. I was able to find these links, however:
https://github.com/pytorch/pytorch/pull/116630
https://developer.apple.com/documentation/metal/metal_sample_code_library/customizing_a_pytorch_operation
https://github.com/neuraloperator/neuraloperator
Not sure if they'll help.
My draft pull request has all the basic infrastructure there, though. I suppose we could modify it to exclude the vocoder from being loaded on MPS alone, but I'd like to hear back from the repository owner so he can confirm what you've said and we know for certain, ya know?
I was thinking about writing to the Vocos author, since I believe the offending operations can sometimes be changed to something slightly different that works out of the box on MPS.
Do it! @akorzh, do you have the script you used? It might help me troubleshoot.
@jpc A few possible workarounds if we can't find a way to get Vocos working on MPS out of the box...
- Manually implement the GEMMs or the specific FFT operations using MPS primitives.
- Decompose the unsupported operations into smaller supported operations.
- Use a context manager to automatically move operations to CPU/MPS when appropriate, so that as much as possible runs on MPS (see the note after this list).
- Write custom kernels in the Metal Shading Language and invoke them from Python with PyObjC.
- Evaluate how MPS Graph within Core ML might help.
- Possibly use SYCL and DPC++ to write code that is portable across different GPU architectures, potentially targeting Metal through an abstraction layer. Although primarily designed for CUDA and OpenCL, they could potentially be adapted to generate MSL code that runs on MPS.
- Use OpenCL/GL instead of MPS as a fallback, rather than falling back to the CPU.
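On the CPU-fallback option specifically: PyTorch already ships an opt-in (prototype) global fallback worth testing first. If the PYTORCH_ENABLE_MPS_FALLBACK environment variable is set before torch is imported, any operator not implemented for MPS runs on the CPU instead of raising, at the cost of transfer overhead and a warning per op:

```python
# Enable PyTorch's built-in CPU fallback for unimplemented MPS ops.
# The variable must be set before torch is imported.
import os
os.environ["PYTORCH_ENABLE_MPS_FALLBACK"] = "1"

import torch  # imported after the flag is set
```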
Thoughts anyone?
Another option might be to use Vulkan. Llama.cpp just implemented a Vulkan backend, one version from GPT4All and another from another contributor (I forget his name). This would also allow GPU acceleration with AMD GPUs on Windows and, according to the following links, on macOS as well:
https://github.com/KhronosGroup/MoltenVK
https://github.com/KhronosGroup/MoltenVK/issues/2154
@jpc and @akorzh I think I may have found a solution: MLX for macOS? Here are the links to the operations it supports:
https://ml-explore.github.io/mlx/build/html/python/fft.html
https://github.com/ml-explore/mlx
Take it with a grain of salt, but here's what GPT-4 says... so there might be an option optimized for Apple already. I leave it to your expertise.
See also here for more detail:
https://ml-explore.github.io/mlx/build/html/python/_autosummary/mlx.core.fft.rfft.html#mlx.core.fft.rfft
GPT-4 says they're the same... I also ran the PyTorch description of the operator through GPT; it's here:
https://pytorch.org/cppdocs/api/function_namespaceat_1aaea819b1367e99c6ef062ac8335edba2.html
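For anyone with Apple Silicon who wants to sanity-check this rather than trust GPT-4, a quick comparison of the two rfft implementations might look like the following (assumes pip install mlx):

```python
# Compare mlx.core.fft.rfft against torch.fft.rfft on the same signal;
# both compute the one-sided real-to-complex FFT that the missing
# aten::_fft_r2c operator would provide on MPS.
import numpy as np
import torch
import mlx.core as mx

x = np.random.randn(16).astype(np.float32)

out_torch = torch.fft.rfft(torch.from_numpy(x)).numpy()
out_mlx = np.array(mx.fft.rfft(mx.array(x)))

print(np.allclose(out_torch, out_mlx, atol=1e-5))  # expect True
```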
Hey crew, I spent a few hours last night and today working on both CPU and MPS updates to this codebase. I ran into the same results as @akorzh, except that I didn't get it to run. Instead, attempting to keep everything on the CPU, I ran into the "addmm_impl_cpu_" not implemented for 'Half' message inside the MultiHeadAttention.forward call. Perhaps it has to do with my environment running PyTorch version 2.1.1 at the time of testing.
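For what it's worth, that error usually means fp16 tensors reached a CPU matmul (CPU half-precision kernels are largely missing), and the common workaround is to promote the weights to float32 when not on a GPU backend; a sketch:

```python
# Workaround for "addmm_impl_cpu_" not implemented for 'Half': keep
# fp16 on CUDA but promote to fp32 on CPU. The Linear is a stand-in
# for an fp16 checkpoint, not actual WhisperSpeech code.
import torch

model = torch.nn.Linear(4, 4).half()

if not torch.cuda.is_available():
    model = model.float()  # CPU matmuls need fp32
```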
I spent time with the [sdp_kernel](https://github.com/collabora/WhisperSpeech/blob/80b268b74900b2f7ca7a36a3c789607a3f4cd912/whisperspeech/s2a_delar_mup_wds_mlang.py#L500) line without a solution yet. To my understanding, PyTorch hasn't implemented Flash Attention for MPS, but there is an implementation at https://github.com/philipturner/metal-flash-attention.
Moving past that, I think if we can use functions from the Vulkan or MLX libraries, as @BBC-Esq pointed out, that would be best. I've not worked with these projects yet, so a lot is unfamiliar.
patch.txt Here is my patch, which works on Mac (it runs on MPS, with the CPU for the rest).