
CPU + MPS Support

fakerybakery opened this issue 1 year ago

Hi! Do you know if CPU and MPS support is on the roadmap? Thanks!

fakerybakery avatar Jan 19 '24 19:01 fakerybakery

CPU could be supported through whisper.cpp/llama.cpp but we are not working on that right now. MPS should work with minimal tweaks (there may be some hardcoded “cuda” settings).

jpc avatar Jan 20 '24 17:01 jpc

Nice, thanks. Do you know how much work it would take to get WhisperSpeech working with whisper.cpp?

fakerybakery avatar Jan 20 '24 22:01 fakerybakery

Adding my vote for MPS support: I'd love to use this on Macs and iOS devices.

DePasqualeOrg avatar Jan 21 '24 17:01 DePasqualeOrg

Not sure if you can run Python on iOS w/o iSH

fakerybakery avatar Jan 21 '24 18:01 fakerybakery

@fakerybakery You can try and report back how difficult it is :)

I don't have this on my roadmap right now (I am mostly focused on improving quality and language coverage), but if someone needs this, a consulting contract is a very effective way to make sure it happens.

jpc avatar Jan 22 '24 12:01 jpc

It would be great if someone added MPS support. You can't run this on a Mac, and Macs are quite often used with LLMs now.

Grzegosz avatar Jan 26 '24 17:01 Grzegosz

> CPU could be supported through whisper.cpp/llama.cpp but we are not working on that right now. MPS should work with minimal tweaks (there may be some hardcoded "cuda" settings).

I might take this one on...but first please see my recent issue about pull requests and whether you're open to source code modifications without me using a Jupyter Notebook...unless someone wants to show me how.

Basically, I'd be considering tackling:

  1. ensuring AMD GPU acceleration on Linux via ROCm (unfortunately, PyTorch doesn't support AMD GPUs on Windows). This should involve minimal changes, since ROCm presents itself as the "cuda" device within the PyTorch framework, so it'd just be a matter of double-checking the code for minor changes.

  2. ensuring MPS support, which, again, involves minor changes (adding "mps" as a viable device within PyTorch).

  3. likely adding source-code-wide changes to use "cuda", "mps", or "cpu" as the default compute device depending on a user's system (see the sketch below).
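For illustration, a minimal sketch of what that helper might look like (the name get_compute_device and the body here are only an illustration, not code from the repository):

```python
import torch

def get_compute_device() -> str:
    """Return the best available PyTorch device: CUDA, then MPS, then CPU."""
    if torch.cuda.is_available():
        return "cuda"
    if torch.backends.mps.is_available():
        return "mps"
    return "cpu"
```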

BBC-Esq avatar Feb 03 '24 12:02 BBC-Esq

Just left a response on https://github.com/collabora/WhisperSpeech/issues/73; it would be great to have MPS support.

zoq avatar Feb 03 '24 16:02 zoq

@BBC-Esq we are using nbdev. It allows you to edit either the notebooks or the .py files and later synchronize the changes.

I am on holiday next week, but afterwards I am happy to either help you set up nbdev or, if you make a PR, merge your changes back into the notebooks.

jpc avatar Feb 03 '24 20:02 jpc

Modifying WhisperSpeech to run on the torch MPS backend was not so hard: I just replaced .cuda() with .to("mps"), added map_location='mps' to a couple of torch.load calls, and removed the 'with sdp_kernel' lines. But I hit a problem with the vocoder: MPS doesn't have real x complex GEMMs (some assert fires) and complex.out is not implemented for MPS, so I need a little bit of help here.
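In code, the tweaks being described look roughly like this (the Linear module and "model.pt" path are stand-ins, not WhisperSpeech identifiers):

```python
import torch

device = "mps"  # was hard-coded as "cuda"

model = torch.nn.Linear(4, 4)  # stand-in for an actual WhisperSpeech module

# before: model = model.cuda()
model = model.to(device)

# before: state_dict = torch.load("model.pt")
state_dict = torch.load("model.pt", map_location=device)  # remap saved CUDA tensors onto MPS
```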

akorzh avatar Feb 05 '24 15:02 akorzh

Here's the pull request I did as well: https://github.com/collabora/WhisperSpeech/pull/77 Want to work together on this? I'm not that familiar with GitHub, but I think there's a way to collaborate on a pull request?

BBC-Esq avatar Feb 05 '24 17:02 BBC-Esq

Did you get it working? I made more changes and still wasn't able to run the inference example notebook. BTW, all those .py files are generated from notebooks, so you need to modify those as well.

akorzh avatar Feb 05 '24 17:02 akorzh

No, the pull request was simply to show an example of choosing between "cuda", "mps", or "cpu" based on the get_compute_device function within utils.py. I was hoping to get feedback on that approach in general (a function that dynamically determines the compute device) before modifying the other scripts. Multiple other scripts will need to be modified to set the appropriate compute device dynamically if the developer approves this approach, basically.

Also, now we're aware of the issue you raised regarding the vocoder above. I was hoping to get the "go ahead" beforehand, basically. If you want to work on this together, I'm assuming we'd work on the branch I created (the one the pull request came from)? Kind of new to GitHub...

BBC-Esq avatar Feb 05 '24 17:02 BBC-Esq

@jpc What did you think of the draft pull request? Am I on the right track, and do you want me to work on modifying the other scripts as well?

BBC-Esq avatar Feb 05 '24 17:02 BBC-Esq

Regarding Vocos and MPS, maybe it would be worth raising an issue on their GitHub and seeing what the author says? I was using this model as-is, so I am unfortunately not familiar with its internals.

If that does not help, I can try looking into it next week.

jpc avatar Feb 05 '24 18:02 jpc

The sdp_kernel is kind of important for performance on CUDA, so we'd have to figure out how to make it transparent for MPS. Maybe make a new context manager that wraps the one from PyTorch?
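A rough sketch of that wrapper idea, assuming the torch.backends.cuda.sdp_kernel context manager from PyTorch 2.x (untested):

```python
from contextlib import contextmanager

import torch

@contextmanager
def sdp_kernel_if_cuda(device: str, **flags):
    """Apply sdp_kernel flags on CUDA; act as a no-op on MPS and CPU."""
    if device == "cuda":
        with torch.backends.cuda.sdp_kernel(**flags):
            yield
    else:
        yield
```

Call sites could then use `with sdp_kernel_if_cuda(device, enable_flash=True):` unconditionally on every backend.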

jpc avatar Feb 05 '24 18:02 jpc

I'll do what I can on the draft pull request, but others will likely have to help since I don't have macOS to test on... I can at least get the overall framework in place for dynamically choosing the compute device across all scripts...

BBC-Esq avatar Feb 05 '24 18:02 BBC-Esq

OK, I got it to work on a Mac, but I had to move the vocoder and encoder to the CPU. MPS lacks support for these operators: "The operator 'aten::complex.out' is not currently implemented for the MPS device." "The operator 'aten::_fft_r2c' is not currently implemented for the MPS device."
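Schematically, the workaround is to pin the FFT-heavy module to the CPU while everything else stays on MPS (vocoder and features are placeholders here):

```python
# Pin the vocoder to the CPU: MPS lacks aten::complex.out and aten::_fft_r2c.
vocoder = vocoder.to("cpu")
audio = vocoder(features.to("cpu"))  # the FFT-heavy ops execute on the CPU
```

Setting the PYTORCH_ENABLE_MPS_FALLBACK=1 environment variable before importing torch is another option; it makes PyTorch fall back to the CPU automatically for MPS ops that are not implemented.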

akorzh avatar Feb 06 '24 03:02 akorzh

Excellent, so we've whittled it down. Can you send a screenshot of trying to run it on MPS anyway? That way I can see what the error says and try to troubleshoot. But with my revised scripts (i.e., the draft pull request), MPS works for everything except the vocoder? Thanks.

BBC-Esq avatar Feb 06 '24 04:02 BBC-Esq

I was able to find this: https://qqaatw.dev/pytorch-mps-ops-coverage/ but I couldn't find fft_r2c on there.

BBC-Esq avatar Feb 06 '24 14:02 BBC-Esq

Sorry, I didn't use your pull request, just some hacked-together code (which is quite similar, but changed in more places). I figured I needed to have something working first. Haven't you tried running on MPS yourself? I posted a couple of requests to https://github.com/pytorch/pytorch/issues/77764

akorzh avatar Feb 06 '24 15:02 akorzh

Unfortunately I don't have an Apple computer... nor Linux, for that matter. That's an extreme challenge when trying to write code that works across all three platforms, for sure. I was able to find these links, however:

https://github.com/pytorch/pytorch/pull/116630 https://developer.apple.com/documentation/metal/metal_sample_code_library/customizing_a_pytorch_operation https://github.com/neuraloperator/neuraloperator

Not sure if they'll help.

My draft pull request has all the basic infrastructure there, though. I suppose we could modify it so that only the vocoder is excluded from being loaded on MPS, but I'd like the repository owner to confirm what you've said so we know for certain, ya know?

BBC-Esq avatar Feb 06 '24 15:02 BBC-Esq

I was thinking about writing to the Vocos author, since I believe the offending operations can sometimes be changed to something a little bit different that works out of the box on MPS.
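As an illustration of that kind of rewrite (a sketch, not Vocos's actual code), a complex multiply can be expressed with purely real tensors, sidestepping aten::complex.out entirely:

```python
import torch

def complex_mul(ar: torch.Tensor, ai: torch.Tensor,
                br: torch.Tensor, bi: torch.Tensor):
    """(ar + i*ai) * (br + i*bi) using only real-valued tensor ops."""
    return ar * br - ai * bi, ar * bi + ai * br
```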

jpc avatar Feb 06 '24 18:02 jpc

Do it! @akorzh, do you have the script you used? It might help me troubleshoot.

BBC-Esq avatar Feb 07 '24 15:02 BBC-Esq

@jpc A few possible workarounds if we can't find a way to get Vocos working on MPS out of the box...

  1. Manually implement the GEMMs or the specific FFT operations using MPS primitives.

  2. Decompose the unsupported operations into smaller, supported operations.

  3. Add a context manager that automatically moves operations between CPU and MPS as appropriate, so that as much as possible runs on MPS (rough sketch after this list).

  4. Write custom kernels in the Metal Shading Language and invoke them from Python with PyObjC.

  5. Evaluate how MPS Graph within Core ML might help.

  6. Possibly use SYCL and DPC++ to write code that is portable across different GPU architectures; although primarily designed for CUDA and OpenCL, they could potentially be adapted to generate MSL code that runs on MPS through an abstraction layer.

  7. Use OpenCL/OpenGL as a fallback instead of falling back to the CPU.
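For option 3, a rough, illustrative sketch of the fallback idea (the function name and error handling are assumptions):

```python
import torch

def run_with_fallback(module: torch.nn.Module, x: torch.Tensor,
                      device: str = "mps", fallback: str = "cpu"):
    """Try a module on `device`; retry on `fallback` if a kernel is missing."""
    try:
        return module.to(device)(x.to(device))
    except (NotImplementedError, RuntimeError):
        return module.to(fallback)(x.to(fallback))
```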

Thoughts anyone?

BBC-Esq avatar Feb 07 '24 15:02 BBC-Esq

Another option might be to use Vulkan. Llama.cpp just implemented a Vulkan backend; one version came from GPT4All and another from another contributor (I forget his name). This would also allow GPU acceleration with AMD GPUs on Windows and, according to the following link, on macOS as well:

https://github.com/KhronosGroup/MoltenVK

BBC-Esq avatar Feb 07 '24 16:02 BBC-Esq

https://github.com/KhronosGroup/MoltenVK/issues/2154

BBC-Esq avatar Feb 07 '24 16:02 BBC-Esq

@jpc and @akorzh I think I may have found a solution: MLX for macOS. Here are the FFT operations it supports, and the project links:

https://ml-explore.github.io/mlx/build/html/python/fft.html https://github.com/ml-explore/mlx

Take it with a grain of salt, but according to GPT-4, MLX's mlx.core.fft.rfft matches the real-to-complex FFT that MPS is missing, so there might already be an option optimized for Apple. I leave it to your expertise. See also here for more detail:

https://ml-explore.github.io/mlx/build/html/python/_autosummary/mlx.core.fft.rfft.html#mlx.core.fft.rfft

I also ran the PyTorch description of aten::_fft_r2c through GPT, and it says they're the same:

https://pytorch.org/cppdocs/api/function_namespaceat_1aaea819b1367e99c6ef062ac8335edba2.html
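A quick way to sanity-check that claim on a Mac with mlx installed (an untested sketch):

```python
import numpy as np
import torch
import mlx.core as mx

x = np.random.randn(1024).astype(np.float32)

# PyTorch's real-to-complex FFT (aten::_fft_r2c under the hood)
ref = torch.fft.rfft(torch.from_numpy(x)).numpy()

# MLX's counterpart, which runs natively on Apple silicon
out = np.array(mx.fft.rfft(mx.array(x)))

print(np.allclose(ref, out, atol=1e-4))
```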

BBC-Esq avatar Feb 07 '24 23:02 BBC-Esq

Hey crew, I spent a few hours last night and today working on both CPU and MPS updates to this codebase. I ran into the same results as @akorzh, except that I didn't get it to run: attempting to keep everything on the CPU, I hit the "addmm_impl_cpu_" not implemented for 'Half' message inside the MultiHeadAttention.forward call. Perhaps it has to do with my environment running PyTorch version 2.1.1 at the time of testing.
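That error usually means float16 weights are being run on the CPU, where many ops (addmm included) have no Half kernels; a common workaround is to cast to float32 whenever the device isn't CUDA (device and model are placeholders):

```python
import torch

# Use half precision only on CUDA; many CPU ops have no fp16 kernels.
dtype = torch.float16 if device == "cuda" else torch.float32
model = model.to(device=device, dtype=dtype)
```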

I spent time with the [sdp_kernel](https://github.com/collabora/WhisperSpeech/blob/80b268b74900b2f7ca7a36a3c789607a3f4cd912/whisperspeech/s2a_delar_mup_wds_mlang.py#L500) line without finding a solution yet. To my understanding, PyTorch hasn't implemented Flash Attention for MPS, but there is an implementation at https://github.com/philipturner/metal-flash-attention.

Moving past that, I think if we can use functions from Vulkan or the MLX library, like @BBC-Esq pointed out, that would be best. I've not worked with these projects yet, so a lot is unfamiliar.

signalprime avatar Feb 12 '24 00:02 signalprime

patch.txt: here is my patch, which works on Mac (it runs on MPS, with the CPU for the rest).

akorzh avatar Feb 12 '24 00:02 akorzh