
OpenCL seems to almost work

Open Happenedtostumblein opened this issue 9 months ago • 12 comments

@leejet @Green-Sky @ggerganov

I do not know C++ and do not have a solid grasp of how ggml works, but building the repo with `cmake -DGGML_CLBLAST=ON` seems to work: GPU utilization goes up and it's very fast (10s vs. 80s per step on a higher-end CPU). It completes all the steps and finishes sampling too, but then crashes at line 1505 of ggml-opencl.

If it is just a matter of spending time to make this work, is it simple enough for one of you to explain what needs to be done? If so, I'd be happy to give it a shot, but I don't know where to start.

My limited understanding is that sampling is what takes all the effort, so is there a way to maybe switch from GPU to CPU to save the file? Or am I missing some context/knowledge?

Edit: Fixed typo. The flag used is CLBlast, not OpenBLAS.
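For anyone trying to reproduce this, the build sequence was roughly the following (a sketch assuming a fresh checkout with submodules; only the `-DGGML_CLBLAST=ON` flag is from this thread, the rest are standard CMake steps):

```sh
# out-of-tree CMake build with the CLBlast-backed OpenCL path enabled
mkdir -p build && cd build
cmake .. -DGGML_CLBLAST=ON
cmake --build . --config Release
```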

Happenedtostumblein avatar Sep 05 '23 18:09 Happenedtostumblein

Try this patch: https://github.com/ggerganov/llama.cpp/commit/6460f758dbd472653296044d36bed8c4554988f5

ggerganov avatar Sep 05 '23 18:09 ggerganov

@ggerganov That worked, thank you!

Is it proper protocol to submit a pull request for a one-liner?

Edit: FYI: It allows the entire process to complete, but does not actually make use of the GPU.

Happenedtostumblein avatar Sep 05 '23 18:09 Happenedtostumblein

FYI: It does work, but GPU utilization is very low. Got any more simple speedups in your pocket? @ggerganov

Happenedtostumblein avatar Sep 05 '23 19:09 Happenedtostumblein

I'm sorry to disappoint you, but OpenBLAS doesn't use the GPU to accelerate the processing; it uses the CPU itself. If anything you should try `-DGGML_CLBLAST=ON` in order to use OpenCL, but it still wouldn't work, since the developer hasn't integrated any GPU acceleration into the program yet.

daniandtheweb avatar Sep 05 '23 22:09 daniandtheweb

@DaniAndTheWeb Thanks for pointing that out — it was a typo, and the CLBlast flag is what I was referring to.

How difficult/time-intensive a task would it be to incorporate OpenCL? With that flag, the GPU does get some kind of signal, because utilization increases.

Just wondering if it's a very involved process, or if we just need to copy/paste something from llama.cpp and/or ggml?

Happenedtostumblein avatar Sep 05 '23 23:09 Happenedtostumblein

I'm no expert in OpenCL, but it will require some time; it's not just a copy/paste. The good news is that, given the current RAM usage, the GPU acceleration will probably be one of the more memory-efficient implementations.

daniandtheweb avatar Sep 05 '23 23:09 daniandtheweb

@DaniAndTheWeb Can you tell me broadly speaking what tasks need to be completed, like I’m a 5?

Maybe CodeLlama can help me contribute a pull request to get it done, but I need a thread to grab onto. (Not sure if tagging is necessary; I'm new to GitHub.)

Happenedtostumblein avatar Sep 05 '23 23:09 Happenedtostumblein

As I said, I don't know a lot about how the OpenCL implementation works, but you would probably have to implement each computing kernel of the stock CPU code in OpenCL. You can take a look at llama.cpp's implementation, but you will need to make lots of tweaks to the code to make it work with this project.

daniandtheweb avatar Sep 06 '23 13:09 daniandtheweb

No problem, hold my beer.

<<only really knows python>>

Happenedtostumblein avatar Sep 06 '23 14:09 Happenedtostumblein

Try this patch: https://github.com/ggerganov/llama.cpp/commit/6460f758dbd472653296044d36bed8c4554988f5

I can confirm that it really works!

FNsi avatar Sep 24 '23 00:09 FNsi


Using OpenCL on Android, it actually gets slower. What device are you using?

rayrayraykk avatar Nov 10 '23 02:11 rayrayraykk

I applied the patch and then added some `#ifdef SD_USE_CLBLAST` / `#include "ggml-opencl.h"` guards, edited the CMakeLists file with bits from llama.cpp's CLBlast setup ported over and renamed/re-pointed, then configured with `cmake .. -DGGML_OPENBLAS=ON -DGGML_CLBLAST=ON`. Now the compiled `./sd` recognizes my AMD RX 580 GPU and I get about a 30% speed-up. Not a huge increase, since that's the same number of CPU threads plus the GPU, but my GPU is pretty old too. And it does seem to take some load off the CPU, which is nice. Thanks!

superkuh avatar Dec 26 '23 04:12 superkuh