llama.cpp
[Enhancement] Simultaneous CLBLAS/CUBLAS instances.
Prerequisites
Please answer the following questions for yourself before submitting an issue.
- [x] I am running the latest code. Development is very rapid so there are no tagged versions as of now.
- [x] I carefully followed the README.md.
- [x] I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
- [x] I reviewed the Discussions, and have a new bug or useful enhancement to share.
Enhancement
If not already possible through a config I missed, would offloading some layers to CLBLAS and other layers to CUBLAS be viable? Or maybe offloading layers to multiple CLBLAS devices?
A common hardware config is a CPU with an IGP plus a discrete GPU, and this would allow the IGP to be utilized on systems with weak CPUs and low-VRAM dGPUs. Much more powerful 4-channel IGPs are also rumored to be in development at Intel/AMD.
With the extra transfers and possible CPU bandwidth starvation, this may or may not even improve performance much... I'm not sure.
I like the idea of this because many folks will be scraping together whatever RAM and GPU hardware they can find, old or new, matched or mismatched, to maximise VRAM, throughput and model size. Having a clear way to specify the layout would help with this, and maybe with future things like chaining across machines.
Having a clear way of specifying which layers go to which device might also help with debugging code or performance problems on different GPUs, because anyone with both could simply compare relative throughput by moving different layers of the model to different devices and running a test again.
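To make the idea of an explicit layer-to-device specification concrete, here is a minimal sketch in Python. Everything in it is hypothetical: the spec syntax (`device=first-last`, comma separated), the device names such as `cuda:0` and `opencl:0`, and the `parse_layer_map` helper are illustrative assumptions, not an existing llama.cpp option or API.

```python
from __future__ import annotations


def parse_layer_map(spec: str, n_layers: int, default: str = "cpu") -> list[str]:
    """Return a list where entry i names the device assigned to layer i.

    `spec` uses a hypothetical syntax such as "cuda:0=0-19,opencl:0=20-27".
    Layers that are not mentioned stay on `default`.
    """
    placement = [default] * n_layers
    assigned: set[int] = set()

    for part in (p.strip() for p in spec.split(",")):
        if not part:
            continue  # tolerate empty specs and trailing commas
        device, _, span = part.partition("=")
        device, span = device.strip(), span.strip()
        if not device or not span:
            raise ValueError(f"expected 'device=first-last', got {part!r}")

        # A span is either a single layer ("5") or an inclusive range ("0-19").
        first, _, last = span.partition("-")
        lo = int(first)
        hi = int(last) if last else lo
        if lo > hi or lo < 0 or hi >= n_layers:
            raise ValueError(f"layer range {span!r} is outside 0..{n_layers - 1}")

        for layer in range(lo, hi + 1):
            if layer in assigned:
                raise ValueError(f"layer {layer} is assigned more than once")
            assigned.add(layer)
            placement[layer] = device

    return placement


if __name__ == "__main__":
    # Hypothetical 32-layer model: 20 layers on the dGPU, 8 on the IGP, rest on CPU.
    for layer, device in enumerate(parse_layer_map("cuda:0=0-19,opencl:0=20-27", 32)):
        print(layer, device)
```

With a mapping like this, comparing two devices would only mean editing the spec string and rerunning the same benchmark, which is the debugging workflow described above.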
Also, while I am here, is simultaneous OpenBLAS/CUBLAS possible? I can't build with both at the same time, but it seems like OpenBLAS would be beneficial for CPU offloading unless CUBLAS is replicating that functionality.
I don't think that will work well, though. The many copies between devices will simply reduce the speed.
Hmmm, does CLBlast reduce generation speed on IGPs now?
I would think the transfers would be fine over 1 PCIe bus and to 1 IGP.
This issue was closed because it has been inactive for 14 days since being marked as stale.