llama.cpp
Can we finetune existing models via the ideas of QLoRA
QLoRA introduces the idea that we can still use quantized weights together with LoRA to fine-tune a model. Since the backward computation is mostly implemented already, maybe we can look at the following (a sketch of the core computation follows the list):
- Evaluate whether we need double quantization to further reduce VRAM usage at the cost of speed.
- Implement LoRA fine-tuning inside llama.cpp, or as a standalone application?
- Add GPU offload support for gradient computation.
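For anyone new to the idea: the QLoRA forward pass is just the frozen quantized base weight plus a small trainable low-rank update, y = dequant(Wq)·x + (α/r)·B·A·x, where only A and B receive gradients. Below is a minimal plain-C sketch of that computation, not ggml code: the block format is only loosely modeled on ggml's Q4_0 (which actually stores an fp16 scale), and all names here are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

// Hypothetical 4-bit block quantization, loosely modeled on ggml's Q4_0:
// 32 weights per block, one fp32 scale, two 4-bit quants packed per byte.
#define QBLOCK 32
typedef struct {
    float   d;              // per-block scale
    uint8_t qs[QBLOCK / 2]; // packed 4-bit quants
} q4_block;

// Dequantize one block into 32 floats.
static void dequant_block(const q4_block * b, float * out) {
    for (int i = 0; i < QBLOCK / 2; ++i) {
        const int lo = (b->qs[i] & 0x0F) - 8;
        const int hi = (b->qs[i] >> 4)   - 8;
        out[2*i + 0] = lo * b->d;
        out[2*i + 1] = hi * b->d;
    }
}

// y = dequant(Wq) x + (alpha/r) * B (A x)
// Wq: n_out x n_in quantized base weight (frozen; n_in divisible by QBLOCK)
// A : r x n_in, B: n_out x r, both fp32 (the only trainable tensors)
void lora_forward(const q4_block * Wq, const float * A, const float * B,
                  const float * x, float * y,
                  int n_in, int n_out, int r, float alpha) {
    // base path: y = dequant(Wq) x, block by block along the input dim
    for (int i = 0; i < n_out; ++i) {
        const q4_block * row = Wq + (size_t)i * (n_in / QBLOCK);
        float tmp[QBLOCK];
        float acc = 0.0f;
        for (int jb = 0; jb < n_in / QBLOCK; ++jb) {
            dequant_block(&row[jb], tmp);
            for (int j = 0; j < QBLOCK; ++j) {
                acc += tmp[j] * x[jb*QBLOCK + j];
            }
        }
        y[i] = acc;
    }
    // low-rank update: y += (alpha/r) * B (A x)
    float ax[64]; // assumes r <= 64 for this sketch
    for (int k = 0; k < r; ++k) {
        float acc = 0.0f;
        for (int j = 0; j < n_in; ++j) acc += A[(size_t)k*n_in + j] * x[j];
        ax[k] = acc;
    }
    const float scale = alpha / (float)r;
    for (int i = 0; i < n_out; ++i) {
        float acc = 0.0f;
        for (int k = 0; k < r; ++k) acc += B[(size_t)i*r + k] * ax[k];
        y[i] += scale * acc;
    }
}
```

Since the base weight never receives gradients, the quantized matrix only ever needs to be dequantized on the fly for the forward and backward matmuls, which is why VRAM usage stays close to inference levels.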
I would also like to see this.
In my company we often have confidential data under export control, so the cloud is not an option.
Getting a server with a decent GPU into a data centre close to the data could take weeks, or be blocked outright.
So GGML with QLoRA running on our dev machines would give us the opportunity to build proofs of concept.
I think we should demonstrate full-parameter training as well as fine-tuning on a Mac Studio with an M2 Ultra. It offers a lot of unified RAM (up to 192 GB), and now with Metal support in ggml we should get reasonable performance.
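For a rough sense of whether that is feasible (a back-of-envelope estimate, assuming a 7B-parameter model trained with vanilla Adam): fp16 weights take about 14 GB, fp32 gradients about 28 GB, and the two fp32 Adam moment buffers about 56 GB, so roughly 98 GB before activations and the KV cache. That would plausibly fit in 192 GB of unified memory for a 7B model, while a 13B model would already be tight without a memory-saving optimizer.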
I am reading this paper: https://arxiv.org/abs/2306.09782 Full Parameter Fine-tuning for Large Language Models with Limited Resources
It proposes a new optimizer which saves a lot of memory compared to SGD/Adam.
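For context, the paper's optimizer is LOMO, which fuses the gradient computation with the parameter update so that at most one tensor's gradient is alive at a time and no Adam moment buffers are needed. Below is a minimal plain-C sketch of that idea only, not the paper's implementation; `param_t`, `backward_fn`, and `lomo_step` are hypothetical names, and the paper's gradient normalization and clipping are omitted.

```c
#include <stdlib.h>

// Hypothetical parameter tensor: `n` fp32 weights.
typedef struct {
    float * w; // parameters
    size_t  n; // element count
} param_t;

// Hypothetical callback that computes the gradient of the loss w.r.t.
// one parameter tensor during the backward pass and writes it to `grad`.
typedef void (*backward_fn)(const param_t * p, float * grad, void * ctx);

// LOMO-style fused backward-and-update: apply the SGD step to each tensor
// as soon as its gradient is available, then reuse the buffer, instead of
// first materializing gradients for the whole model plus Adam moments.
void lomo_step(param_t * params, int n_tensors,
               backward_fn backward, void * ctx, float lr) {
    // one scratch buffer sized for the largest tensor
    size_t max_n = 0;
    for (int t = 0; t < n_tensors; ++t)
        if (params[t].n > max_n) max_n = params[t].n;
    float * grad = malloc(max_n * sizeof(float));

    // walk tensors in reverse (backward-pass order)
    for (int t = n_tensors - 1; t >= 0; --t) {
        backward(&params[t], grad, ctx);      // grad for this tensor only
        for (size_t i = 0; i < params[t].n; ++i)
            params[t].w[i] -= lr * grad[i];   // immediate in-place SGD step
        // the grad buffer is reused for the next tensor, so peak extra
        // memory is one tensor's gradient, not the whole model's
    }
    free(grad);
}
```

Against the estimate above, this would drop the 28 GB of gradients and 56 GB of Adam moments to a single per-tensor gradient buffer, which is what makes full-parameter fine-tuning plausible on a single machine.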
I've got a Mac Studio with an M2 Ultra and 192 GB as of yesterday, and I'm very interested in this topic.
The debate between fine-tuning on internal documents vs. searching a vector database and feeding documents into the prompt is a key question on several projects for us, and having a feasible way to fine-tune on this machine for testing would be amazing.
Let me know how I can be helpful; I'm happy to run test workloads against this.
This issue was closed because it has been inactive for 14 days since being marked as stale.