gpt4all
How to train the model with my own files
I am new to LLMs and trying to figure out how to train the model with a bunch of files.
I want to train the model with my files (living in a folder on my laptop) and then be able to use the model to ask questions and get answers.
With OpenAI, folks have suggested using their Embeddings API, which creates chunks of vectors and then has the model work on those. Would a similar approach work here? Given that I have the model locally, I was hoping I wouldn't need to use the OpenAI Embeddings API and could instead train the model locally.
Any help much appreciated. Thanks!
My understanding is that embeddings and retraining (fine-tuning) are different. If you just want extra info, you can embed; if you want new knowledge or style, you probably need to fine-tune.
See https://www.youtube.com/watch?v=8ZW1E017VEc&t=2112s
Thanks! Looks like for normal use cases, embeddings are the way to go. I have used LangChain to create embeddings with OpenAI. However, as I mentioned before, in that scenario you talk to the OpenAI Embeddings API to create the embeddings. Do you know of any local Python libraries that create embeddings?
I saw a few posts about it, e.g. https://github.com/nomic-ai/gpt4all/issues/173#issuecomment-1496681937 - my understanding is you can use gpt4all with LangChain (https://python.langchain.com/en/latest/modules/models/llms/integrations/gpt4all.html) and use indexes as described in https://python.langchain.com/en/latest/modules/indexes/getting_started.html
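To make that concrete, here is a rough sketch of the local pipeline those docs describe: a local embedding model, a local vector store, and the GPT4All wrapper in LangChain. The file path, model filename, and chunking parameters are placeholders, and LangChain's module layout has changed between releases, so treat this as a starting point rather than exact code:

```python
# Rough sketch: embed local files with a local model, then ask GPT4All questions over them.
# Paths, model filenames, and chunk sizes are placeholders; adjust them to your setup
# and to the LangChain version you have installed.
from langchain.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings   # runs sentence-transformers locally
from langchain.vectorstores import Chroma
from langchain.llms import GPT4All
from langchain.chains import RetrievalQA

# 1. Load and chunk a local file (a single .txt here for brevity).
docs = TextLoader("my_notes.txt").load()
chunks = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50).split_documents(docs)

# 2. Create the embeddings locally - no OpenAI call involved at any step.
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
db = Chroma.from_documents(chunks, embeddings)

# 3. Point the LangChain GPT4All wrapper at a locally downloaded model file.
llm = GPT4All(model="./models/ggml-gpt4all-j-v1.3-groovy.bin")

# 4. Retrieval-augmented QA: relevant chunks are retrieved and stuffed into the prompt.
qa = RetrievalQA.from_chain_type(llm=llm, retriever=db.as_retriever())
print(qa.run("What do my notes say about project deadlines?"))
```

The key point is that the embedding step runs a sentence-transformers model on your own machine, so nothing leaves your laptop.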
I personally have to retrain, since I need a new language - not even sure that's possible yet.
I have the same problem. I need to train gpt4all with the BWB dataset (a large-scale document-level Chinese-English parallel dataset for machine translation). Is there any guide on how to do this?
@paplorinc Those are embeddings. You must pick a service to create embeddings and then have the model query on those embeddings (a minimal local sketch of that step is shown below).
I am seeking information on extending the model's knowledge to include our data. I understand that such operations will require high-end GPUs and whatnot, but if that is the route folks want to go, is there documentation on how to approach and do model training?
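To illustrate the "create embeddings and have the model query on those embeddings" step, here is a minimal local sketch using sentence-transformers (my choice for the example, not something prescribed in this thread). The chunks and the question are made up; in practice the best-matching chunk would be pasted into the prompt you send to the local model:

```python
# Minimal local embed-and-query sketch with sentence-transformers (example choice only).
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")   # small embedding model, runs fine on CPU

# Pretend these are chunks extracted from your own files.
chunks = [
    "Invoices are stored in the finance folder on the shared drive.",
    "The deployment script lives in scripts/deploy.sh and runs nightly.",
]
chunk_vecs = model.encode(chunks, convert_to_tensor=True)

# At question time: embed the question and find the most similar chunk.
query_vec = model.encode("Where are the invoices kept?", convert_to_tensor=True)
scores = util.cos_sim(query_vec, chunk_vecs)[0]
best_chunk = chunks[int(scores.argmax())]

# The retrieved chunk is then passed to the local LLM as context for the answer.
print(best_chunk)
```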
Yes, we can use a combination of retraining, fine-tuning and embedding, each having a different effect. I'm currently fine-tuning one in a Colab Pro+ notebook - it requires a >40 GB video card, >200 GB of free space (or a batch size of 1 and at most 2 epochs), and an insane amount of training data and time. And it also costs a lot for now, so you may want to try embeddings first.
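For anyone wondering what that kind of fine-tuning looks like in code, below is a heavily simplified LoRA sketch using Hugging Face transformers, datasets, and peft. The base model name, the train.json file, and every hyperparameter are placeholders for illustration, not the exact setup from the Colab notebook:

```python
# Simplified LoRA fine-tuning sketch (placeholders throughout, not the exact Colab setup).
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

base = "huggyllama/llama-7b"                       # placeholder base model
tokenizer = AutoTokenizer.from_pretrained(base)
tokenizer.pad_token = tokenizer.eos_token          # LLaMA has no pad token by default
model = AutoModelForCausalLM.from_pretrained(base, device_map="auto")

# LoRA trains only small low-rank adapter matrices on top of the frozen base model,
# which is what makes single-GPU fine-tuning feasible at all.
model = get_peft_model(model, LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM",
))

# One JSON record per training example, e.g. {"text": "### Instruction: ... ### Response: ..."}
data = load_dataset("json", data_files="train.json")["train"]
data = data.map(lambda ex: tokenizer(ex["text"], truncation=True, max_length=512))

trainer = Trainer(
    model=model,
    train_dataset=data,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
    args=TrainingArguments(output_dir="lora-out", per_device_train_batch_size=1,
                           gradient_accumulation_steps=8, num_train_epochs=2,
                           learning_rate=2e-4, fp16=True),
)
trainer.train()
model.save_pretrained("lora-out")                  # writes only the small adapter weights
```

Only the adapter matrices are trained and saved; at inference time they are loaded on top of the unchanged base model.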
@paplorinc
I'm currently fine-tuning one in a Colab Pro+ notebook - it requires a >40 GB video card, >200 GB of free space (or a batch size of 1 and at most 2 epochs), and an insane amount of training data and time. And it also costs a lot for now, so you may want to try embeddings first.
I was considering buying a Mac with an M2 Max and 96 GB of unified memory. On paper it looks tailor-made for GPT4All, but I'm not sure. Do you think it would be a good choice? 🤔
@Emasoft Some models, small models at least, allow you to train on CPUs instead of GPUs. For example, have a look at NanoGPT. It could be done, but I am no expert. Especially with Apple's unified architecture: if the training process is optimized for Apple's M1/M2, then there is a chance that having that 96 GB of unified memory will be good for training small models and getting started.
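Not sure about GPT4All training specifically, but PyTorch does expose Apple's unified-memory GPU through its "mps" backend, so picking a device for a small training run can be sketched like this (toy model and shapes, purely illustrative; whether a real training script benefits depends on how well its ops are supported on that backend):

```python
# Toy sketch: choosing mps (Apple silicon), cuda, or cpu for a small training step.
import torch

device = (
    torch.device("mps") if torch.backends.mps.is_available()
    else torch.device("cuda") if torch.cuda.is_available()
    else torch.device("cpu")
)
print(f"training on {device}")

model = torch.nn.Linear(128, 128).to(device)        # stand-in for a small model
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

x = torch.randn(32, 128, device=device)             # toy batch
loss = model(x).pow(2).mean()                       # dummy objective
loss.backward()
optimizer.step()
```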
Not (yet) an expert either, but it's probably cheaper and more convenient to rent those GPUs for the duration of the training only - you won't really be able to use the laptop during training, which may take hours/days...
For example, this guy trained his foreign-language model (only a few thousand prompts, though) cheaply: https://youtu.be/yTROqe8T_eA?t=1061
Look what I just found: ~https://github.com/lxe/simple-llm-finetuner~ https://github.com/zetavg/LLaMA-LoRA-Tuner
With slight modification you can get a public link in colab to a UI where you can just add your data and fine-tune it instantly!
Thanks! If you don't mind, would you be able to document your process? Will certainly look into this!
Stale, please open a new, updated issue if this is still relevant to you.
Seems solved
I wonder if one day it will be possible to train small models locally, with just 8 GB of RAM and some 50-60 PDFs, that turn out more useful than big models and GPU cards.
This is a great topic for the Discord or the Discussions tab. Issues are better for requesting some specific enhancement to GPT4All.