
How to train the model with my own files

Open chakkaradeep opened this issue 2 years ago • 11 comments

I am new to LLMs and trying to figure out how to train the model with a bunch of files.

I want to train the model with my files (living in a folder on my laptop) and then be able to use the model to ask questions and get answers.

With OpenAI, folks have suggested using their Embeddings API, which turns chunks of text into vectors and then has the model work on those. Would a similar approach work here? Given that I have the model locally, I was hoping I wouldn't need the OpenAI Embeddings API and could instead do everything, including training, locally.

Any help much appreciated. Thanks!

chakkaradeep avatar Apr 16 '23 07:04 chakkaradeep

My understanding is that embeddings and retraining (fine-tuning) are different. If you just want extra info, you can embed; if you want new knowledge or style, you probably need to fine-tune.

See https://www.youtube.com/watch?v=8ZW1E017VEc&t=2112s

l0rinc avatar Apr 16 '23 11:04 l0rinc

> My understanding is that embeddings and retraining (fine-tuning) are different. If you just want extra info, you can embed; if you want new knowledge or style, you probably need to fine-tune.
>
> See https://www.youtube.com/watch?v=8ZW1E017VEc&t=2112s

Thanks! Looks like for normal use cases, embeddings are the way to go. I have used LangChain to create embeddings with OpenAI. However, as I mentioned before, in that scenario you still talk to the OpenAI Embeddings API to create the embeddings. Do you know of any local Python libraries that create embeddings?

chakkaradeep avatar Apr 16 '23 16:04 chakkaradeep

I saw a few posts about it, e.g. https://github.com/nomic-ai/gpt4all/issues/173#issuecomment-1496681937. My understanding is that you can use gpt4all with LangChain (https://python.langchain.com/en/latest/modules/models/llms/integrations/gpt4all.html) and use indexes (https://python.langchain.com/en/latest/modules/indexes/getting_started.html).
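
A rough sketch of what that could look like end to end, for reference only - the paths and model names below are placeholders, and the LangChain imports follow the docs linked above, so they may have moved in newer releases:

```python
from langchain.document_loaders import DirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings   # local embeddings, no OpenAI call
from langchain.vectorstores import Chroma
from langchain.llms import GPT4All
from langchain.chains import RetrievalQA

# Load and chunk the local files (placeholder folder).
docs = DirectoryLoader("./my_docs").load()
chunks = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50).split_documents(docs)

# Embed the chunks locally and keep them in a local vector store.
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
db = Chroma.from_documents(chunks, embeddings)

# Point the GPT4All LLM wrapper at a locally downloaded model file (placeholder path).
llm = GPT4All(model="./models/gpt4all-model.bin")

# Retrieval-augmented QA: fetch the most similar chunks and let the model answer from them.
qa = RetrievalQA.from_chain_type(llm=llm, retriever=db.as_retriever())
print(qa.run("What do my documents say about X?"))
```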

I personally have to retrain, since I need a new language - not even sure that's possible yet.

l0rinc avatar Apr 16 '23 18:04 l0rinc

I have the same problem. I need to train gpt4all with the BWB dataset (a large-scale, document-level Chinese-English parallel dataset for machine translation). Is there any guide on how to do this?

Emasoft avatar Apr 17 '23 06:04 Emasoft

@paplorinc Those are embeddings. You have to pick a service to create the embeddings and then have the model query against them.

I am seeking information on extending the model's knowledge to include our data. I understand that such operations will require high-end GPUs and whatnot, but if that is the route folks want to go, is there documentation on how to approach and carry out model training?

chakkaradeep avatar Apr 17 '23 16:04 chakkaradeep

Yes, we can use a combination of retraining, fine-tuning and embedding, each having a different effect. I'm currently fine-tuning one in a Colab Pro+ notebook - it requires a >40 GB video card, >200 GB of free space (or a batch size of 1 and at most 2 epochs) and an insane amount of training data and time. It also costs a lot for now, so you may want to try embeddings first.
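
Roughly the shape of such a run, as a sketch only - this uses LoRA via the peft library (one common way to fit fine-tuning on a single GPU), and the base model name, data file and hyperparameters are placeholders rather than an exact recipe:

```python
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer, Trainer,
                          TrainingArguments, DataCollatorForLanguageModeling)
from peft import LoraConfig, get_peft_model

base_model = "your-base-model"             # placeholder: any causal LM checkpoint
tokenizer = AutoTokenizer.from_pretrained(base_model)
tokenizer.pad_token = tokenizer.eos_token  # some tokenizers ship without a pad token
model = AutoModelForCausalLM.from_pretrained(base_model)

# LoRA freezes the base weights and trains small adapter matrices instead,
# which is what makes this fit on a single (large) GPU at all.
# The target module names depend on the model architecture.
model = get_peft_model(model, LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05,
                                         target_modules=["q_proj", "v_proj"],
                                         task_type="CAUSAL_LM"))

# Placeholder training data: one JSON object per line with a "text" field.
data = load_dataset("json", data_files="train.jsonl")["train"]
data = data.map(lambda row: tokenizer(row["text"], truncation=True, max_length=512))

trainer = Trainer(
    model=model,
    train_dataset=data,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
    args=TrainingArguments(
        output_dir="lora-out",
        per_device_train_batch_size=1,   # batch size 1, as mentioned above
        num_train_epochs=2,              # at most 2 epochs
        fp16=True,
        logging_steps=10,
    ),
)
trainer.train()
```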

l0rinc avatar Apr 17 '23 17:04 l0rinc

@paplorinc

> I'm currently fine-tuning one in a Colab Pro+ notebook - it requires a >40 GB video card, >200 GB of free space (or a batch size of 1 and at most 2 epochs) and an insane amount of training data and time. It also costs a lot for now, so you may want to try embeddings first.

I was considering buying a Mac with an M2 Max and 96 GB of unified memory. On paper it looks tailor-made for GPT4All, but I'm not sure. Do you think it would be a good choice? 🤔

Emasoft avatar Apr 17 '23 18:04 Emasoft

@Emasoft some models (well, let me say small models) allow you to train on CPUs instead of GPUs. For example, have a look at NanoGPT. It could be done, but I am no expert. Especially with Apple's unified architecture, if the training process is optimized for Apple's M1/M2, then there is a chance that having that 96 GB of unified memory will be good for training small models and getting started.
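
For what it's worth, PyTorch can already target Apple-silicon GPUs through its MPS backend, so a training script can pick the best available device and fall back to the CPU; a minimal sketch (the tiny model here is just a stand-in):

```python
import torch

# Pick the best available backend: CUDA GPU, Apple-silicon GPU (MPS), or plain CPU.
if torch.cuda.is_available():
    device = torch.device("cuda")
elif torch.backends.mps.is_available():
    device = torch.device("mps")   # uses the M1/M2 GPU and the unified memory pool
else:
    device = torch.device("cpu")

model = torch.nn.Linear(128, 128).to(device)   # stand-in for a real model
x = torch.randn(4, 128, device=device)
print(device, model(x).shape)
```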

chakkaradeep avatar Apr 17 '23 18:04 chakkaradeep

Not (yet) an expert either, but it's probably cheaper and more convenient to rent those GPUs for the duration of the training only - you won't really be able to use the laptop during training, which may take hours/days...

For example, this guy trained his model for a foreign language (only a few thousand prompts, though) cheaply: https://youtu.be/yTROqe8T_eA?t=1061

l0rinc avatar Apr 17 '23 18:04 l0rinc

Look what I just found: ~~https://github.com/lxe/simple-llm-finetuner~~ https://github.com/zetavg/LLaMA-LoRA-Tuner

With slight modification you can get a public link in Colab to a UI where you can just add your data and fine-tune it instantly!
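
(For context: the public link in Colab typically comes from Gradio's share option rather than anything specific to that repo. A minimal sketch of the general pattern, with a placeholder function standing in for the tuner's own logic:)

```python
import gradio as gr

def finetune(training_text: str) -> str:
    # Placeholder: a real tuner would kick off a fine-tuning run on the uploaded data.
    return f"Received {len(training_text)} characters of training data."

demo = gr.Interface(fn=finetune, inputs="text", outputs="text")
# share=True tunnels the local UI through a temporary public URL - handy inside Colab.
demo.launch(share=True)
```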

l0rinc avatar Apr 17 '23 20:04 l0rinc

> Look what I just found: ~~https://github.com/lxe/simple-llm-finetuner~~ https://github.com/zetavg/LLaMA-LoRA-Tuner
>
> With slight modification you can get a public link in Colab to a UI where you can just add your data and fine-tune it instantly!

Thanks! If you don't mind, would you be able to document your process? I will certainly look into this!

chakkaradeep avatar Apr 17 '23 23:04 chakkaradeep

Stale, please open a new, updated issue if this is still relevant to you.

niansa avatar Aug 11 '23 11:08 niansa

Seems solved

niansa avatar Aug 11 '23 11:08 niansa

I wonder whether one day it will be possible to train small models locally, on just 8 GB of RAM and some 50-60 PDFs, that would be more useful than big models and GPU cards.

neel-jay avatar Jun 12 '24 13:06 neel-jay

> I wonder whether one day it will be possible to train small models locally, on just 8 GB of RAM and some 50-60 PDFs, that would be more useful than big models and GPU cards.

This is a great topic for the Discord or the Discussions tab. Issues are better for requesting some specific enhancement to GPT4All.

cebtenzzre avatar Jun 12 '24 14:06 cebtenzzre