
How to train the model with my own files

Open chakkaradeep opened this issue 2 years ago • 11 comments

I am new to LLMs and trying to figure out how to train the model with a bunch of files.

I want to train the model with my files (living in a folder on my laptop) and then be able to use the model to ask questions and get answers.

With OpenAI, folks have suggested using their Embeddings API, which turns chunks of text into vectors and then has the model work on those. Would a similar approach work here? Given that I have the model locally, I was hoping I wouldn't need the OpenAI Embeddings API and could instead do everything, including training, locally.

Any help much appreciated. Thanks!

chakkaradeep avatar Apr 16 '23 07:04 chakkaradeep

My understanding is that embeddings and retraining (fine-tuning) are different. If you just want extra info, you can embed; if you want new knowledge or style, you probably need to fine-tune.

See https://www.youtube.com/watch?v=8ZW1E017VEc&t=2112s

l0rinc avatar Apr 16 '23 11:04 l0rinc

> My understanding is that embeddings and retraining (fine-tuning) are different. If you just want extra info, you can embed; if you want new knowledge or style, you probably need to fine-tune.
>
> See https://www.youtube.com/watch?v=8ZW1E017VEc&t=2112s

Thanks! Looks like for normal use cases, embeddings are the way to go. I have used LangChain to create embeddings with OpenAI. However, as I mentioned before, in that scenario you still talk to the OpenAI Embeddings API to create the embeddings. Do you know of any local Python libraries that create embeddings?

chakkaradeep avatar Apr 16 '23 16:04 chakkaradeep

I saw a few posts about it, e.g. https://github.com/nomic-ai/gpt4all/issues/173#issuecomment-1496681937. My understanding is that you can use gpt4all with LangChain (https://python.langchain.com/en/latest/modules/models/llms/integrations/gpt4all.html) and use indexes (https://python.langchain.com/en/latest/modules/indexes/getting_started.html).
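
A rough sketch of what that could look like end to end, for reference only - the paths and model names below are placeholders, and the LangChain imports follow the docs linked above, so they may have moved in newer releases:

```python
from langchain.document_loaders import DirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings   # local embeddings, no OpenAI call
from langchain.vectorstores import Chroma
from langchain.llms import GPT4All
from langchain.chains import RetrievalQA

# Load and chunk the local files (placeholder folder).
docs = DirectoryLoader("./my_docs").load()
chunks = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50).split_documents(docs)

# Embed the chunks locally and keep them in a local vector store.
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
db = Chroma.from_documents(chunks, embeddings)

# Point the GPT4All LLM wrapper at a locally downloaded model file (placeholder path).
llm = GPT4All(model="./models/gpt4all-model.bin")

# Retrieval-augmented QA: fetch the most similar chunks and let the model answer from them.
qa = RetrievalQA.from_chain_type(llm=llm, retriever=db.as_retriever())
print(qa.run("What do my documents say about X?"))
```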

I personally have to retrain, since I need a new language - not even sure that's possible yet.

l0rinc avatar Apr 16 '23 18:04 l0rinc

I have the same problem. I need to train gpt4all with the BWB dataset (a large-scale, document-level Chinese-English parallel dataset for machine translation). Is there any guide on how to do this?

Emasoft avatar Apr 17 '23 06:04 Emasoft

@paplorinc Those are embeddings. You have to pick a service to create the embeddings and then have the model query against them.

I am seeking information on extending the model's knowledge to include our data. I understand that such operations will require high-end GPUs and whatnot, but if that is the route folks want to go, is there documentation on how to approach and carry out model training?

chakkaradeep avatar Apr 17 '23 16:04 chakkaradeep

Yes, we can use a combination of retraining, fine-tuning and embedding, each having a different effect. I'm currently fine-tuning one in a Colab Pro+ notebook - it requires a >40 GB video card, >200 GB of free space (or a batch size of 1 and at most 2 epochs) and an insane amount of training data and time. It also costs a lot for now, so you may want to try embeddings first.
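
Roughly the shape of such a run, as a sketch only - this uses LoRA via the peft library (one common way to fit fine-tuning on a single GPU), and the base model name, data file and hyperparameters are placeholders rather than an exact recipe:

```python
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer, Trainer,
                          TrainingArguments, DataCollatorForLanguageModeling)
from peft import LoraConfig, get_peft_model

base_model = "your-base-model"             # placeholder: any causal LM checkpoint
tokenizer = AutoTokenizer.from_pretrained(base_model)
tokenizer.pad_token = tokenizer.eos_token  # some tokenizers ship without a pad token
model = AutoModelForCausalLM.from_pretrained(base_model)

# LoRA freezes the base weights and trains small adapter matrices instead,
# which is what makes this fit on a single (large) GPU at all.
# The target module names depend on the model architecture.
model = get_peft_model(model, LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05,
                                         target_modules=["q_proj", "v_proj"],
                                         task_type="CAUSAL_LM"))

# Placeholder training data: one JSON object per line with a "text" field.
data = load_dataset("json", data_files="train.jsonl")["train"]
data = data.map(lambda row: tokenizer(row["text"], truncation=True, max_length=512))

trainer = Trainer(
    model=model,
    train_dataset=data,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
    args=TrainingArguments(
        output_dir="lora-out",
        per_device_train_batch_size=1,   # batch size 1, as mentioned above
        num_train_epochs=2,              # at most 2 epochs
        fp16=True,
        logging_steps=10,
    ),
)
trainer.train()
```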

l0rinc avatar Apr 17 '23 17:04 l0rinc

@paplorinc

> I'm currently fine-tuning one in a Colab Pro+ notebook - it requires a >40 GB video card, >200 GB of free space (or a batch size of 1 and at most 2 epochs) and an insane amount of training data and time. It also costs a lot for now, so you may want to try embeddings first.

I was considering buying a Mac with an M2 Max and 96 GB of unified memory. On paper it looks tailor-made for GPT4All, but I'm not sure. Do you think it would be a good choice? 🤔

Emasoft avatar Apr 17 '23 18:04 Emasoft

@Emasoft some models (well, let me say small models) allow you to train on CPUs instead of GPUs. For example, have a look at NanoGPT. It could be done, but I am no expert. Especially with Apple's unified architecture, if the training process is optimized for Apple's M1/M2, then there is a chance that having that 96 GB of unified memory will be good for training small models and getting started.
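
For what it's worth, PyTorch can already target Apple-silicon GPUs through its MPS backend, so a training script can pick the best available device and fall back to the CPU; a minimal sketch (the tiny model here is just a stand-in):

```python
import torch

# Pick the best available backend: CUDA GPU, Apple-silicon GPU (MPS), or plain CPU.
if torch.cuda.is_available():
    device = torch.device("cuda")
elif torch.backends.mps.is_available():
    device = torch.device("mps")   # uses the M1/M2 GPU and the unified memory pool
else:
    device = torch.device("cpu")

model = torch.nn.Linear(128, 128).to(device)   # stand-in for a real model
x = torch.randn(4, 128, device=device)
print(device, model(x).shape)
```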

chakkaradeep avatar Apr 17 '23 18:04 chakkaradeep

Not (yet) an expert either, but it's probably cheaper and more convenient to rent those GPUs for the duration of the training only - you won't really be able to use the laptop during training, which may take hours/days...

For example, this guy trained his model for a foreign language (only a few thousand prompts, though) cheaply: https://youtu.be/yTROqe8T_eA?t=1061

l0rinc avatar Apr 17 '23 18:04 l0rinc

Look what I just found: ~~https://github.com/lxe/simple-llm-finetuner~~ https://github.com/zetavg/LLaMA-LoRA-Tuner

With slight modification you can get a public link in Colab to a UI where you can just add your data and fine-tune it instantly!
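
(For context: the public link in Colab typically comes from Gradio's share option rather than anything specific to that repo. A minimal sketch of the general pattern, with a placeholder function standing in for the tuner's own logic:)

```python
import gradio as gr

def finetune(training_text: str) -> str:
    # Placeholder: a real tuner would kick off a fine-tuning run on the uploaded data.
    return f"Received {len(training_text)} characters of training data."

demo = gr.Interface(fn=finetune, inputs="text", outputs="text")
# share=True tunnels the local UI through a temporary public URL - handy inside Colab.
demo.launch(share=True)
```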

l0rinc avatar Apr 17 '23 20:04 l0rinc

> Look what I just found: ~~https://github.com/lxe/simple-llm-finetuner~~ https://github.com/zetavg/LLaMA-LoRA-Tuner
>
> With slight modification you can get a public link in Colab to a UI where you can just add your data and fine-tune it instantly!

Thanks! If you don't mind, would you be able to document your process? I will certainly look into this!

chakkaradeep avatar Apr 17 '23 23:04 chakkaradeep

Stale, please open a new, updated issue if this is still relevant to you.

niansa avatar Aug 11 '23 11:08 niansa

Seems solved

niansa avatar Aug 11 '23 11:08 niansa

I wonder whether one day it will be possible to train small models locally, on just 8 GB of RAM and some 50-60 PDFs, that would be more useful than big models and GPU cards.

neel-jay avatar Jun 12 '24 13:06 neel-jay

> I wonder whether one day it will be possible to train small models locally, on just 8 GB of RAM and some 50-60 PDFs, that would be more useful than big models and GPU cards.

This is a great topic for the Discord or the Discussions tab. Issues are better for requesting some specific enhancement to GPT4All.

cebtenzzre avatar Jun 12 '24 14:06 cebtenzzre