
Getting Slow Inference Time on CPU using GPT4All


Issue with current documentation:

I have been trying to use GPT4All models, especially ggml-gpt4all-j-v1.3-groovy.bin, to build my own chatbot that can answer questions about some documents using Langchain.

I had a hard time integrating the model. Can somebody please give me some direction here?

  1. I was getting about 150 s inference time on average per query using the above model, while I expected it to be fast considering it uses the ggml library. Am I doing something wrong? Is this completely normal?
  2. Do we have GPU support for the above models?

My machine: i7 with 32 GB RAM, but only an integrated GPU. I also tried to run the example of using GPT4All with Langchain, but it gave similar results.

Please tell me how I can improve this.

Idea or request for content:

No response

eshaanagarwal · Jun 04 '23 20:06

For now, everything runs completely on the CPU.

> Do we have GPU support for the above models?

It's a work-in-progress at this stage.

Inference time depends a lot on the prompt/query size, so I'm not sure if what you described should be considered normal or not. Have you tried just running the chat application? How did you go about the installation? Did it actually top out your CPU while processing?

cosmic-snow · Jun 04 '23 21:06

I actually saw an issue related to GPT4ALLGPU and got confused about whether it is even part of the library.

I haven't run the GPT4All chat application by itself, but I don't understand: if the same models are under the hood and there isn't any particular trick for speeding up inference, why is it slow?

Given the nature of my task, the LLM has to digest a large number of tokens, but I did not expect the speed to drop on such a scale.

Is there a better way to deal with this situation? I am making a chatbot.

eshaanagarwal · Jun 05 '23 03:06

> I actually saw an issue related to GPT4ALLGPU and got confused about whether it is even part of the library.

It'll happen sometime soon I think. If nothing else helps, wait for that.

> I haven't run the GPT4All chat application by itself, but I don't understand: if the same models are under the hood and there isn't any particular trick for speeding up inference, why is it slow?

Yes, it's the same. But it'd give you (and maybe both of us) a point of reference; that's why I asked. Your initial post was light on details.

Again, did it actually top out your CPU while processing? If not, maybe try increasing the number of threads (set_thread_count()) and maybe even play around with the batch size (n_batch parameter).
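
Roughly something like this with the Python bindings -- just an untested sketch; I'm assuming set_thread_count() is reachable through the underlying model object in your version of the bindings, so adjust to wherever it actually lives:

```python
from gpt4all import GPT4All

# Load the model (downloads it on first use if it isn't cached locally).
model = GPT4All("ggml-gpt4all-j-v1.3-groovy.bin")

# Assumption: the thread setting is exposed on the wrapped low-level model object;
# check where set_thread_count() is defined in your version of the bindings.
model.model.set_thread_count(8)  # e.g. your number of physical cores

# n_batch controls how many prompt tokens are processed at once.
response = model.generate("Summarize the attached document in two sentences.", n_batch=16)
print(response)
```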

Edit: Ah right, I've found something mentioning GPT4ALLGPU -- that was not what I meant and I don't know about that one (i.e. whether it could be made to work with how things currently stand).

cosmic-snow · Jun 05 '23 06:06

Hey, I will provide the details in some time. I haven't tried threading. Can you please give some sample code or a way to run it? I think it might not solve the problem entirely, but it would surely make things a bit faster.

Yeah, I guess I don't know about GPT4ALLGPU, as there isn't any mention of it in the documentation.

eshaanagarwal · Jun 06 '23 05:06

I have never tried Langchain myself, so I can't really give you instructions for that. My comments were meant for the Python bindings themselves, on which Langchain builds, I'm sure. Threading is enabled by default, but you can increase the number of threads.

Sorry, but you'll have to figure out how Langchain does that and how to pass options, or reach into and manipulate it yourself.

Alternatives:

  • Use the Python bindings directly.
  • Use the underlying llama.cpp project instead, on which GPT4All builds (with a compatible model). See its Readme; there seem to be some Python bindings for that, too, and it already has working GPU support (rough sketch below).
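
For the llama.cpp route, the llama-cpp-python bindings let you set threads and batch size directly -- again just a rough, untested sketch; the model path is a placeholder and you should double-check the parameter names against that project's Readme:

```python
# pip install llama-cpp-python
from llama_cpp import Llama

llm = Llama(
    model_path="./models/ggml-model-q4_0.bin",  # placeholder: needs a llama.cpp-compatible model file
    n_threads=8,    # match your physical core count
    n_batch=512,    # prompt-processing batch size
    # n_gpu_layers=32,  # only if the package was built with GPU (e.g. cuBLAS) support
)

out = llm("Q: Summarize the document in two sentences. A:", max_tokens=128)
print(out["choices"][0]["text"])
```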

cosmic-snow · Jun 06 '23 05:06

See docs.gpt4all.io for details about why local LLMs may be slow on your computer. GPU support is in development and many issues have been raised about it. Closing.

AndriyMulyar · Jun 18 '23 18:06