fenixlam
update: OK, I tested it with 12 threads in both GPT4All() and LlamaCppEmbeddings. Its speed increased hugely, from 527 seconds to 216 seconds. I see there is GPT4All() and RetrievalQA.from_chain_type()....
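For reference, here is a minimal sketch of where those thread settings go, assuming the LangChain wrappers used here expose an `n_threads` argument; the model paths and the Chroma vector store are placeholders, not from the original setup:

```python
# Minimal sketch: pass n_threads=12 to both the LLM and the embedding model.
# Model paths and the vector store directory below are hypothetical examples.
from langchain.llms import GPT4All
from langchain.embeddings import LlamaCppEmbeddings
from langchain.vectorstores import Chroma
from langchain.chains import RetrievalQA

# Giving both components 12 threads is what cut the run from ~527 s to ~216 s in my test.
llm = GPT4All(model="models/ggml-gpt4all-j.bin", n_threads=12)                            # hypothetical path
embeddings = LlamaCppEmbeddings(model_path="models/ggml-model-q4_0.bin", n_threads=12)    # hypothetical path

db = Chroma(persist_directory="db", embedding_function=embeddings)                        # hypothetical store
qa = RetrievalQA.from_chain_type(llm=llm, chain_type="stuff", retriever=db.as_retriever())

print(qa.run("What does the document say about thread count?"))
```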
I tested the application and it requires 32GB of VRAM to run, so it is best to run the program on RunPod or another GPU-provisioned server.
My Anaconda Python 3.10.9 does not have this problem... so that could be a workaround?
I think the most important thing is... how did you find these parameters?? I have tested it by generating a text paragraph and it looks good!
@wal58 For the 13B model, you can just download it and load it the same way as the 7B model with the parameter > ./chat -m [your 13B model]. I remember the 13B model's base...
Try using ./chat -n 4096?
Add a parameter to let the user choose between GPU and CPU... even DirectML XD would be the best choice.
https://huggingface.co/Pi3141/alpaca-lora-13B-ggml/tree/main ?