WikiChat
Things to adjust when loading Local LLMs other than the distilled models
Hi. Thanks for working on this project.
I was looking into optimizing WikiChat on hardware-limited devices using local LLMs, preferably low-parameter models.
Which components should I look into to run models other than the distilled Llama models? I've tried loading Mistral and other Llama models, but every time I query, the demo fails with the error `Search prompt's output is invalid: [answer from the LLM]`.
Is it in any way related to how the prompt files interact with the LLM?
Any help will be appreciated!
Hi,
Could you please paste the command you are using to run WikiChat, the models you have been trying it with, and their configuration in `llm_config.yaml`?
WikiChat has a setting which determines how it sends the prompt to the underlying LLM.
- In "distilled" mode, the code assumes that the LLM has been fine-tuned with the specific format needed, so it does not send the full instruction or few-shot examples to the LLM, just a short identifier.
- In the "non-distilled" mode, it will send the full text of the prompt, including full instructions and few-shot examples to the LLM.
Using the incorrect mode results in unexpected outputs or outputs in the wrong format, so the code raises an error.
The mode can be set by modifying this line in `llm_config.yaml`: https://github.com/stanford-oval/WikiChat/blob/main/llm_config.yaml#L46 (I fixed a typo just now). You should use `none` as the prompt format for new models.
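For example, for a model that has not been distilled, you would change just that one value in the local endpoint entry (a sketch; the key name is the one used in the repository's `llm_config.yaml`):

```yaml
prompt_format: none  # "none" for models that have not been distilled for WikiChat
```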
Note that the current version of the code does not handle the chat format of open-source models that have been instruction-tuned (like LLaMA-Instruct). And depending on the underlying model's capabilities, they might not be good at following instructions, so they might still give incorrect outputs.
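To illustrate what "chat format" means here, below is a hypothetical sketch using Hugging Face `transformers` (not WikiChat code). Instruction-tuned models expect their prompts wrapped in a model-specific template, whereas WikiChat currently sends the plain prompt text without that wrapping:

```python
# Hypothetical illustration, not part of WikiChat: shows the chat template that an
# instruction-tuned model such as mistralai/Mistral-7B-Instruct-v0.2 expects.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")
messages = [{"role": "user", "content": "Who directed Oppenheimer?"}]

# apply_chat_template wraps the message in the model's expected markers
# (for Mistral-Instruct, "[INST] ... [/INST]"). Without this wrapping, the
# instruction-tuned model may not follow the expected output format.
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(prompt)  # e.g. "<s>[INST] Who directed Oppenheimer? [/INST]"
```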
In general, I recommend "distilling" a smaller model, similar to how Llama-2-7b-WikiChat was distilled. I can probably release a smaller distilled model in the future, if that helps.
Hi @s-jse, thanks for the detailed reply.
The command used to run the demo: `make demo engine=local`
Models I've tried loading so far: `TheBloke/Llama-2-7B-32K-Instruct-AWQ`, `mistralai/Mistral-7B-Instruct-v0.2`, and here is the `llm_config.yaml` file (only the local part, as I've left the other parts unchanged):
```yaml
# Use this endpoint if you are hosting your language model locally, e.g. via HuggingFace's text-generation-inference library
- api_type: local
  api_version: null
  api_base: http://0.0.0.0:8080 # replace [ip] and [port] with the ip address and the port number where your inference server is running. You can use 0.0.0.0 for the IP address if your inference server is running on this machine.
  prompt_format: simple # One of simple, alpaca or none. Use simple if you are using zero-shot or few-shot prompting. Use simple or alpaca if you are using a distilled model, depending on what format the model has been trained with.
  engine_map:
    local: local
```
I'll try setting `prompt_format: none` and see how it works.