
Things to adjust when loading Local LLMs other than the distilled models

turjo-001 opened this issue 9 months ago • 2 comments

Hi. Thanks for working on this project.

I was looking into optimizing WikiChat on hardware-limited devices using local LLMs, preferably low-parameter models.

What components should I look into to run models other than the distilled Llama models? I've tried loading Mistral and other Llama models, but every time I query, the demo fails with the error "Search prompt's output is invalid: [answer from the LLM]".

Is it in any way related to how the prompt files behave with the LLM?

Any help will be appreciated!

turjo-001 avatar May 12 '24 10:05 turjo-001

Hi,

Could you please paste the command you are using to run WikiChat, the models you have been trying it with, and their configuration in llm_config.yaml?

WikiChat has a setting which determines how it sends the prompt to the underlying LLM.

  • In "distilled" mode, the code assumes that the LLM has been fine-tuned with the specific format needed, so it does not send the full instruction or few-shot examples to the LLM, just a short identifier.
  • In the "non-distilled" mode, it will send the full text of the prompt, including full instructions and few-shot examples to the LLM.

Using the incorrect mode will result in unexpected outputs or outputs with an incorrect format, so the code will raise an error. The mode can be set by modifying this line in llm_config.yaml: https://github.com/stanford-oval/WikiChat/blob/main/llm_config.yaml#L46 (I fixed a typo there just now). You should use none as the prompt format for new models.
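For example, a local endpoint entry for a new, non-distilled model could look roughly like the sketch below; the api_base and comments here are just placeholders, so keep whatever already matches your setup:

  - api_type: local
    api_version: null
    api_base: http://0.0.0.0:8080 # wherever your inference server is listening
    prompt_format: none # non-distilled models get the full instructions and few-shot examples
    engine_map:
      local: local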

Note that the current version of the code does not handle the chat format of open-source models that have been instruction-tuned (like LLaMA-Instruct). Also, depending on the underlying model's capabilities, it might not be good at following instructions, so it might still give incorrect outputs.

In general, I recommend "distilling" a smaller model, similar to how Llama-2-7b-WikiChat was distilled. I can probably release a smaller distilled model in the future, if that helps.

s-jse avatar May 12 '24 17:05 s-jse

Hi @s-jse, thanks for the detailed reply.

The command used to run the demo: make demo engine=local

Models I've tried loading so far: TheBloke/Llama-2-7B-32K-Instruct-AWQ, mistralai/Mistral-7B-Instruct-v0.2

And the llm_config.yaml file (only the local part, as I've left the other parts unchanged):

# Use this endpoint if you are hosting your language model locally, e.g. via HuggingFace's text-generation-inference library
  - api_type: local
    api_version: null
    api_base: http://0.0.0.0:8080 # replace [ip] and [port] with the ip address and the port number where your inference server is running. You can use 0.0.0.0 for the IP address if your inference server is running on this machine.
    prompt_format: simple # One of simple, alpaca or none. Use simple if you are using zero-shot or few-shot prompting. Use simple or alpaca  if you are using a distilled model, depending on what format the model has been trained with.
    engine_map:
      local: local

I'll try setting prompt_format to none and see how it works.
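In other words, keeping everything else the same and changing only this line (if I've understood correctly):

    prompt_format: none # was: simple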

turjo-001 avatar May 14 '24 08:05 turjo-001