
Is there an interactive mode?

Open BoQsc opened this issue 2 years ago • 12 comments

BoQsc avatar Apr 02 '23 06:04 BoQsc

Hey @BoQsc, could you clarify a bit more what you mean by interactive mode? Could you give an example?

awaelchli avatar Apr 02 '23 13:04 awaelchli

In the README file it is shown that each prompt is taken as an argument to the program. Interactive mode is when you interact using prompts in a chat-like manner.

The current non-interactive mode presented in the README.md:

(Demo GIF from the README: https://pl-public-data.s3.amazonaws.com/assets_lightning/Llama_pineapple.gif)

BoQsc avatar Apr 02 '23 14:04 BoQsc

Right, like: you get a prompt cursor and start chatting. I'd be in favor, since it would avoid reloading the model repeatedly when testing multiple prompts interactively.

@BoQsc is this something you’d have time to contribute?

lantiga avatar Apr 02 '23 14:04 lantiga

This would be really cool. I was checking out termgpt and we can take inspiration from there. I'd love to work on this as a fun project. Let me know if @BoQsc or anyone from the community wants to collaborate 😄

aniketmaurya avatar Apr 19 '23 14:04 aniketmaurya

A simple while loop reading input, like this:

while True:
    prompt = input("Prompt: ")
    if not prompt:  # exit on empty input
        break

could already be an acceptable minimal version. I wouldn't go much further than that for the simple demo script unless there is good value. termgpt uses rich to format the output with colors and so on.
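
For illustration, an untested sketch of what rich-formatted output could look like (the model call is omitted and we just echo the prompt; only the Console calls are real rich API):

from rich.console import Console

console = Console()
while True:
    prompt = console.input("[bold cyan]Prompt:[/bold cyan] ")
    if not prompt:
        break
    # a real script would run generation here; we echo for illustration
    console.print(f"[green]{prompt}[/green]")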

awaelchli avatar Apr 19 '23 14:04 awaelchli

Hi, I started playing with it yesterday. As @awaelchli mentioned, that snippet does the job once you load the model.

However, a cool step would be to move towards a chatbot assistant.

Currently the prompt does not contain the past conversation, so the model cannot reply to questions like "What was the previous question I asked you?". Some way of concatenating the whole context of the conversation should be adopted. I tried with the 7B version fine-tuned with the finetune_lora.py script, and the problem there is that the instructions in the fine-tuning stage never contain multiple turns of dialogue. This can result in the model continuing the dialogue on its own for several turns, trying to predict the user's next prompt as well, and so on...
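
A rough sketch of the concatenation I have in mind (generate_reply is a placeholder for the actual tokenize/generate/decode steps, and the turn format would have to match whatever template the model was fine-tuned on):

history = []
while True:
    user_turn = input(">> User: ")
    if not user_turn:
        break
    history.append(f"User: {user_turn}")
    # feed the entire conversation so far back in as the prompt
    prompt = "\n".join(history) + "\nAssistant:"
    reply = generate_reply(prompt)  # placeholder, not a real lit-llama function
    history.append(f"Assistant: {reply}")
    print(">> Assistant:", reply)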

I write this just to say that possible scripts we could work on are:

  • interactive prompting (for pre-trained, lora-fine-tuned and adapter-fine-tuned models). Objective: avoid the overhead of repeatedly loading the model.
  • chat (for lora-fine-tuned and adapter-fine-tuned). Objective: something close to ChatGPT interface.

nicoladainese96 avatar Apr 20 '23 11:04 nicoladainese96

Yes, it would be great.

It would be cool to use Textual for the UI: https://www.textualize.io/#textual
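
Something along these lines, as an untested sketch (widget names are from the Textual docs; the model call is left as a comment):

from textual.app import App, ComposeResult
from textual.widgets import Input, RichLog

class ChatApp(App):
    """Minimal chat shell: a scrolling log plus a prompt box."""

    def compose(self) -> ComposeResult:
        yield RichLog(wrap=True)
        yield Input(placeholder="Type a prompt and press Enter")

    def on_input_submitted(self, event: Input.Submitted) -> None:
        log = self.query_one(RichLog)
        log.write(f">> {event.value}")
        # a real version would run generation here and write the reply
        self.query_one(Input).value = ""

if __name__ == "__main__":
    ChatApp().run()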

lantiga avatar Apr 20 '23 18:04 lantiga

Thank you all for the directions above. I guess modifying the code in generate_adapter.py like this will work for a simple one-step interactive mode?

Also, I guess we will need to leverage something like the ShareGPT data to fine-tune with multiple turns of dialogue?

generate_adapter.py

......

tokenizer = Tokenizer(tokenizer_path)

while True:
    prompt = input(">> Prompt:")
    if not prompt:  # empty input exits the loop
        break

    # wrap the raw prompt in the instruction template used during fine-tuning
    sample = {"instruction": prompt, "input": input_instruction}
    prompt = generate_prompt(sample)
    encoded = tokenizer.encode(prompt, bos=True, eos=False, device=model.device)

    print("Inferencing...")

    t0 = time.perf_counter()
    output = generate(
        model,
        idx=encoded,
        max_seq_length=max_new_tokens,
        max_new_tokens=max_new_tokens,
        temperature=temperature,
        top_k=top_k,
        eos_id=tokenizer.eos_id,
    )

    # keep only the text after the response marker
    output = tokenizer.decode(output)
    output = output.split("### Response:")[1].strip()
    print(">> lit-llama: ", output)

    t = time.perf_counter() - t0

    print(f"\nTime for inference: {t:.02f} sec total, {max_new_tokens / t:.02f} tokens/sec", file=sys.stderr)
    print(f"Memory used: {torch.cuda.max_memory_reserved() / 1e9:.02f} GB", file=sys.stderr)
    print("\n")

......
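
Note that this snippet assumes input_instruction, max_new_tokens, temperature and top_k are defined earlier in the script, and that generate_prompt produces a template ending in "### Response:", so splitting on that marker recovers just the model's reply.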

chakt avatar Apr 21 '23 16:04 chakt

very cool @chakt, wanna open a PR?

aniketmaurya avatar Apr 21 '23 17:04 aniketmaurya

I implemented one in https://github.com/Lightning-AI/lit-stablelm/blob/main/chat.py. It could be copied over to this repository.

carmocca avatar May 08 '23 16:05 carmocca

Just copy the code from lit-parrot's chat.py into lit-llama/generate.py ... that will give you an interactive mode.

RDouglasSharp avatar Jun 20 '23 00:06 RDouglasSharp

Can we make it conversation-style, so that it remembers the context from previous prompts? That would be more helpful.

Harsh-raj avatar Oct 27 '23 05:10 Harsh-raj