
llama-cpp-python support

Open · Maximilian-Winter opened this issue 2 years ago · 36 comments

I have added llama-cpp-python support. I also created an example notebook on how to use it!

Maximilian-Winter avatar May 20 '23 08:05 Maximilian-Winter

@microsoft-github-policy-service agree

Maximilian-Winter avatar May 20 '23 08:05 Maximilian-Winter

Thank you!

alxspiker avatar May 20 '23 18:05 alxspiker

@alxspiker I found a couple of problems with my implementation and am fixing them right now!

Maximilian-Winter avatar May 20 '23 18:05 Maximilian-Winter

Any way to support mmap? Seems like it's not supported.

alxspiker avatar May 20 '23 18:05 alxspiker

llama_print_timings:        load time =  4772.70 ms
llama_print_timings:      sample time =     3.01 ms /     1 runs   (    3.01 ms per run)
llama_print_timings: prompt eval time = 11246.46 ms /    23 tokens (  488.98 ms per token)
llama_print_timings:        eval time =     0.00 ms /     1 runs   (    0.00 ms per run)
llama_print_timings:       total time = 12235.80 ms
Traceback (most recent call last):
  File "C:\Users\Haley The Retard\Documents\GitHub\AI-X\guidance\_program_executor.py", line 94, in run
    await self.visit(self.parse_tree)
  File "C:\Users\Haley The Retard\Documents\GitHub\AI-X\guidance\_program_executor.py", line 428, in visit
    visited_children.append(await self.visit(child, inner_next_node, inner_next_next_node, inner_prev_node, node, parent_node))
  File "C:\Users\Haley The Retard\Documents\GitHub\AI-X\guidance\_program_executor.py", line 428, in visit
    visited_children.append(await self.visit(child, inner_next_node, inner_next_next_node, inner_prev_node, node, parent_node))
  File "C:\Users\Haley The Retard\Documents\GitHub\AI-X\guidance\_program_executor.py", line 395, in visit
    command_output = await command_function(*positional_args, **named_args)
  File "C:\Users\Haley The Retard\Documents\GitHub\AI-X\guidance\library\_select.py", line 158, in select
    option_logprobs = await recursive_select("")
  File "C:\Users\Haley The Retard\Documents\GitHub\AI-X\guidance\library\_select.py", line 131, in recursive_select
    sub_logprobs = await recursive_select(rec_prefix, allow_token_extension=False)
  File "C:\Users\Haley The Retard\Documents\GitHub\AI-X\guidance\library\_select.py", line 131, in recursive_select
    sub_logprobs = await recursive_select(rec_prefix, allow_token_extension=False)
  File "C:\Users\Haley The Retard\Documents\GitHub\AI-X\guidance\library\_select.py", line 131, in recursive_select
    sub_logprobs = await recursive_select(rec_prefix, allow_token_extension=False)
  [Previous line repeated 477 more times]
  File "C:\Users\Haley The Retard\Documents\GitHub\AI-X\guidance\library\_select.py", line 107, in recursive_select
    gen_obj = await parser.llm_session(
  File "C:\Users\Haley The Retard\Documents\GitHub\AI-X\guidance\llms\_llama_cpp.py", line 244, in __call__
    key = self._cache_key(locals())
  File "C:\Users\Haley The Retard\Documents\GitHub\AI-X\guidance\llms\_llm.py", line 76, in _cache_key
    key = self._gen_key(args_dict)
  File "C:\Users\Haley The Retard\Documents\GitHub\AI-X\guidance\llms\_llm.py", line 69, in _gen_key
    return "_---_".join([str(v) for v in ([args_dict[k] for k in var_names] + [self.llm.model_name, self.llm.__class__.__name__, self.llm.cache_version])])
  File "C:\Users\Haley The Retard\Documents\GitHub\AI-X\guidance\llms\_llm.py", line 69, in <listcomp>
    return "_---_".join([str(v) for v in ([args_dict[k] for k in var_names] + [self.llm.model_name, self.llm.__class__.__name__, self.llm.cache_version])])
RecursionError: maximum recursion depth exceeded while getting the repr of an object

Error in program:  maximum recursion depth exceeded while getting the repr of an object

alxspiker avatar May 20 '23 18:05 alxspiker

@alxspiker I have fixed all errors on my side. I couldn't reproduce your error, but I added mmap to the settings!
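
For reference, upstream llama-cpp-python exposes memory mapping through the use_mmap flag on llama_cpp.Llama; a minimal sketch (the model path is a placeholder, and how the fork's settings object surfaces this flag is an assumption):

    # Sketch: enabling mmap directly in llama-cpp-python.
    from llama_cpp import Llama

    llm = Llama(
        model_path="path/to/model.bin",  # placeholder path
        use_mmap=True,    # map the model file instead of reading it all into RAM
        use_mlock=False,  # optionally pin the mapped pages in RAM
    )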

Maximilian-Winter avatar May 20 '23 20:05 Maximilian-Winter

@alxspiker At the moment you have to use my fork of llama-cpp-python to use guidance. You will find the fork here: https://github.com/Maximilian-Winter/llama-cpp-python

Maximilian-Winter avatar May 20 '23 20:05 Maximilian-Winter

Related PR in llama-cpp-python: https://github.com/abetlen/llama-cpp-python/pull/252

It would be awesome to use guidance with llama.cpp! I'm excited :)

Mihaiii avatar May 20 '23 21:05 Mihaiii

@Maximilian-Winter this is great, thanks! It will probably be Monday before I can review it properly. Are there any basic unit tests we can add for this (with small LMs that don't slow down the test process too much)? ...might not be possible with LLaMA, but even a file with tests that only run locally would be good so we can make sure this stays working :)

slundberg avatar May 20 '23 22:05 slundberg

(I also just approved the unit tests to run for this)

slundberg avatar May 20 '23 22:05 slundberg

@slundberg I have added a test file in the tests/llms folder called "test_llamacpp.py". I used the test_transformers file as a template.

Maximilian-Winter avatar May 21 '23 00:05 Maximilian-Winter

After many attempts I could not get the chat roles to work.

I've used this code:

import re
import guidance

# define the model we will use

settings = guidance.llms.LlamaCppSettings()
settings.n_gpu_layers = 14
settings.n_threads = 16
settings.n_ctx = 1024
settings.use_mlock = True
settings.model = "path/to/model"
# Create a LlamaCpp instance and pass the settings to it.
llama = guidance.llms.LlamaCpp(settings=settings)
guidance.llm = llama
def parse_best(prosandcons, options):
    best = int(re.findall(r'Best=(\d+)', prosandcons)[0])
    return options[best]

create_plan = guidance('''
{{#system~}}
You are a helpful assistant.
{{~/system}}

{{! generate five potential ways to accomplish a goal }}
{{#block hidden=True}}
{{#user~}}
I want to {{goal}}.
{{~! generate potential options ~}}
Can you please generate one option for how to accomplish this?
Please make the option very short, at most one line.
{{~/user}}

{{#assistant~}}
{{gen 'options' n=5 temperature=1.0 max_tokens=500}}
{{~/assistant}}
{{/block}}

{{! generate pros and cons for each option and select the best option }}
{{#block hidden=True}}
{{#user~}}
I want to {{goal}}.

Can you please comment on the pros and cons of each of the following options, and then pick the best option?
---{{#each options}}
Option {{@index}}: {{this}}{{/each}}
---
Please discuss each option very briefly (one line for pros, one for cons), and end by saying Best=X, where X is the best option.
{{~/user}}

{{#assistant~}}
{{gen 'prosandcons' temperature=0.0 max_tokens=500}}
{{~/assistant}}
{{/block}}

{{! generate a plan to accomplish the chosen option }}
{{#user~}}
I want to {{goal}}.
{{~! Create a plan }}
Here is my plan:
{{parse_best prosandcons options}}
Please elaborate on this plan, and tell me how to best accomplish it.
{{~/user}}

{{#assistant~}}
{{gen 'plan' max_tokens=500}}
{{~/assistant}}''')
out = create_plan(
    goal='read more books',
    parse_best=parse_best # a custom python function we call in the program
)

DanielusG avatar May 21 '23 18:05 DanielusG

I appreciate your work, sir. Looking forward to trying this out. Would be amazing to run local ggml models with guidance.

sadaisystems avatar May 22 '23 11:05 sadaisystems

Thanks to @Maximilian-Winter for working on this! After reviewing where we are at right now, it looks like the llama-cpp-python package supports the caching we need for guidance acceleration (which is great!), but it does not support passing transformer-style logit processors. We need those to support logit_bias and pattern guides. We could get away without pattern guides for now, but logit_bias is important for select.

Looking through the code, it seems like all the needed changes are just in the Python package, not the C++ code. Are you (or someone else) able to work with the llama-cpp-python package to get the needed support?
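
For illustration, a transformers-style logits processor for logit_bias is essentially a callable that edits the logits before sampling; a minimal sketch (the class name and the 1-D scores layout are illustrative assumptions, not guidance's actual internals):

    # Sketch of a logit_bias processor: add a fixed bias to selected token ids.
    class BiasLogitsProcessor:
        def __init__(self, logit_bias):
            self.logit_bias = logit_bias  # dict mapping token id -> additive bias

        def __call__(self, input_ids, scores):
            # scores is assumed to be a 1-D array of logits over the vocabulary
            for token_id, bias in self.logit_bias.items():
                scores[token_id] += bias
            return scores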

Thanks again, and let me know if I missed something in my assessment above :)

slundberg avatar May 22 '23 22:05 slundberg

I have fixed most errors, except for token healing (it only produces garbage right now). Right now I have to use the original tokenizer from Hugging Face. Chat mode isn't available yet, but patterns, select, and stop should work! You can find a basic example in the notebook.

Maximilian-Winter avatar May 24 '23 20:05 Maximilian-Winter

I also implemented logit and stop processors in llama-cpp-python and am waiting for my pull request to be accepted. https://github.com/abetlen/llama-cpp-python/pull/271

Maximilian-Winter avatar May 24 '23 20:05 Maximilian-Winter

Thanks! Looking good. I'll take a pass once the llama-cpp-python PR merges.

slundberg avatar May 24 '23 20:05 slundberg

@slundberg I added a configurable chat mode implementation and added code as an example to the notebook.

Maximilian-Winter avatar May 24 '23 23:05 Maximilian-Winter

@slundberg My logit processor and stop criteria extension was merged into llama-cpp-python

Maximilian-Winter avatar May 26 '23 17:05 Maximilian-Winter

Excellent! I am working to push support for programmatic streaming; then I'll dig into this. Hope to get through it before the long weekend, but we shall see.

slundberg avatar May 26 '23 20:05 slundberg

@Maximilian-Winter I've published a new version of llama-cpp-python (v0.1.55) that includes the stopping_criteria and logits_processor, I've tested it with your branch and it works.

Just made two changes:


            from llama_cpp import LogitsProcessorList, StoppingCriteriaList
            last_token_str = ""
            processors = LogitsProcessorList()
            stoppers = StoppingCriteriaList()

and

                logits_processor=processors,
                stopping_criteria=stoppers,
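
For reference, a minimal end-to-end sketch of how those keyword arguments are consumed by llama-cpp-python v0.1.55 (the model path, token id, and stop threshold are placeholders, and it assumes Llama.__call__ forwards both arguments to create_completion):

    from llama_cpp import Llama, LogitsProcessorList, StoppingCriteriaList

    llm = Llama(model_path="path/to/model.bin")  # placeholder path

    def boost_token(input_ids, scores):
        scores[123] += 10.0  # bias an arbitrary token id upward
        return scores

    def stop_after_20(input_ids, scores):
        return len(input_ids) >= 20  # stop once the context holds 20 tokens

    out = llm(
        "The capital of France is",
        logits_processor=LogitsProcessorList([boost_token]),
        stopping_criteria=StoppingCriteriaList([stop_after_20]),
        max_tokens=32,
    )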

abetlen avatar May 26 '23 21:05 abetlen

@abetlen I have implemented it this way:

 logits_processor=LogitsProcessorList(processors),
 stopping_criteria=StoppingCriteriaList(stoppers),

Maximilian-Winter avatar May 27 '23 06:05 Maximilian-Winter

> @abetlen I have implemented it this way:
>
>  logits_processor=LogitsProcessorList(processors),
>  stopping_criteria=StoppingCriteriaList(stoppers),

That works too, as LogitsProcessorList just extends list.

abetlen avatar May 27 '23 08:05 abetlen

@slundberg Tests are failing because llama-cpp-python is not installed. I wasn't sure about adding it as a dependency. How should I handle that, and the need for a model file to test with?

Maximilian-Winter avatar May 27 '23 09:05 Maximilian-Winter

We can add the dependency to the test section of setup.py. As for a model download, we can try to find a really small model to test with and then download it in the test setup if not already present. Alternatively, we can test just locally and skip the tests when run on GitHub runners (like already happens for OpenAI models).
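
A minimal sketch of that skip-when-no-local-model pattern (the environment variable and the LlamaCpp kwargs are illustrative assumptions):

    # Sketch: skip the llama.cpp tests unless a local model file is available.
    import os

    import guidance
    import pytest

    MODEL_PATH = os.environ.get("LLAMA_TEST_MODEL", "")  # hypothetical env var

    @pytest.mark.skipif(not os.path.isfile(MODEL_PATH),
                        reason="no local llama.cpp model available")
    def test_basic_generation():
        guidance.llm = guidance.llms.LlamaCpp(model=MODEL_PATH)
        program = guidance("Count to three: {{gen 'nums' max_tokens=16}}")
        out = program()
        assert len(out["nums"]) > 0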

Great work! I went through everything and made a few changes; see what you think (I am about to push). Here are some highlights:

  1. I changed the settings to just be kwargs instead of a separate type. That aligns better with the other LLM objects and I think leads to cleaner code for most setups (you can always define a dict first if you want and then pass it as kwargs; see the sketch after this list).
  2. I changed tokenizer_name to tokenizer so people can pass objects as well as strings. However, I am not sure whether, long term, we can get the tokenizer from llama-cpp-python directly. Is that possible?
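
A sketch of the kwargs-style construction described in point 1 (argument names mirror the LlamaCppSettings example earlier in this thread; the path is a placeholder):

    import guidance

    # Same options as the earlier settings example, passed directly as kwargs.
    llama = guidance.llms.LlamaCpp(
        model="path/to/model",  # placeholder path
        n_gpu_layers=14,
        n_threads=16,
        n_ctx=1024,
        use_mlock=True,
    )
    guidance.llm = llama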

One error that seems to remain: when we turn on streaming for the generation call, I get the following error, which seems to imply an API difference with Hugging Face (note we can avoid this error when silent=True, since then we don't stream by default). Any thoughts?

    TypeError: create_completion() got an unexpected keyword argument 'streamer'

Another error is that llm.role_end (and role_start) are meant to be callable, so right now the chat example fails. Should be a fairly easy fix, I think.

Thanks!!

slundberg avatar May 27 '23 22:05 slundberg

@slundberg I have implemented a proper role_end again and also implemented streaming support.

Maximilian-Winter avatar May 28 '23 13:05 Maximilian-Winter

@slundberg I think the best way would be to test just locally. The smallest model right now is a 7B-parameter model, which is already 3.8 GB of memory.

Maximilian-Winter avatar May 29 '23 03:05 Maximilian-Winter

Just a note here: I was still getting some tokenization issues and realized it is going to be hard to maintain so much similar code between transformers and llamacpp, so I am going to try to push a proposal to share more code tonight.

slundberg avatar May 29 '23 21:05 slundberg

I pushed a proposal in the form of LlamaCpp2, along with lots of updates to Transformers that are related because we will want to depend on them. I think we need to inherit from the Transformers LLM class because otherwise we duplicate lots of code that is tricky and should only live in one place :)

LlamaCpp2 does not work fully yet, but I am pushing it to see what you think, @Maximilian-Winter.

thanks again for all the hard work pushing on this :)

slundberg avatar May 31 '23 05:05 slundberg

@slundberg Will take a look later today

Maximilian-Winter avatar May 31 '23 07:05 Maximilian-Winter