llama-cpp-python support
I have added llama-cpp-python support. I also created an example notebook on how to use it!
@microsoft-github-policy-service agree
Thank you!
@alxspiker I found a couple of problems with my implementation and am fixing them right now!
Any way to support mmap? It seems like it's not supported.
llama_print_timings: load time = 4772.70 ms
llama_print_timings: sample time = 3.01 ms / 1 runs ( 3.01 ms per run)
llama_print_timings: prompt eval time = 11246.46 ms / 23 tokens ( 488.98 ms per token)
llama_print_timings: eval time = 0.00 ms / 1 runs ( 0.00 ms per run)
llama_print_timings: total time = 12235.80 ms
Traceback (most recent call last):
File "C:\Users\Haley The Retard\Documents\GitHub\AI-X\guidance\_program_executor.py", line 94, in run
await self.visit(self.parse_tree)
File "C:\Users\Haley The Retard\Documents\GitHub\AI-X\guidance\_program_executor.py", line 428, in visit
visited_children.append(await self.visit(child, inner_next_node, inner_next_next_node, inner_prev_node, node, parent_node))
File "C:\Users\Haley The Retard\Documents\GitHub\AI-X\guidance\_program_executor.py", line 428, in visit
visited_children.append(await self.visit(child, inner_next_node, inner_next_next_node, inner_prev_node, node, parent_node))
File "C:\Users\Haley The Retard\Documents\GitHub\AI-X\guidance\_program_executor.py", line 395, in visit
command_output = await command_function(*positional_args, **named_args)
File "C:\Users\Haley The Retard\Documents\GitHub\AI-X\guidance\library\_select.py", line 158, in select
option_logprobs = await recursive_select("")
File "C:\Users\Haley The Retard\Documents\GitHub\AI-X\guidance\library\_select.py", line 131, in recursive_select
sub_logprobs = await recursive_select(rec_prefix, allow_token_extension=False)
File "C:\Users\Haley The Retard\Documents\GitHub\AI-X\guidance\library\_select.py", line 131, in recursive_select
sub_logprobs = await recursive_select(rec_prefix, allow_token_extension=False)
File "C:\Users\Haley The Retard\Documents\GitHub\AI-X\guidance\library\_select.py", line 131, in recursive_select
sub_logprobs = await recursive_select(rec_prefix, allow_token_extension=False)
[Previous line repeated 477 more times]
File "C:\Users\Haley The Retard\Documents\GitHub\AI-X\guidance\library\_select.py", line 107, in recursive_select
gen_obj = await parser.llm_session(
File "C:\Users\Haley The Retard\Documents\GitHub\AI-X\guidance\llms\_llama_cpp.py", line 244, in __call__
key = self._cache_key(locals())
File "C:\Users\Haley The Retard\Documents\GitHub\AI-X\guidance\llms\_llm.py", line 76, in _cache_key
key = self._gen_key(args_dict)
File "C:\Users\Haley The Retard\Documents\GitHub\AI-X\guidance\llms\_llm.py", line 69, in _gen_key
return "_---_".join([str(v) for v in ([args_dict[k] for k in var_names] + [self.llm.model_name, self.llm.__class__.__name__, self.llm.cache_version])])
File "C:\Users\Haley The Retard\Documents\GitHub\AI-X\guidance\llms\_llm.py", line 69, in <listcomp>
return "_---_".join([str(v) for v in ([args_dict[k] for k in var_names] + [self.llm.model_name, self.llm.__class__.__name__, self.llm.cache_version])])
RecursionError: maximum recursion depth exceeded while getting the repr of an object
Error in program: maximum recursion depth exceeded while getting the repr of an object
@alxspiker I have fixed all the errors on my side. I couldn't reproduce your error, but I added mmap to the settings!
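Roughly, enabling it would look like this (a sketch; I am assuming the attribute is named use_mmap and is forwarded to llama_cpp.Llama, so the exact name may differ):
import guidance

# Sketch: enable memory-mapped model loading via the settings object.
# Assumes the attribute is called use_mmap and is passed through to llama_cpp.Llama.
settings = guidance.llms.LlamaCppSettings()
settings.model = "path/to/model"
settings.use_mmap = True  # map the model file instead of reading it fully into RAM
llama = guidance.llms.LlamaCpp(settings=settings)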
@alxspiker At the moment you have to use my fork of llama-cpp-python to use guidance. You will find the fork here: https://github.com/Maximilian-Winter/llama-cpp-python
Related PR in llama-cpp-python: https://github.com/abetlen/llama-cpp-python/pull/252
It would be awesome to use guidance with llama.cpp! I'm excited :)
@Maximilian-Winter this is great, thanks! It will probably be Monday before I can review it properly. Are there any basic unit tests we can add for this? (with small LMs that don't slow down the test process too much) ...might not be possible with LLaMA, but even a file with tests that only run locally would be good so we can make sure this stays working :)
(I also just approved the unit tests to run for this)
@slundberg I have added a test file in the tests/llms folder called "test_llamacpp.py". I used the test_transformers file as a template.
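In rough outline it does something like this (a sketch, not the exact file contents; the model path and prompt are placeholders):
import guidance

def test_llamacpp_gen():
    # Sketch of a basic generation test modeled on test_transformers.py.
    # The model path is a local placeholder for a small ggml model.
    settings = guidance.llms.LlamaCppSettings()
    settings.model = "path/to/small-model.bin"
    llm = guidance.llms.LlamaCpp(settings=settings)
    program = guidance("The capital of France is {{gen 'answer' max_tokens=5}}", llm=llm)
    out = program()
    assert len(out["answer"]) > 0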
After many attempts I could not get the role-based chat to work.
I've used this code:
import re
import guidance
# define the model we will use
settings = guidance.llms.LlamaCppSettings()
settings.n_gpu_layers = 14
settings.n_threads = 16
settings.n_ctx = 1024
settings.use_mlock = True
settings.model = "path/to/model"
# Create a LlamaCpp instance and pass the settings to it.
llama = guidance.llms.LlamaCpp(settings=settings)
guidance.llm = llama
def parse_best(prosandcons, options):
    best = int(re.findall(r'Best=(\d+)', prosandcons)[0])
    return options[best]
create_plan = guidance('''
{{#system~}}
You are a helpful assistant.
{{~/system}}
{{! generate five potential ways to accomplish a goal }}
{{#block hidden=True}}
{{#user~}}
I want to {{goal}}.
{{~! generate potential options ~}}
Can you please generate one option for how to accomplish this?
Please make the option very short, at most one line.
{{~/user}}
{{#assistant~}}
{{gen 'options' n=5 temperature=1.0 max_tokens=500}}
{{~/assistant}}
{{/block}}
{{! generate pros and cons for each option and select the best option }}
{{#block hidden=True}}
{{#user~}}
I want to {{goal}}.
Can you please comment on the pros and cons of each of the following options, and then pick the best option?
---{{#each options}}
Option {{@index}}: {{this}}{{/each}}
---
Please discuss each option very briefly (one line for pros, one for cons), and end by saying Best=X, where X is the best option.
{{~/user}}
{{#assistant~}}
{{gen 'prosandcons' temperature=0.0 max_tokens=500}}
{{~/assistant}}
{{/block}}
{{! generate a plan to accomplish the chosen option }}
{{#user~}}
I want to {{goal}}.
{{~! Create a plan }}
Here is my plan:
{{parse_best prosandcons options}}
Please elaborate on this plan, and tell me how to best accomplish it.
{{~/user}}
{{#assistant~}}
{{gen 'plan' max_tokens=500}}
{{~/assistant}}''')
out = create_plan(
    goal='read more books',
    parse_best=parse_best  # a custom python function we call in the program
)
I appreciate your work, sir. Looking forward to trying this out. Would be amazing to run local ggml models with guidance.
Thanks to @Maximilian-Winter for working on this! After reviewing where we are right now, it looks like the llama-cpp-python package supports the caching we need for guidance acceleration (which is great!); however, it does not support passing transformer-style logit processors. We need that to support logit_bias and pattern guides. We could get away without pattern guides for now, but logit_bias is important for select.
Looking through the code it seems like all the needed changes are just in the python package, not the C++ code. Are you (or someone else) able to work with the llama-cpp-python package to get the needed support?
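For concreteness, a transformer-style logits processor is just a callable that takes the token ids generated so far plus the raw next-token logits and returns modified logits; applying logit_bias looks roughly like this (the class name is illustrative, not an existing llama-cpp-python API):
class LogitBiasProcessor:
    # Illustrative transformers-style logits processor: adds a fixed bias to
    # selected token ids before sampling, which is what select needs.
    def __init__(self, logit_bias):
        self.logit_bias = logit_bias  # dict mapping token_id -> bias value

    def __call__(self, input_ids, scores):
        # input_ids: token ids generated so far; scores: logits for the next token
        for token_id, bias in self.logit_bias.items():
            scores[token_id] += bias
        return scores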
Thanks again, and let me know if I missed something in my assessment above :)
I have fixed most of the errors, except token healing (it only produces garbage right now). For now I have to use the original tokenizer from Hugging Face. Chat mode isn't available yet, but patterns, select, and stop should work! You can find a basic example in the notebook.
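Roughly, the kind of template that should now work looks like this (a minimal sketch in the spirit of the notebook example; the prompt, options, and pattern are illustrative):
import guidance

# Assumes guidance.llm has already been set to a LlamaCpp instance.
program = guidance("""Is the following review positive or negative? "{{review}}"
Answer: {{select 'sentiment' options=labels}}
One-sentence reason: {{gen 'reason' stop='.' max_tokens=50}}
Score (1-5): {{gen 'score' pattern='[1-5]'}}""")

out = program(review="I love this library!", labels=["positive", "negative"])
print(out["sentiment"], out["score"])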
I also implemented logit and stop processors in llama-cpp-python and am waiting for my pull request to be accepted. https://github.com/abetlen/llama-cpp-python/pull/271
Thanks! Looking good. I'll take a pass once the PR for llama cpp merges
@slundberg I added a configurable chat mode implementation and added code as an example to the notebook.
@slundberg My logit processor and stop criteria extension was merged into llama-cpp-python
Excellent! I am working to push support for programmatic streaming, then I'll dig into this. Hope to get through it before the long weekend, but we shall see.
@Maximilian-Winter I've published a new version of llama-cpp-python (v0.1.55) that includes the stopping_criteria and logits_processor parameters. I've tested it with your branch and it works.
Just made two changes
from llama_cpp import LogitsProcessorList, StoppingCriteriaList
last_token_str = ""
processors = LogitsProcessorList()
stoppers = StoppingCriteriaList()
and
logits_processor=processors,
stopping_criteria=stoppers,
@abetlen I have implemented it this way:
logits_processor=LogitsProcessorList(processors),
stopping_criteria=StoppingCriteriaList(stoppers),
That works too, as LogitsProcessorList just extends list.
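Putting the snippets together, the call into llama-cpp-python ends up looking roughly like this (a sketch; the model path, prompt, and processor contents are placeholders):
from llama_cpp import Llama, LogitsProcessorList, StoppingCriteriaList

llm = Llama(model_path="path/to/model.bin")  # placeholder path

processors = LogitsProcessorList()  # guidance appends its logit_bias / pattern processors here
stoppers = StoppingCriteriaList()   # guidance appends its stop-sequence criteria here

completion = llm.create_completion(
    "The capital of France is",
    max_tokens=16,
    logits_processor=processors,
    stopping_criteria=stoppers,
)
print(completion["choices"][0]["text"])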
@slundberg Tests are failing because llama-cpp-python is not installed. I wasn't sure about adding it as a dependency. How should I handle that, and the need for a model to test against?
We can add the dependency to the test section of setup.py. As for a model, we can try to find a really small one to test with and then download it in the test setup if it's not already present. Alternatively, we can test just locally and skip the tests when run on GitHub runners (like already happens for the OpenAI models).
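For the skip-on-CI option, the same pattern as the OpenAI tests applies; roughly (a sketch, the environment variable name is illustrative):
import os
import pytest

# Illustrative: skip the llama.cpp tests when no local ggml model is available
# (e.g. on GitHub runners), mirroring how the OpenAI tests are skipped.
MODEL_PATH = os.environ.get("LLAMACPP_TEST_MODEL", "")  # hypothetical env var

@pytest.mark.skipif(not os.path.exists(MODEL_PATH), reason="no local llama.cpp model available")
def test_select():
    ...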
Great work! I went through everything and made a few changes; see what you think (I am about to push). Here are some highlights:
- I changed the settings to just be kwargs instead of a separate type. That aligns better with the other LLM objects and I think leads to cleaner code for most setups (you can always define a dict first if you want and then pass it as kwargs); see the sketch after this list.
- I changed tokenizer_name to tokenizer so people can pass objects as well as strings. However, I am not sure if, long term, we can get the tokenizer from llama-cpp-python directly. Is that possible?
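To illustrate the first point, construction now looks roughly like this (a sketch assuming the extra kwargs are forwarded to llama_cpp.Llama; the parameter names just mirror the settings used earlier in the thread):
import guidance

# Sketch of the kwargs-style constructor, replacing the separate LlamaCppSettings object.
llama = guidance.llms.LlamaCpp(
    model="path/to/model",
    n_gpu_layers=14,
    n_threads=16,
    n_ctx=1024,
    use_mlock=True,
)
guidance.llm = llama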
One error that seems to remain: when we turn on streaming for the generation call I get the following error, which seems to imply an API difference with Hugging Face. Any thoughts? (Note that we can avoid this error when silent=True, since then we don't stream by default.)
TypeError: create_completion() got an unexpected keyword argument 'streamer'
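The difference is that transformers' generate() takes a streamer object, while llama-cpp-python streams by returning a generator when stream=True; roughly like this on the llama-cpp-python side (a sketch, with a placeholder model path and prompt):
from llama_cpp import Llama

llm = Llama(model_path="path/to/model.bin")  # placeholder path

# llama-cpp-python has no `streamer` kwarg; stream=True instead turns the call
# into a generator that yields partial completions chunk by chunk.
for chunk in llm.create_completion("The capital of France is", max_tokens=16, stream=True):
    print(chunk["choices"][0]["text"], end="", flush=True)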
Another error is that llm.role_end (and role_start) are meant to be callable, so right now that chat example fails. Should be a fairly easy fix, I think.
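For reference, in the other LLM classes these are methods that take the role name and return the text wrapping that section of the chat prompt; roughly like this (a sketch, with placeholder markers for whatever chat format the model expects):
class LlamaCppChat:  # illustrative name
    # Sketch: role_start/role_end as callables that map a role name
    # ("system", "user", "assistant") to the model's chat-format markers.
    def role_start(self, role):
        return f"### {role.capitalize()}:\n"

    def role_end(self, role=None):
        return "\n"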
Thanks!!
@slundberg I have implemented a proper role_end again and also added streaming support.
@slundberg I think the best way would be to test just locally. The smallest model right now is a 7B-parameter model, which is already 3.8 GB.
Just a note here: I was still getting some tokenization issues and realized it is going to be hard to maintain so much similar code between transformers and llamacpp, so I am going to try to push a proposal to share more code tonight.
I pushed a proposal in the form of LlamaCpp2, along with lots of related updates to Transformers, since we will want to depend on them. I think we need to inherit from the Transformers LLM class because otherwise we duplicate lots of code that is tricky and should only live in one place :)
LlamaCpp2 does not fully work yet, but I am pushing it to see what you think, @Maximilian-Winter.
thanks again for all the hard work pushing on this :)
@slundberg Will take a look later today