llamacpp support
Tested with:
```python
import dspy
from llama_cpp import Llama

llm = Llama.from_pretrained(
    repo_id="TheBloke/OpenHermes-2.5-Mistral-7B-GGUF",
    filename="openhermes-2.5-mistral-7b.Q4_K_M.gguf",
    n_ctx=4096,
    n_gpu_layers=10,
    verbose=True
)

llamalm = dspy.LlamaCpp(model="llama", llama_model=llm)
dspy.settings.configure(lm=llamalm)


def summarize_document(document):
    summarize = dspy.ChainOfThought('document -> summary')
    response = summarize(document=document)
    print(response.summary)


if __name__ == "__main__":
    summarize_document("""The 21-year-old made seven appearances for the Hammers and netted his only goal for them in a Europa League qualification round match against Andorran side FC Lustrains last season. Lee had two loan spells in League One last term, with Blackpool and then Colchester United. He scored twice for the U's but was unable to save them from relegation. The length of Lee's contract with the promoted Tykes has not been revealed. Find all the latest football transfers on our dedicated page.""")
```
I'm curious what sort of results you got?
I tried with the above and llama-3-8b instruct:

> Document: The 2019-2020 season was marked by the COVID-19 pandemic. The pandemic caused widespread disruption to sports events around the world, including professional soccer matches in England. Many leagues were suspended or postponed due to government restrictions on

But when using llama-cpp directly with the same model and a "Summarize the above" prompt suffix:

> The 21-year-old player, Lee, signed a contract with Barnsley FC (Tykes), a newly-promoted team. He previously played for West Ham United (Hammers) and had loan spells at Blackpool and Colchester United in League One last season.

I suspect the settings configuration needs to include the llama-3 instruct prompt formatting?
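One thing that might be worth trying, assuming the wrapper ends up going through llama-cpp-python's chat handler rather than plain completion (and using a placeholder instruct GGUF repo/filename here), is telling llama-cpp-python which template to apply when constructing the model:

```python
from llama_cpp import Llama

# Sketch only: repo_id/filename are placeholders for an instruct GGUF build.
# chat_format tells llama-cpp-python to apply the llama-3 instruct template
# when chat completions are used; if the DSPy wrapper only issues plain
# completions, the template would have to be applied to the prompt manually.
llm = Llama.from_pretrained(
    repo_id="QuantFactory/Meta-Llama-3-8B-Instruct-GGUF",
    filename="Meta-Llama-3-8B-Instruct.Q4_K_M.gguf",
    n_ctx=8192,
    n_gpu_layers=-1,
    chat_format="llama-3",
    verbose=False,
)
```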
@randerzander I'm using it successfully in production with llama 3 (not an instruct variant, though) with typed output:
```python
# CharacterInformation, reasonable_character_information, dataset, and premise
# are defined elsewhere; the final lines run inside a function that returns the
# serialized result.
llm = Llama.from_pretrained(
    repo_id="QuantFactory/Meta-Llama-3-8B-GGUF",
    filename="Meta-Llama-3-8B.Q5_K_M.gguf",
    n_ctx=8192,
    n_gpu_layers=-1,
    verbose=False
)

lm = dspy.LlamaCpp(
    model="llama",
    llama_model=llm,
    max_tokens=4096,
    temperature=0.1
)
dspy.settings.configure(lm=lm)

optimizer_compiled_path = "/location/of/my/optimized/programs.json"

program = CharacterInformation()
if os.path.exists(optimizer_compiled_path):
    program.load(path=optimizer_compiled_path)
else:
    optimizer = BootstrapFewShot(
        metric=reasonable_character_information,
        max_bootstrapped_demos=10,
        max_labeled_demos=30,
        max_rounds=50,
        max_errors=100,
    )
    program = optimizer.compile(CharacterInformation(), trainset=dataset)
    program.save(optimizer_compiled_path)

result = program(premise)
return json.dumps([model.model_dump() for model in result.answer.characters])
```
This is running on an in-house server using Tesla P100s for inference.
If the wrapper drops `kwargs["n"]`, it should call llama.cpp multiple times to generate the requested number of completions. Some of the optimizers expect multiple completions. You could potentially speed this up a bit by saving the LLM state and restoring it rather than reprocessing the prompt.
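A rough sketch of that idea inside the wrapper's request method (the method name and the `self.llm` attribute are assumptions about the PR's code, not quotes from it):

```python
def basic_request(self, prompt, **kwargs):
    # "n" is not a llama.cpp sampling parameter, so drop it and call the
    # model once per requested completion instead of forwarding it.
    n = kwargs.pop("n", 1)
    completions = []
    for _ in range(n):
        result = self.llm(prompt, **kwargs)
        completions.append(result["choices"][0]["text"])
    # Possible speed-up (not shown): snapshot the model state with
    # self.llm.save_state() after the prompt is processed and
    # self.llm.load_state(...) before each generation, so the prompt
    # isn't re-evaluated every iteration.
    return completions
```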
How does speed and accuracy compare to starting the native llama.cpp server and calling it from DSPy’s OpenAI interface?
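For reference, that comparison would be against something like the following, pointing DSPy's OpenAI client at a locally running llama.cpp server (the port, model name, and whether your DSPy version's OpenAI wrapper accepts `api_base` directly are assumptions to verify):

```python
import dspy

# Assumes a llama.cpp server is already running locally and exposing its
# OpenAI-compatible endpoint, e.g. at http://localhost:8080/v1/.
lm = dspy.OpenAI(
    model="llama-3-8b",                    # placeholder model name
    api_base="http://localhost:8080/v1/",
    api_key="sk-no-key-required",          # local servers ignore the key
)
dspy.settings.configure(lm=lm)
```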
Thanks for the PR @codysnider! Left a few comments.

Could you run `ruff check . --fix-only` to fix the failing test and push again?

The PR is also failing the build tests since you need to add `import llama_cpp` in the import try-except block.
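That is, something along these lines, following the guarded-import pattern used for the other optional backends (the exact placement in the module is an assumption):

```python
try:
    import llama_cpp
except ImportError:
    llama_cpp = None  # surface a helpful error later if LlamaCpp is used without it
```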
Could you also add documentation for the LlamaCpp LM to provide context, as done for the LMs in our documentation here? It would be great to add the example you've tested as well!
Hi @codysnider @randerzander, just following up on this PR to resolve the pending comments so we can merge it soon!
Closed by #1347