langserve with Llamacpp
I am trying to use langserve with langchain llamacpp like this (chain.py):
import logging

import torch
from langchain.chains import RetrievalQA, ConversationalRetrievalChain
from langchain.embeddings import HuggingFaceInstructEmbeddings
from langchain.llms import LlamaCpp
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain
from langchain.callbacks.manager import CallbackManager
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
from prompt_template_utils import get_prompt_template
from langchain.vectorstores import Chroma, FAISS
from werkzeug.utils import secure_filename
from constants import CHROMA_SETTINGS, EMBEDDING_MODEL_NAME, PERSIST_DIRECTORY, MODEL_ID, MODEL_BASENAME

DEVICE_TYPE = "cuda" if torch.cuda.is_available() else "cpu"
SHOW_SOURCES = True
logging.info(f"Running on: {DEVICE_TYPE}")
logging.info(f"Display Source Documents set to: {SHOW_SOURCES}")

MODEL_PATH = '../llama-2-7b-32k-instruct.Q6_K.gguf'


def load_model():
    callback_manager = CallbackManager([StreamingStdOutCallbackHandler()])
    n_gpu_layers = 30  # Change this value based on your model and your GPU VRAM pool.
    n_batch = 512  # Should be between 1 and n_ctx, consider the amount of VRAM in your GPU.
    llm = LlamaCpp(
        model_path=MODEL_PATH,
        n_ctx=8192,
        n_gpu_layers=n_gpu_layers,
        n_batch=n_batch,
        repeat_penalty=1,
        temperature=0.2,
        max_tokens=100,
        top_p=0.9,
        top_k=50,
        rope_freq_scale=0.125,
        stop=["[INST]"],
        # callback_manager=callback_manager,
        streaming=True,
        verbose=True,  # Verbose is required to pass to the callback manager
    )
    return llm


EMBEDDINGS = HuggingFaceInstructEmbeddings(
    model_name=EMBEDDING_MODEL_NAME,
    model_kwargs={"device": DEVICE_TYPE},
)
DB = FAISS.load_local(PERSIST_DIRECTORY, EMBEDDINGS)
RETRIEVER = DB.as_retriever(search_kwargs={'k': 16})

template = """\
[INST] <<SYS>>
{context}
<</SYS>>
[/INST]
[INST]{question}[/INST]
"""

LLM = load_model()  # load_model(device_type=DEVICE_TYPE, model_id=MODEL_ID, model_basename=MODEL_BASENAME)
QA_CHAIN_PROMPT = PromptTemplate.from_template(template)
QA = ConversationalRetrievalChain.from_llm(
    LLM,
    retriever=RETRIEVER,
    combine_docs_chain_kwargs={"prompt": QA_CHAIN_PROMPT},
)
The app.py file has this:
from fastapi import FastAPI
from langserve import add_routes
from chain import QA
app = FastAPI(title="Retrieval App")
add_routes(app, QA)
if __name__ == "__main__":
    import uvicorn

    uvicorn.run(app, host="0.0.0.0", port=8000)
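For reference, a minimal sketch of a client that calls the /stream route langserve adds at the root path. RemoteRunnable is langserve's client class; the input keys follow ConversationalRetrievalChain's expected schema ("question" and "chat_history"), and the file name and question are illustrative, not taken from the thread:

# stream_client.py: hypothetical client for the langserve routes above.
import asyncio

from langserve import RemoteRunnable

chain = RemoteRunnable("http://localhost:8000/")


async def main() -> None:
    # For a chain, each streamed chunk arrives as a dict of partial outputs.
    async for chunk in chain.astream(
        {"question": "What is in the documents?", "chat_history": []}
    ):
        print(chunk, flush=True)


if __name__ == "__main__":
    asyncio.run(main())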
However, when running the /stream route, I get this warning:
INFO: Started server process [491117]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
INFO: 127.0.0.1:48024 - "GET /docs HTTP/1.1" 200 OK
INFO: 127.0.0.1:48024 - "GET /openapi.json HTTP/1.1" 200 OK
INFO: 127.0.0.1:53430 - "POST /stream HTTP/1.1" 200 OK
/home/zeeshan/miniconda3/envs/llama/lib/python3.10/site-packages/langchain/llms/llamacpp.py:352: RuntimeWarning: coroutine 'AsyncCallbackManagerForLLMRun.on_llm_new_token' was never awaited
run_manager.on_llm_new_token(
RuntimeWarning: Enable tracemalloc to get the object allocation traceback
And the output does not appear to stream: the response is returned all at once rather than token by token. Does langserve even support langchain llamacpp, or am I doing something wrong?
The runtime warning points to this being async-related:
llamacpp.py:352: RuntimeWarning: coroutine 'AsyncCallbackManagerForLLMRun.on_llm_new_token' was never awaited
I'll take a look in a bit -- if I had to bet, it's either a bug in langchain or an issue with the streaming implementation in llamacpp.
Yes, I have been trying to implement langchain llamacpp streaming through FastAPI, and all the solutions I have tried so far give the same runtime warning without producing streaming tokens as output.
Heya! Sorry, I didn't manage to tackle this today.
Wondering if you could double-check this:
Remove these:
# callback_manager=callback_manager,
streaming=True,
verbose=True, # Verbose is required to pass to the callback manager
And could you check whether the async version of streaming is working (that's what the server is using)?
async for chunk in model.astream(prompt):
    print(chunk.content, end="", flush=True)
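For reference, a minimal self-contained way to run that check outside the server, assuming the chain.py above. Note that LlamaCpp is a completion-style LLM, so astream() yields plain string chunks rather than message objects with a .content attribute; the file name and test prompt below are illustrative only:

# check_astream.py: quick async streaming check (hypothetical test prompt).
import asyncio

from chain import LLM


async def main() -> None:
    # If async streaming works, tokens should appear one by one.
    async for chunk in LLM.astream("[INST]Say hello in five words.[/INST]"):
        print(chunk, end="", flush=True)
    print()


if __name__ == "__main__":
    asyncio.run(main())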
After removing those lines, I still get no streaming output.
As for this:
async for chunk in model.astream(prompt):
    print(chunk.content, end="", flush=True)
I still get the response printed all at once, without streaming tokens.
If this is not working, then it's not a langserve issue, but an issue with the underlying model implementation. I'll transfer the issue to langchain.
Yes, it is related to langchain's llamacpp implementation.
I could make llamacpp work with langserve by applying #9177 (or #10908) and adding the parameter chunk=chunk to run_manager.on_llm_new_token() in _astream().
Can you share your complete solution? I am also having difficulties with streaming llamacpp on langserve or FastAPI.
@weissenbacherpwc You will need the following changes:
- adding an async streaming call for LlamaCpp (#9177, but it's old and conflicts with the latest master)
- adding the chunk parameter to run_manager.on_llm_new_token()
Because #9177 won't be merged and langchain's llamacpp.py has been moved to langchain_community, I made the changes on my branch. Please merge it if you need it.
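For context, a rough sketch of the two changes described above, not the merged implementation: an _astream method for LlamaCpp modeled on the existing sync _stream (self._get_parameters and self.client come from that implementation), which awaits the async callback manager and passes chunk=chunk:

from typing import Any, AsyncIterator, List, Optional

from langchain_core.callbacks import AsyncCallbackManagerForLLMRun
from langchain_core.outputs import GenerationChunk


async def _astream(
    self,
    prompt: str,
    stop: Optional[List[str]] = None,
    run_manager: Optional[AsyncCallbackManagerForLLMRun] = None,
    **kwargs: Any,
) -> AsyncIterator[GenerationChunk]:
    params = {**self._get_parameters(stop), **kwargs}
    # Note: this iterates the blocking llama-cpp-python stream directly;
    # a production version would likely offload it to a thread or executor.
    result = self.client(prompt=prompt, stream=True, **params)
    for part in result:
        chunk = GenerationChunk(text=part["choices"][0]["text"])
        if run_manager:
            # The two changes discussed in this thread: await the async
            # callback manager and pass chunk=chunk so downstream consumers
            # (e.g. langserve's /stream) receive token-by-token chunks.
            await run_manager.on_llm_new_token(
                token=chunk.text, chunk=chunk, verbose=self.verbose
            )
        yield chunk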
Thanks @akionux! Does this only work with langserve, or could it also work with plain FastAPI?
I checked that the langserve playground works, so it may work with FastAPI as well.
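For anyone who wants to skip langserve entirely, a rough sketch of a plain FastAPI endpoint that streams tokens using langchain's AsyncIteratorCallbackHandler. It assumes the QA chain from chain.py above plus the patched async LlamaCpp streaming; the endpoint path and request model are illustrative only:

import asyncio

from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from langchain.callbacks import AsyncIteratorCallbackHandler
from pydantic import BaseModel

from chain import QA

app = FastAPI(title="Retrieval App (plain FastAPI)")


class Question(BaseModel):
    question: str
    chat_history: list = []


@app.post("/stream")
async def stream(body: Question):
    handler = AsyncIteratorCallbackHandler()

    # Run the chain in the background; every new LLM token is pushed onto
    # the handler's internal queue via on_llm_new_token.
    task = asyncio.create_task(
        QA.ainvoke(
            {"question": body.question, "chat_history": body.chat_history},
            config={"callbacks": [handler]},
        )
    )

    async def token_generator():
        async for token in handler.aiter():
            yield token
        await task  # surface any exception raised by the chain

    return StreamingResponse(token_generator(), media_type="text/plain")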