
langserve with Llamacpp

Open hamaadtahiir opened this issue 1 year ago • 11 comments

I am trying to use langserve with langchain llamacpp like this (chain.py):

import logging

import torch

from langchain.chains import RetrievalQA, ConversationalRetrievalChain
from langchain.embeddings import HuggingFaceInstructEmbeddings
from langchain.llms import LlamaCpp
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain
from langchain.callbacks.manager import CallbackManager
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler


from prompt_template_utils import get_prompt_template


from langchain.vectorstores import Chroma, FAISS
from werkzeug.utils import secure_filename

from constants import CHROMA_SETTINGS, EMBEDDING_MODEL_NAME, PERSIST_DIRECTORY, MODEL_ID, MODEL_BASENAME

DEVICE_TYPE = "cuda" if torch.cuda.is_available() else "cpu"
SHOW_SOURCES = True
logging.info(f"Running on: {DEVICE_TYPE}")
logging.info(f"Display Source Documents set to: {SHOW_SOURCES}")

MODEL_PATH = '../llama-2-7b-32k-instruct.Q6_K.gguf'

def load_model():

    callback_manager = CallbackManager([StreamingStdOutCallbackHandler()])

    n_gpu_layers = 30  # Change this value based on your model and your GPU VRAM pool.
    n_batch = 512  # Should be between 1 and n_ctx, consider the amount of VRAM in your GPU.

    llm = LlamaCpp(
            model_path=MODEL_PATH,
            n_ctx=8192,
            n_gpu_layers=n_gpu_layers,
            n_batch=n_batch,
            repeat_penalty=1,
            temperature=0.2,
            max_tokens=100,
            top_p=0.9,
            top_k=50,
            rope_freq_scale=0.125,
            stop=["[INST]"],
            # callback_manager=callback_manager,
            streaming=True,
            verbose=True,  # Verbose is required to pass to the callback manager
            )

    return llm

EMBEDDINGS = HuggingFaceInstructEmbeddings(model_name=EMBEDDING_MODEL_NAME, 
                                            model_kwargs={"device": DEVICE_TYPE})

DB = FAISS.load_local(PERSIST_DIRECTORY, EMBEDDINGS)

RETRIEVER = DB.as_retriever(search_kwargs={'k': 16})

template = """\
[INST] <<SYS>>


{context}

<</SYS>>


[/INST]
[INST]{question}[/INST]
"""

LLM = load_model()  # load_model(device_type=DEVICE_TYPE, model_id=MODEL_ID, model_basename=MODEL_BASENAME)

QA_CHAIN_PROMPT = PromptTemplate.from_template(template)

QA = ConversationalRetrievalChain.from_llm(LLM, retriever=RETRIEVER,
                                           combine_docs_chain_kwargs={"prompt": QA_CHAIN_PROMPT})

The app.py file has this:


from fastapi import FastAPI
from langserve import add_routes

from chain import QA

app = FastAPI(title="Retrieval App")

add_routes(app, QA)

if __name__ == "__main__":
    import uvicorn

    uvicorn.run(app, host="0.0.0.0", port=8000)
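
For reference, a client-side way to exercise the /stream route is langserve's RemoteRunnable; below is a minimal sketch, assuming the server above is running locally and that the chain expects ConversationalRetrievalChain's default input keys ("question" and "chat_history"):

# Minimal client sketch for the /stream route, using langserve's
# RemoteRunnable.  The input keys assume ConversationalRetrievalChain's
# default schema ("question", "chat_history").
import asyncio

from langserve import RemoteRunnable

qa = RemoteRunnable("http://localhost:8000/")


async def main() -> None:
    async for chunk in qa.astream(
        {"question": "What do the indexed documents cover?", "chat_history": []}
    ):
        # Each chunk is whatever the chain yields for that step.
        print(chunk, flush=True)


asyncio.run(main())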

However, when running the /stream route, I get this warning:

INFO:     Started server process [491117]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
INFO:     127.0.0.1:48024 - "GET /docs HTTP/1.1" 200 OK
INFO:     127.0.0.1:48024 - "GET /openapi.json HTTP/1.1" 200 OK
INFO:     127.0.0.1:53430 - "POST /stream HTTP/1.1" 200 OK
/home/zeeshan/miniconda3/envs/llama/lib/python3.10/site-packages/langchain/llms/llamacpp.py:352: RuntimeWarning: coroutine 'AsyncCallbackManagerForLLMRun.on_llm_new_token' was never awaited
  run_manager.on_llm_new_token(
RuntimeWarning: Enable tracemalloc to get the object allocation traceback

And the output is like this:

[screenshot of the output]

This does not seem to be streaming output tokens. Does langserve even support langchain llamacpp, or am I doing something wrong?

hamaadtahiir avatar Oct 26 '23 10:10 hamaadtahiir

The runtime warning points to the fact that this is async-related:

llamacpp.py:352: RuntimeWarning: coroutine 'AsyncCallbackManagerForLLMRun.on_llm_new_token' was never awaited

I'll take a look in a bit -- if I had to bet, it's either a bug in langchain or an issue with llamacpp's streaming implementation.

eyurtsev avatar Oct 26 '23 13:10 eyurtsev

Yes, I have been trying to implement langchain llamacpp streaming through FastAPI, and all the solutions I have tried so far give the same runtime warning without streaming tokens in the output.

hamaadtahiir avatar Oct 26 '23 14:10 hamaadtahiir

Heya! Sorry, I didn't manage to tackle this today.

Wondering if you could double-check this:

Remove these:

            # callback_manager=callback_manager, 
            streaming=True,
            verbose=True, # Verbose is required to pass to the callback manager

And could you check what happens when you do the following?

This checks whether the async version of streaming is working (that's what the server is using):

async for chunk in model.astream(prompt):
    print(chunk.content, end="", flush=True)
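
For reference, a self-contained version of this check might look like the sketch below, assuming the LLM object from chain.py above. Note that LlamaCpp is a completion-style LLM rather than a chat model, so astream() yields plain string chunks (print the chunk itself rather than chunk.content):

# Minimal async streaming check, assuming the LLM object built in chain.py.
# LlamaCpp is a completion-style LLM, so astream() yields string chunks.
import asyncio

from chain import LLM


async def check_streaming() -> None:
    async for chunk in LLM.astream("Hello, how are you?"):
        print(chunk, end="", flush=True)


asyncio.run(check_streaming())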

eyurtsev avatar Oct 27 '23 02:10 eyurtsev

Remove these: still getting no streaming output. [screenshot of the output]

As for this:

async for chunk in model.astream(prompt):
    print(chunk.content, end="", flush=True)

I still get the response printed all at once, without streaming tokens.

hamaadtahiir avatar Oct 27 '23 06:10 hamaadtahiir

If this is not working, then it's not a langserve issue but an issue with the underlying model implementation. I'll transfer the issue to langchain.

async for chunk in model.astream(prompt): 
    print(chunk.content, end="", flush=True)

eyurtsev avatar Oct 27 '23 18:10 eyurtsev

Yes, it is related to langchain.llamacpp.

hamaadtahiir avatar Oct 27 '23 18:10 hamaadtahiir

I could make llamacpp work with langserve by applying #9177 (or #10908) and adding parameter chunk=chunk to run_manager.on_llm_new_token() in _astream().

akionux avatar Nov 09 '23 13:11 akionux

I could make llamacpp work with langserve by applying #9177 (or #10908) and adding parameter chunk=chunk to run_manager.on_llm_new_token() in _astream().

Can you share your complete solution? I am also having difficulties with streaming llamacpp on Langserve or FastAPI.

weissenbacherpwc avatar Jan 24 '24 18:01 weissenbacherpwc

I could make llamacpp work with langserve by applying #9177 (or #10908) and adding parameter chunk=chunk to run_manager.on_llm_new_token() in _astream().

Can you share your complete solution? I am also having difficulties with streaming llamacpp on Langserve or FastAPI.

@weissenbacherpwc You will need the following changes:

  • adding async streaming support for LlamaCpp (#9177, but it's old and conflicts with the latest master)
  • adding the chunk parameter to run_manager.on_llm_new_token() (a sketch follows this list).

Because #9177 won't be merged and langchain's llamacpp.py has been moved to langchain_community, I made the changes on my branch. Please merge it if you need it.
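
For illustration, here is a hedged sketch of the second change -- not the exact patch from #9177 or the branch above -- assuming langchain_core's callback and output types:

# Sketch only: an _astream override for LlamaCpp that awaits the async
# callback and passes chunk=chunk, so langserve's /stream and /astream
# routes receive per-token chunks.
from typing import Any, AsyncIterator, List, Optional

from langchain_core.callbacks import AsyncCallbackManagerForLLMRun
from langchain_core.outputs import GenerationChunk


async def _astream(
    self,
    prompt: str,
    stop: Optional[List[str]] = None,
    run_manager: Optional[AsyncCallbackManagerForLLMRun] = None,
    **kwargs: Any,
) -> AsyncIterator[GenerationChunk]:
    # Reuse the existing blocking _stream generator; a production patch would
    # likely move the iteration onto a thread to avoid blocking the event loop.
    for chunk in self._stream(prompt, stop=stop, run_manager=None, **kwargs):
        if run_manager:
            # The key fix: await the coroutine and pass chunk=chunk.
            await run_manager.on_llm_new_token(chunk.text, chunk=chunk)
        yield chunk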

akionux avatar Jan 25 '24 03:01 akionux

my branch

Thanks @akionux! Does this only work with Langserve, or could this also work with just FastAPI?

weissenbacherpwc avatar Jan 25 '24 06:01 weissenbacherpwc

Thanks @akionux! Does this only work with Langserve, or could this also work with just FastAPI?

I checked that the langserve playground works, so it may also work with FastAPI.

akionux avatar Jan 25 '24 12:01 akionux