GenerativeAIExamples

langchain_nvidia_trt not working

Open rbgo404 opened this issue 1 year ago • 3 comments

I have gone through the notebooks but was not able to stream tokens from TensorRT-LLM. Here's the issue: (screenshot attached)

Code used:

from langchain_nvidia_trt.llms import TritonTensorRTLLM
import time

# Connection and sampling parameters for the Triton-hosted TensorRT-LLM model
triton_url = "localhost:8001"
pload = {
    'tokens': 300,
    'server_url': triton_url,
    'model_name': "ensemble",
    'temperature': 1.0,
    'top_k': 1,
    'top_p': 0,
    'beam_width': 1,
    'repetition_penalty': 1.0,
    'length_penalty': 1.0,
}
client = TritonTensorRTLLM(**pload)

# Llama 2 chat prompt template
LLAMA_PROMPT_TEMPLATE = (
    "<s>[INST] <<SYS>>"
    "{system_prompt}"
    "<</SYS>>"
    "[/INST] {context} </s><s>[INST] {question} [/INST]"
)
system_prompt = "You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Please ensure that your responses are positive in nature."
context = ""
question = 'What is the fastest land animal?'
prompt = LLAMA_PROMPT_TEMPLATE.format(system_prompt=system_prompt, context=context, question=question)

# Stream tokens from the Triton server and measure throughput
start_time = time.time()
tokens_generated = 0

for val in client._stream(prompt):
    tokens_generated += 1
    print(val, end="", flush=True)

total_time = time.time() - start_time
print(f"\n--- Generated {tokens_generated} tokens in {total_time} seconds ---")
print(f"--- {tokens_generated/total_time} tokens/sec")

rbgo404 · Apr 19 '24 10:04

Please share the configuration on the TensorRT-LLM end. What parameter modifications are required in the model's config.pbtxt?
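
(For reference, token streaming through the Triton TensorRT-LLM backend generally requires the decoupled transaction policy to be enabled for the tensorrt_llm model. A minimal sketch of the relevant config.pbtxt excerpt, assuming a recent tensorrtllm_backend layout:)

# tensorrt_llm/config.pbtxt (excerpt): enable decoupled mode so responses can be streamed
model_transaction_policy {
  decoupled: true
}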

rbgo404 · Apr 19 '24 10:04

Hey @rbgo404, you can deploy the TensorRT-based LLM model by following the steps here: https://nvidia.github.io/GenerativeAIExamples/latest/local-gpu.html#using-local-gpus-for-a-q-a-chatbot

This notebook interacts with the model deployed behind the llm-inference-server container, which should be started up if you follow the steps above.

Let me know if you have any questions once you go through these steps!
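
Once the container is running, one quick check is to list which models its Triton instance is serving before pointing TritonTensorRTLLM at one of them. A sketch, assuming the default gRPC port 8001 is published on localhost and tritonclient is installed:

# Sketch: list the models served by the llm-inference-server's Triton instance
import tritonclient.grpc as grpcclient

grpc_client = grpcclient.InferenceServerClient(url="localhost:8001")
for model in grpc_client.get_model_repository_index().models:
    print(model.name, model.version, model.state)
grpc_client.close()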

shubhadeepd · Apr 22 '24 13:04

Hi, I followed the instructions but still have problems starting llm-inference-server. I'm currently using a Tesla M60 and llama-2-13b-chat. (Screenshot from 2024-04-30 23-08-17 attached)

ChiBerkeley · May 01 '24 06:05