GenerativeAIExamples

langchain_nvidia_trt not working

Open rbgo404 opened this issue 2 months ago • 3 comments

I have gone through the notebooks but was unable to stream tokens from TensorRT-LLM. Here's the issue: [image]

Code used:

from langchain_nvidia_trt.llms import TritonTensorRTLLM
import time
import random

triton_url = "localhost:8001"
pload = {
    'tokens': 300,
    'server_url': triton_url,
    'model_name': "ensemble",
    'temperature': 1.0,
    'top_k': 1,
    'top_p': 0,
    'beam_width': 1,
    'repetition_penalty': 1.0,
    'length_penalty': 1.0
}
client = TritonTensorRTLLM(**pload)

LLAMA_PROMPT_TEMPLATE = (
 "<s>[INST] <<SYS>>"
 "{system_prompt}"
 "<</SYS>>"
 "[/INST] {context} </s><s>[INST] {question} [/INST]"
)
system_prompt = "You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Please ensure that your responses are positive in nature."
context=""
question='What is the fastest land animal?'
prompt = LLAMA_PROMPT_TEMPLATE.format(system_prompt=system_prompt, context=context, question=question)

start_time = time.time()
tokens_generated = 0

for val in client._stream(prompt):
    tokens_generated += 1
    print(val, end="", flush=True)

total_time = time.time() - start_time
print(f"\n--- Generated {tokens_generated} tokens in {total_time} seconds ---")
print(f"--- {tokens_generated/total_time} tokens/sec")
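
The timing loop above can be separated from the Triton client so that it can be checked without a running server. The following is a minimal sketch, not the library's own method. `chunk_text`, `measure_stream`, and `fake_stream` are hypothetical helper names introduced here for illustration. It assumes that streamed chunks are either plain strings or objects exposing a `text` attribute (as LangChain's `GenerationChunk` does), since the loop above calls the private `_stream` method and may be receiving chunk objects rather than strings. Note that it counts chunks, and a chunk is not necessarily a single token.

```python
import time
from typing import Iterable, Iterator, Tuple


def chunk_text(chunk: object) -> str:
    """Return the text of a streamed chunk.

    Assumption: a chunk is either a plain string or an object with a
    ``text`` attribute; anything else is converted with str().
    """
    if isinstance(chunk, str):
        return chunk
    return str(getattr(chunk, "text", chunk))


def measure_stream(chunks: Iterable[object]) -> Tuple[str, int, float]:
    """Consume a stream and return (full_text, chunk_count, elapsed_seconds)."""
    start = time.perf_counter()
    parts = []
    for chunk in chunks:
        parts.append(chunk_text(chunk))
    elapsed = time.perf_counter() - start
    return "".join(parts), len(parts), elapsed


def fake_stream() -> Iterator[str]:
    """Hypothetical stand-in for a streaming client; needs no server."""
    for piece in ["The ", "cheetah ", "is ", "the ", "fastest."]:
        yield piece


text, count, elapsed = measure_stream(fake_stream())
# Guard against a zero elapsed time on very fast (e.g. fake) streams.
rate = count / elapsed if elapsed > 0 else float("inf")
print(text)
print(f"--- Generated {count} chunks in {elapsed:.6f} seconds ---")
print(f"--- {rate:.1f} chunks/sec")
```

To try this against the real client, `fake_stream()` would be replaced with the client's streaming iterator, for example `client._stream(prompt)` as used above.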

rbgo404, Apr 19 '24 10:04