GenerativeAIExamples
langchain_nvidia_trt not working
I have gone through the notebooks but am unable to stream tokens from TensorRT-LLM. Here's the issue:
Code used:
from langchain_nvidia_trt.llms import TritonTensorRTLLM
import time
import random
triton_url = "localhost:8001"
pload = {
    'tokens': 300,
    'server_url': triton_url,
    'model_name': "ensemble",
    'temperature': 1.0,
    'top_k': 1,
    'top_p': 0,
    'beam_width': 1,
    'repetition_penalty': 1.0,
    'length_penalty': 1.0
}
client = TritonTensorRTLLM(**pload)
LLAMA_PROMPT_TEMPLATE = (
    "<s>[INST] <<SYS>>"
    "{system_prompt}"
    "<</SYS>>"
    "[/INST] {context} </s><s>[INST] {question} [/INST]"
)
system_prompt = "You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Please ensure that your responses are positive in nature."
context=""
question='What is the fastest land animal?'
prompt = LLAMA_PROMPT_TEMPLATE.format(system_prompt=system_prompt, context=context, question=question)
start_time = time.time()
tokens_generated = 0
for val in client._stream(prompt):
    tokens_generated += 1
    print(val, end="", flush=True)
total_time = time.time() - start_time
print(f"\n--- Generated {tokens_generated} tokens in {total_time} seconds ---")
print(f"--- {tokens_generated/total_time} tokens/sec")
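
To check the timing loop on its own, without a running Triton server, here is a minimal sketch that wraps any iterable of streamed chunks and reports a running count and elapsed time. `timed_stream` and `fake_stream` are hypothetical names used only for illustration; they are not part of `langchain_nvidia_trt`. Assuming `TritonTensorRTLLM` follows the standard LangChain LLM interface, the public `client.stream(prompt)` could probably be passed in place of `fake_stream()`, but that is an assumption and is not verified here. Note that a streamed chunk is not necessarily a single token, so the rate is reported in chunks per second.

```python
import time
from typing import Iterable, Iterator, Tuple


def timed_stream(chunks: Iterable[str]) -> Iterator[Tuple[str, int, float]]:
    """Yield (chunk, chunks_so_far, seconds_elapsed) for each streamed chunk."""
    start = time.perf_counter()
    for count, chunk in enumerate(chunks, start=1):
        yield chunk, count, time.perf_counter() - start


def fake_stream() -> Iterator[str]:
    """Hypothetical stand-in for a streaming LLM client (assumption, not the real API)."""
    for piece in ["The ", "cheetah ", "is ", "the ", "fastest."]:
        time.sleep(0.01)  # simulate per-chunk latency
        yield piece


if __name__ == "__main__":
    count, elapsed = 0, 0.0
    for chunk, count, elapsed in timed_stream(fake_stream()):
        print(chunk, end="", flush=True)
    print(f"\n--- Generated {count} chunks in {elapsed:.3f} seconds ---")
    if elapsed > 0:  # avoid dividing by zero on an empty or instant stream
        print(f"--- {count / elapsed:.1f} chunks/sec")
```

If the stub prints chunks one at a time but the real client does not, the problem is more likely in the client or server configuration than in the measuring loop.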