DJL-TRTLLM: Error while detokenizing output response of teknium/OpenHermes-2.5-Mistral-7B on SageMaker
Description
I followed the recipe given here to manually convert teknium/OpenHermes-2.5-Mistral-7B to TensorRT-LLM on SageMaker's ml.g5.4xlarge, then deployed the compiled model (saved on S3) to a SageMaker endpoint on ml.g5.2xlarge (the two instance types differ only in CPU and RAM). I invoke the endpoint with a minimal request:
import boto3
import json

runtime = boto3.client("sagemaker-runtime")

endpoint_name = "djl-trtllm-endpoint"
content_type = "application/json"
payload = json.dumps({"inputs": "hey", "parameters": {}})

response = runtime.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType=content_type,
    Body=payload,
)
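For reference, the serving.properties that recipes like this produce looks roughly like the following. The S3 path is a placeholder and the option values are taken from the LMI TensorRT-LLM docs rather than copied from my actual config, so treat it as a sketch:

engine=MPI
option.model_id=s3://my-bucket/trtllm-openhermes/
option.rolling_batch=trtllm
option.tensor_parallel_degree=1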
This produces the following error log:
Error Message
[INFO ] PyProcess - W-350-model-stdout: [1,0]<stdout>:Rolling batch inference error
[INFO ] PyProcess - W-350-model-stdout: [1,0]<stdout>:Traceback (most recent call last):
[INFO ] PyProcess - W-350-model-stdout: [1,0]<stdout>: File "/tmp/.djl.ai/python/0.26.0/djl_python/rolling_batch/rolling_batch.py", line 189, in try_catch_handling
[INFO ] PyProcess - W-350-model-stdout: [1,0]<stdout>: return func(self, input_data, parameters)
[INFO ] PyProcess - W-350-model-stdout: [1,0]<stdout>: File "/tmp/.djl.ai/python/0.26.0/djl_python/rolling_batch/trtllm_rolling_batch.py", line 80, in inference
[INFO ] PyProcess - W-350-model-stdout: [1,0]<stdout>: generation = trt_resp.fetch()
[INFO ] PyProcess - W-350-model-stdout: [1,0]<stdout>: File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm_toolkit/detoknized_triton_repsonse.py", line 69, in fetch
[INFO ] PyProcess - W-350-model-stdout: [1,0]<stdout>: self.decode_token(), len(self.all_input_ids), complete)
[INFO ] PyProcess - W-350-model-stdout: [1,0]<stdout>: File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm_toolkit/detoknized_triton_repsonse.py", line 45, in decode_token
[INFO ] PyProcess - W-350-model-stdout: [1,0]<stdout>: new_text = self.tokenizer.decode(
[INFO ] PyProcess - W-350-model-stdout: [1,0]<stdout>: File "/usr/local/lib/python3.10/dist-packages/transformers/tokenization_utils_base.py", line 3750, in decode
[INFO ] PyProcess - W-350-model-stdout: [1,0]<stdout>: return self._decode(
[INFO ] PyProcess - W-350-model-stdout: [1,0]<stdout>: File "/usr/local/lib/python3.10/dist-packages/transformers/tokenization_utils_fast.py", line 625, in _decode
[INFO ] PyProcess - W-350-model-stdout: [1,0]<stdout>: text = self._tokenizer.decode(token_ids, skip_special_tokens=skip_special_tokens)
[INFO ] PyProcess - W-350-model-stdout: [1,0]<stdout>:TypeError: argument 'ids': 'list' object cannot be interpreted as an integer
I assume the error comes from passing a list of lists to the _tokenizer.decode function instead of a flat list of input IDs. Can someone help me understand why this happens?
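For what it's worth, the shape mismatch is easy to reproduce outside the container. This is my own minimal sketch against the Hugging Face tokenizer, not code from the toolkit:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("teknium/OpenHermes-2.5-Mistral-7B")

flat_ids = tokenizer.encode("hey")   # a flat list of ints
print(tokenizer.decode(flat_ids))    # works: decode() expects a flat list

nested_ids = [flat_ids]              # a batch of one sequence (list of lists)
tokenizer.decode(nested_ids)         # TypeError: argument 'ids': 'list' object
                                     # cannot be interpreted as an integer

# batch_decode() is the call that accepts a list of sequences:
# print(tokenizer.batch_decode(nested_ids))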
Could you share which DJLServing or LMI version you are using?
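In case it helps narrow that down: the /tmp/.djl.ai/python/0.26.0/ paths in the traceback suggest DJLServing 0.26.0. One way to confirm which LMI container image the endpoint is actually running is to walk the endpoint metadata with boto3; a sketch, assuming the endpoint name from the snippet above:

import boto3

sm = boto3.client("sagemaker")

# Walk endpoint -> endpoint config -> model to find the container image,
# whose tag carries the DJLServing/LMI release (e.g. ...djl-inference:0.26.0-...).
endpoint = sm.describe_endpoint(EndpointName="djl-trtllm-endpoint")
config = sm.describe_endpoint_config(
    EndpointConfigName=endpoint["EndpointConfigName"]
)
model_name = config["ProductionVariants"][0]["ModelName"]
model = sm.describe_model(ModelName=model_name)
print(model["PrimaryContainer"]["Image"])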