TensorRT-LLM
TensorRT-LLM Conversion Script Bug: TypeError: Got unsupported ScalarType BFloat16
System Info
Description
I'm building the DJL-Serving TensorRT-LLM LMI inference container from scratch and deploying it on SageMaker Endpoints for the Zephyr-7B model. Unfortunately, I run into an error from the tensorrt_llm_toolkit: `TypeError: Got unsupported ScalarType BFloat16`. This is most likely a bug in the checkpoint conversion script in NVIDIA/TensorRT-LLM: it loads the weights directly and converts them to NumPy on the CPU, but NumPy has no native BFloat16 dtype, so the conversion fails for bfloat16 checkpoints.
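The failure is easy to reproduce outside the toolkit with any bfloat16 tensor (a minimal sketch; the tensor below is a stand-in for the actual weights):

```python
import torch

# PyTorch handles bfloat16 on the CPU without issue...
w = torch.zeros(4, 4, dtype=torch.bfloat16)

# ...but NumPy has no bfloat16 dtype, so .numpy() raises the same
# error the conversion script hits.
try:
    w.numpy()
except TypeError as e:
    print(e)  # Got unsupported ScalarType BFloat16

# Upcasting to a NumPy-representable dtype first works:
arr = w.to(torch.float32).numpy()
print(arr.dtype)  # float32
```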
System Info:
- GPU: ml.g5.48xlarge (8 NVIDIA A10G GPUs, on SageMaker Endpoints)
- OS: Ubuntu 22.04 LTS
- Model: Zephyr-7B Beta
Who can help?
No response
Information
- [ ] The official example scripts
- [ ] My own modified scripts
Tasks
- [ ] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)
Reproduction
- Using a TensorRT-LLM inference container derived from the DJL-Serving Dockerfile (https://github.com/deepjavalibrary/djl-serving/blob/master/serving/docker/tensorrt-llm.Dockerfile)
- Inference image pushed to ECR
- Model checkpoint for Zephyr-7B compressed as a tarball
- Create the model on SageMaker:
```python
import boto3
from sagemaker.utils import name_from_base

sm_client = boto3.client("sagemaker")
# role, inference_image_uri, and code_artifact are defined earlier in the notebook.

model_name = name_from_base("my-model-djl-tensorrt")
print(model_name)

create_model_response = sm_client.create_model(
    ModelName=model_name,
    ExecutionRoleArn=role,
    PrimaryContainer={
        "Image": inference_image_uri,
        "ModelDataUrl": code_artifact,
        "Environment": {
            "ENGINE": "MPI",
            "OPTION_TENSOR_PARALLEL_DEGREE": "8",
            "OPTION_USE_CUSTOM_ALL_REDUCE": "false",
            "OPTION_OUTPUT_FORMATTER": "json",
            "OPTION_MAX_ROLLING_BATCH_SIZE": "16",
            "OPTION_MODEL_LOADING_TIMEOUT": "1000",
            "OPTION_MAX_INPUT_LEN": "5000",
            "OPTION_MAX_OUTPUT_LEN": "1000",
            "OPTION_DTYPE": "bf16",
        },
    },
)
model_arn = create_model_response["ModelArn"]
print(f"Created Model: {model_arn}")
```
- Create endpoint config:
```python
endpoint_config_name = f"{model_name}-config"
endpoint_name = f"{model_name}-endpoint"

endpoint_config_response = sm_client.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ProductionVariants=[
        {
            "VariantName": "variant1",
            "ModelName": model_name,
            "InstanceType": instance_type,  # ml.g5.48xlarge
            "InitialInstanceCount": 1,
            "ModelDataDownloadTimeoutInSeconds": 2400,
            "ContainerStartupHealthCheckTimeoutInSeconds": 2400,
        },
    ],
)
print(endpoint_config_response)
```
- Create the SageMaker endpoint:
```python
create_endpoint_response = sm_client.create_endpoint(
    EndpointName=endpoint_name, EndpointConfigName=endpoint_config_name
)
print(f"Created Endpoint: {create_endpoint_response['EndpointArn']}")
```
Expected behavior
Expected the DJL-Serving image built from the Dockerfile above (https://github.com/deepjavalibrary/djl-serving/blob/master/serving/docker/tensorrt-llm.Dockerfile) to deploy and serve successfully on SageMaker Endpoints.
IMPORTANT: An older version of the DJL-Serving TensorRT-LLM container works. These are the build args I used to get it working:
```dockerfile
ARG djl_version=0.27.0~SNAPSHOT
# Base Deps
ARG cuda_version=cu122
ARG python_version=3.10
ARG torch_version=2.1.0
ARG pydantic_version=2.6.1
ARG cuda_python_version=12.2.0
ARG ammo_version=0.5.0
ARG janus_version=1.0.0
ARG pynvml_version=11.5.0
ARG s5cmd_version=2.2.2
# HF Deps
ARG transformers_version=4.36.2
ARG accelerate_version=0.25.0
# Trtllm Deps
ARG tensorrtlibs_version=9.2.0.post12.dev5
ARG trtllm_toolkit_version=0.7.1
ARG trtllm_version=v0.7.1
```
Actual behavior
```
2024-04-25T11:17:00.976-07:00 [INFO ] LmiUtils - convert_py: File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm_toolkit/build_scripts/llama/convert_checkpoint.py", line 1480, in covert_and_save
2024-04-25T11:17:00.976-07:00 [INFO ] LmiUtils - convert_py: weights = convert_hf_llama(
2024-04-25T11:17:00.976-07:00 [INFO ] LmiUtils - convert_py: File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm_toolkit/build_scripts/llama/convert_checkpoint.py", line 1179, in convert_hf_llama
2024-04-25T11:17:00.976-07:00 [INFO ] LmiUtils - convert_py: np.pad(lm_head_weights.detach().cpu().numpy(),
2024-04-25T11:17:01.227-07:00 [INFO ] LmiUtils - convert_py: TypeError: Got unsupported ScalarType BFloat16
```
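The traceback points at the `.numpy()` call on the bfloat16 `lm_head_weights`. A possible local patch (a sketch of the idea only, not the upstream fix) is to round-trip bfloat16 through float32 before handing the tensor to NumPy:

```python
import numpy as np
import torch

def to_numpy(t: torch.Tensor) -> np.ndarray:
    """Convert a torch tensor to NumPy, upcasting bfloat16 first,
    since NumPy has no bfloat16 dtype."""
    if t.dtype == torch.bfloat16:
        t = t.to(torch.float32)
    return t.detach().cpu().numpy()

# The failing call at convert_checkpoint.py:1179 would then read, schematically:
#   np.pad(to_numpy(lm_head_weights), ...)
```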
Additional notes
N/A