TensorRT-LLM
TensorRT-LLM Conversion Script Bug: TypeError: Got unsupported ScalarType BFloat16
System Info
Description
I'm building the DJL-Serving TensorRT-LLM LMI inference container from scratch and deploying it on SageMaker Endpoints for the Zephyr-7B model. Unfortunately, I run into an error from the tensorrt_llm_toolkit: `TypeError: Got unsupported ScalarType BFloat16`. This is most likely a bug in the checkpoint conversion script in NVIDIA/TensorRT-LLM: it loads the weights directly and converts them to NumPy on the CPU, but NumPy has no native BFloat16 dtype, so the conversion fails for bfloat16 checkpoints.
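The failure is easy to reproduce outside the toolkit with any bfloat16 tensor (a minimal sketch; the tensor below is a stand-in for the actual weights):

```python
import torch

# PyTorch handles bfloat16 on the CPU without issue...
w = torch.zeros(4, 4, dtype=torch.bfloat16)

# ...but NumPy has no bfloat16 dtype, so .numpy() raises the same
# error the conversion script hits.
try:
    w.numpy()
except TypeError as e:
    print(e)  # Got unsupported ScalarType BFloat16

# Upcasting to a NumPy-representable dtype first works:
arr = w.to(torch.float32).numpy()
print(arr.dtype)  # float32
```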
System Info:
- GPU: ml.g5.48xlarge (8 NVIDIA A10G GPUs, on SageMaker Endpoints)
- OS: Ubuntu 22.04 LTS
- Model: Zephyr-7B Beta
Who can help?
No response
Information
- [ ] The official example scripts
- [ ] My own modified scripts
Tasks
- [ ] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)
Reproduction
- Using a TensorRT-LLM inference container derived from the DJL-Serving Dockerfile (https://github.com/deepjavalibrary/djl-serving/blob/master/serving/docker/tensorrt-llm.Dockerfile)
- Inference image pushed to ECR
- Model checkpoint for Zephyr-7B compressed as a tarball
- Create the model on SageMaker:
```python
import boto3
from sagemaker.utils import name_from_base

sm_client = boto3.client("sagemaker")
# role, inference_image_uri, and code_artifact are defined earlier in the notebook.

model_name = name_from_base("my-model-djl-tensorrt")
print(model_name)

create_model_response = sm_client.create_model(
    ModelName=model_name,
    ExecutionRoleArn=role,
    PrimaryContainer={
        "Image": inference_image_uri,
        "ModelDataUrl": code_artifact,
        "Environment": {
            "ENGINE": "MPI",
            "OPTION_TENSOR_PARALLEL_DEGREE": "8",
            "OPTION_USE_CUSTOM_ALL_REDUCE": "false",
            "OPTION_OUTPUT_FORMATTER": "json",
            "OPTION_MAX_ROLLING_BATCH_SIZE": "16",
            "OPTION_MODEL_LOADING_TIMEOUT": "1000",
            "OPTION_MAX_INPUT_LEN": "5000",
            "OPTION_MAX_OUTPUT_LEN": "1000",
            "OPTION_DTYPE": "bf16",
        },
    },
)
model_arn = create_model_response["ModelArn"]
print(f"Created Model: {model_arn}")
```
- Create endpoint config:
```python
endpoint_config_name = f"{model_name}-config"
endpoint_name = f"{model_name}-endpoint"

endpoint_config_response = sm_client.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ProductionVariants=[
        {
            "VariantName": "variant1",
            "ModelName": model_name,
            "InstanceType": instance_type,  # ml.g5.48xlarge
            "InitialInstanceCount": 1,
            "ModelDataDownloadTimeoutInSeconds": 2400,
            "ContainerStartupHealthCheckTimeoutInSeconds": 2400,
        },
    ],
)
print(endpoint_config_response)
```
- Create the SageMaker endpoint:
```python
create_endpoint_response = sm_client.create_endpoint(
    EndpointName=endpoint_name, EndpointConfigName=endpoint_config_name
)
print(f"Created Endpoint: {create_endpoint_response['EndpointArn']}")
```
Expected behavior
Expected the DJL-Serving image built from the Dockerfile above (https://github.com/deepjavalibrary/djl-serving/blob/master/serving/docker/tensorrt-llm.Dockerfile) to deploy and serve successfully on SageMaker Endpoints.
IMPORTANT: An older version of the DJL-Serving TensorRT-LLM container works. These are the build args I used to get it working:
```dockerfile
ARG djl_version=0.27.0~SNAPSHOT
# Base Deps
ARG cuda_version=cu122
ARG python_version=3.10
ARG torch_version=2.1.0
ARG pydantic_version=2.6.1
ARG cuda_python_version=12.2.0
ARG ammo_version=0.5.0
ARG janus_version=1.0.0
ARG pynvml_version=11.5.0
ARG s5cmd_version=2.2.2
# HF Deps
ARG transformers_version=4.36.2
ARG accelerate_version=0.25.0
# Trtllm Deps
ARG tensorrtlibs_version=9.2.0.post12.dev5
ARG trtllm_toolkit_version=0.7.1
ARG trtllm_version=v0.7.1
```
Actual behavior
```
2024-04-25T11:17:00.976-07:00 [INFO ] LmiUtils - convert_py: File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm_toolkit/build_scripts/llama/convert_checkpoint.py", line 1480, in covert_and_save
2024-04-25T11:17:00.976-07:00 [INFO ] LmiUtils - convert_py: weights = convert_hf_llama(
2024-04-25T11:17:00.976-07:00 [INFO ] LmiUtils - convert_py: File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm_toolkit/build_scripts/llama/convert_checkpoint.py", line 1179, in convert_hf_llama
2024-04-25T11:17:00.976-07:00 [INFO ] LmiUtils - convert_py: np.pad(lm_head_weights.detach().cpu().numpy(),
2024-04-25T11:17:01.227-07:00 [INFO ] LmiUtils - convert_py: TypeError: Got unsupported ScalarType BFloat16
```
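The traceback points at the `.numpy()` call on the bfloat16 `lm_head_weights`. A possible local patch (a sketch of the idea only, not the upstream fix) is to round-trip bfloat16 through float32 before handing the tensor to NumPy:

```python
import numpy as np
import torch

def to_numpy(t: torch.Tensor) -> np.ndarray:
    """Convert a torch tensor to NumPy, upcasting bfloat16 first,
    since NumPy has no bfloat16 dtype."""
    if t.dtype == torch.bfloat16:
        t = t.to(torch.float32)
    return t.detach().cpu().numpy()

# The failing call at convert_checkpoint.py:1179 would then read, schematically:
#   np.pad(to_numpy(lm_head_weights), ...)
```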
Additional notes
N/A