
DJL-TensorRT-LLM Bug: TypeError: Got unsupported ScalarType BFloat16

Open · rileyhun opened this issue · 3 comments

Description


I am building the DJL-Serving TensorRT-LLM LMI inference container from scratch and deploying the Zephyr-7B model on SageMaker Endpoints. Unfortunately, I run into an error from tensorrt_llm_toolkit: TypeError: Got unsupported ScalarType BFloat16

Expected Behavior

Expected the DJL-Serving image built from https://github.com/deepjavalibrary/djl-serving/blob/master/serving/docker/tensorrt-llm.Dockerfile to run successfully on SageMaker Endpoints.

Error Message


2024-04-25T11:17:00.976-07:00	[INFO ] LmiUtils - convert_py: File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm_toolkit/build_scripts/llama/convert_checkpoint.py", line 1480, in covert_and_save
2024-04-25T11:17:00.976-07:00	[INFO ] LmiUtils - convert_py: weights = convert_hf_llama(
2024-04-25T11:17:00.976-07:00	[INFO ] LmiUtils - convert_py: File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm_toolkit/build_scripts/llama/convert_checkpoint.py", line 1179, in convert_hf_llama
2024-04-25T11:17:00.976-07:00	[INFO ] LmiUtils - convert_py: np.pad(lm_head_weights.detach().cpu().numpy(),
2024-04-25T11:17:01.227-07:00	[INFO ] LmiUtils - convert_py: TypeError: Got unsupported ScalarType BFloat16
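
The failure can be reproduced outside the toolkit: calling .numpy() on a bfloat16 tensor raises this exact TypeError, because numpy has no bfloat16 dtype. A minimal, illustrative sketch (not part of the conversion script):

import torch

# bfloat16 weights, as produced when the checkpoint is handled with dtype=bf16
lm_head_weights = torch.zeros(4, 4, dtype=torch.bfloat16)

try:
    lm_head_weights.detach().cpu().numpy()  # numpy has no bfloat16 dtype
except TypeError as e:
    print(e)  # Got unsupported ScalarType BFloat16

# Casting to a numpy-supported dtype first avoids the error
arr = lm_head_weights.detach().cpu().to(torch.float32).numpy()
print(arr.dtype)  # float32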

How to Reproduce?


  • Build the TensorRT-LLM inference container from https://github.com/deepjavalibrary/djl-serving/blob/master/serving/docker/tensorrt-llm.Dockerfile
  • Push the inference image to ECR
  • Compress the Zephyr-7B model checkpoint as a tarball
  • Create the model on SageMaker:
import boto3
from sagemaker.utils import name_from_base

# role, inference_image_uri, code_artifact, and instance_type are assumed to be
# defined earlier in the notebook.
sm_client = boto3.client("sagemaker")

model_name = name_from_base("my-model-djl-tensorrt")
print(model_name)

create_model_response = sm_client.create_model(
    ModelName=model_name,
    ExecutionRoleArn=role,
    PrimaryContainer={
        "Image": inference_image_uri,
        "ModelDataUrl": code_artifact,
        "Environment": {
            "ENGINE": "MPI",
            "OPTION_TENSOR_PARALLEL_DEGREE": "8",
            "OPTION_USE_CUSTOM_ALL_REDUCE": "false",
            "OPTION_OUTPUT_FORMATTER": "json",
            "OPTION_MAX_ROLLING_BATCH_SIZE": "16",
            "OPTION_MODEL_LOADING_TIMEOUT": "1000",
            "OPTION_MAX_INPUT_LEN": "5000",
            "OPTION_MAX_OUTPUT_LEN": "1000",
            "OPTION_DTYPE": "bf16"
        }
    },
)
model_arn = create_model_response["ModelArn"]

print(f"Created Model: {model_arn}")
  • Create endpoint config:
endpoint_config_name = f"{model_name}-config"
endpoint_name = f"{model_name}-endpoint"

endpoint_config_response = sm_client.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ProductionVariants=[
        {
            "VariantName": "variant1",
            "ModelName": model_name,
            "InstanceType": instance_type,
            "InitialInstanceCount": 1,
            "ModelDataDownloadTimeoutInSeconds": 2400,
            "ContainerStartupHealthCheckTimeoutInSeconds": 2400,
        },
    ],
)
endpoint_config_response
  • Create the SageMaker endpoint:
create_endpoint_response = sm_client.create_endpoint(
    EndpointName=f"{endpoint_name}", EndpointConfigName=endpoint_config_name
)
print(f"Created Endpoint: {create_endpoint_response['EndpointArn']}")

rileyhun commented on Apr 25, 2024

Hi Riley, thanks for raising the issue. It seems like this is most likely an error with the checkpoint conversion script in NVIDIA/TensorRT-LLM: it loads the weights directly and converts them to numpy on the CPU, and numpy has no native bfloat16 dtype, so the .numpy() call fails. I'd suggest creating a ticket in the TensorRT-LLM repo about this issue.

To work around this issue in the meantime, you could manually convert and save the model in fp32 before loading it.
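
For example, a rough sketch of one way to do the suggested fp32 re-save (paths are placeholders, not from the original thread; loading a 7B model in fp32 needs roughly 28 GB of CPU RAM):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

src = "path/to/zephyr-7b-checkpoint"  # placeholder: original checkpoint directory or HF model id
dst = "./zephyr-7b-fp32"              # placeholder: re-saved copy to package as the tarball

model = AutoModelForCausalLM.from_pretrained(src, torch_dtype=torch.float32)
tokenizer = AutoTokenizer.from_pretrained(src)

model.save_pretrained(dst)
tokenizer.save_pretrained(dst)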

ydm-amazon commented on Apr 29, 2024

Hello @ydm-amazon,

Thanks for following up. I'll check w/ the TensorRT-LLM repo about the issue.

I also wanted to point out that I don't get this issue when using the following args in the Dockerfile:

ARG djl_version=0.27.0~SNAPSHOT

# Base Deps
ARG cuda_version=cu122
ARG python_version=3.10
ARG torch_version=2.1.0
ARG pydantic_version=2.6.1
ARG cuda_python_version=12.2.0
ARG ammo_version=0.5.0
ARG janus_version=1.0.0
ARG pynvml_version=11.5.0
ARG s5cmd_version=2.2.2

# HF Deps
ARG transformers_version=4.36.2
ARG accelerate_version=0.25.0

# Trtllm Deps
ARG tensorrtlibs_version=9.2.0.post12.dev5
ARG trtllm_toolkit_version=0.7.1
ARG trtllm_version=v0.7.1

rileyhun commented on Apr 29, 2024

That's right - we know that TensorRT-LLM switched to a different way of loading the model between 0.7.1 and 0.8.0, so that change may have caused the issue. We're also looking into our trtllm toolkit 0.8.0 to see whether something there contributes as well.

ydm-amazon commented on Apr 29, 2024