
Issue with deploying via Hex-LLM, the TPU serving solution built with XLA that is being developed by Google Cloud.

Status: Open · ariji1 opened this issue 11 months ago · 6 comments

Expected Behavior

Model Deployed Successfully

Actual Behavior

I am getting this error:

INFO:google.cloud.aiplatform.models:Creating Endpoint
INFO:google.cloud.aiplatform.models:Create Endpoint backing LRO: projects/81995035742/locations/us-central1/endpoints/6941658909824253952/operations/4744238776585289728
Using model from: gs://19865_finetuned_models/gemma-keras-lora-train_20240308_200536
INFO:google.cloud.aiplatform.models:Endpoint created. Resource name: projects/81995035742/locations/us-central1/endpoints/6941658909824253952
INFO:google.cloud.aiplatform.models:To use this Endpoint in another session:
INFO:google.cloud.aiplatform.models:endpoint = aiplatform.Endpoint('projects/81995035742/locations/us-central1/endpoints/6941658909824253952')
INFO:google.cloud.aiplatform.models:Creating Model
INFO:google.cloud.aiplatform.models:Create Model backing LRO: projects/81995035742/locations/us-central1/models/6359646723312189440/operations/7818789947195785216
INFO:google.cloud.aiplatform.models:Model created. Resource name: projects/81995035742/locations/us-central1/models/6359646723312189440@1
INFO:google.cloud.aiplatform.models:To use this Model in another session:
INFO:google.cloud.aiplatform.models:model = aiplatform.Model('projects/81995035742/locations/us-central1/models/6359646723312189440@1')
INFO:google.cloud.aiplatform.models:Deploying model to Endpoint : projects/81995035742/locations/us-central1/endpoints/6941658909824253952

_InactiveRpcError                         Traceback (most recent call last)
/usr/local/lib/python3.10/dist-packages/google/api_core/grpc_helpers.py in error_remapped_callable(*args, **kwargs)
     71     try:
---> 72         return callable_(*args, **kwargs)
     73     except grpc.RpcError as exc:

11 frames

_InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
	status = StatusCode.INVALID_ARGUMENT
	details = "Machine type "ct4p-hightpu-4t" is not supported."
	debug_error_string = "UNKNOWN:Error received from peer ipv4:173.194.196.95:443 {created_time:"2024-03-08T21:11:42.821027279+00:00", grpc_status:3, grpc_message:"Machine type "ct4p-hightpu-4t" is not supported."}"
>

The above exception was the direct cause of the following exception:

InvalidArgument                           Traceback (most recent call last)
/usr/local/lib/python3.10/dist-packages/google/api_core/grpc_helpers.py in error_remapped_callable(*args, **kwargs)
     72         return callable_(*args, **kwargs)
     73     except grpc.RpcError as exc:
---> 74         raise exceptions.from_grpc_error(exc) from exc
     75 
     76     return error_remapped_callable

InvalidArgument: 400 Machine type "ct4p-hightpu-4t" is not supported.

Steps to Reproduce the Problem

Run this code:

# @title Deploy

@markdown This section uploads the model to Model Registry and deploys it on the Endpoint. It takes 15 minutes to 1 hour to finish.

@markdown Hex-LLM is a High-Efficiency Large Language Model (LLM) TPU serving solution built with XLA, which is being developed by Google Cloud. This notebook uses TPU v5e machines. Click Show code to see more details.

if LOAD_MODEL_FROM != "Kaggle":
    print("Skipped: Expect to load model from Kaggle, got", LOAD_MODEL_FROM)
else:
    if "2b" in KAGGLE_MODEL_ID:
        # Sets ct5lp-hightpu-1t (1 TPU chip) to deploy Gemma 2B models.
        machine_type = "ct5lp-hightpu-1t"
    else:
        # Sets ct5lp-hightpu-4t (4 TPU chips) to deploy Gemma 7B models.
        machine_type = "ct4p-hightpu-4t"

# Note that a larger max_num_batched_tokens will require more TPU memory.
max_num_batched_tokens = 11264
# Multiple of tokens for padding alignment. A higher value can reduce
# re-compilation but can also increase the waste in computation.
tokens_pad_multiple = 1024
# Multiple of sequences for padding alignment. A higher value can reduce
# re-compilation but can also increase the waste in computation.
seqs_pad_multiple = 32

print("Using model from: ", output_folder)
model, endpoint = deploy_model_hexllm(
    model_name=get_job_name_with_datetime(prefix="gemma-serve-hexllm"),
    base_model_id=f"google/{KAGGLE_MODEL_ID}",
    model_id=output_folder,
    service_account=SERVICE_ACCOUNT,
    machine_type=machine_type,
    max_num_batched_tokens=max_num_batched_tokens,
    tokens_pad_multiple=tokens_pad_multiple,
    seqs_pad_multiple=seqs_pad_multiple,
)
print("endpoint_name:", endpoint.name)
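Note that the error message quotes the exact literal assigned in the non-2B branch above: the in-code comment says `ct5lp-hightpu-4t` (TPU v5e, matching the notebook's stated TPU v5e machines), but the code assigns `ct4p-hightpu-4t`. A minimal sketch of the selection logic with the literal matching the comment is below; `pick_tpu_machine_type` is a hypothetical helper name, and whether `ct5lp-hightpu-4t` is accepted in your project and region is an assumption to verify, not a confirmed fix from the maintainers.

```python
def pick_tpu_machine_type(kaggle_model_id: str) -> str:
    """Map a Gemma Kaggle model ID to a TPU v5e machine type.

    Hypothetical helper mirroring the branch in the notebook cell, but
    returning "ct5lp-hightpu-4t" (4 TPU v5e chips) as the in-code comment
    describes, rather than the rejected "ct4p-hightpu-4t".
    """
    if "2b" in kaggle_model_id:
        # 1 TPU v5e chip for Gemma 2B models.
        return "ct5lp-hightpu-1t"
    # 4 TPU v5e chips for Gemma 7B models.
    return "ct5lp-hightpu-4t"

print(pick_tpu_machine_type("gemma-2b"))  # ct5lp-hightpu-1t
print(pick_tpu_machine_type("gemma-7b"))  # ct5lp-hightpu-4t
```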

Specifications

  • Version:
  • Platform: colab enterprise

ariji1 · Mar 08 '24 21:03