
Issue with deploying via Hex-LLM, the TPU serving solution built with XLA that is being developed by Google Cloud.

Status: Open · ariji1 opened this issue 11 months ago · 6 comments

Expected Behavior

Model Deployed Successfully

Actual Behavior

I am getting this error:

INFO:google.cloud.aiplatform.models:Creating Endpoint
INFO:google.cloud.aiplatform.models:Create Endpoint backing LRO: projects/81995035742/locations/us-central1/endpoints/6941658909824253952/operations/4744238776585289728
Using model from: gs://19865_finetuned_models/gemma-keras-lora-train_20240308_200536
INFO:google.cloud.aiplatform.models:Endpoint created. Resource name: projects/81995035742/locations/us-central1/endpoints/6941658909824253952
INFO:google.cloud.aiplatform.models:To use this Endpoint in another session:
INFO:google.cloud.aiplatform.models:endpoint = aiplatform.Endpoint('projects/81995035742/locations/us-central1/endpoints/6941658909824253952')
INFO:google.cloud.aiplatform.models:Creating Model
INFO:google.cloud.aiplatform.models:Create Model backing LRO: projects/81995035742/locations/us-central1/models/6359646723312189440/operations/7818789947195785216
INFO:google.cloud.aiplatform.models:Model created. Resource name: projects/81995035742/locations/us-central1/models/6359646723312189440@1
INFO:google.cloud.aiplatform.models:To use this Model in another session:
INFO:google.cloud.aiplatform.models:model = aiplatform.Model('projects/81995035742/locations/us-central1/models/6359646723312189440@1')
INFO:google.cloud.aiplatform.models:Deploying model to Endpoint : projects/81995035742/locations/us-central1/endpoints/6941658909824253952

_InactiveRpcError                         Traceback (most recent call last)
/usr/local/lib/python3.10/dist-packages/google/api_core/grpc_helpers.py in error_remapped_callable(*args, **kwargs)
     71     try:
---> 72         return callable_(*args, **kwargs)
     73     except grpc.RpcError as exc:

11 frames

_InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
	status = StatusCode.INVALID_ARGUMENT
	details = "Machine type "ct4p-hightpu-4t" is not supported."
	debug_error_string = "UNKNOWN:Error received from peer ipv4:173.194.196.95:443 {created_time:"2024-03-08T21:11:42.821027279+00:00", grpc_status:3, grpc_message:"Machine type "ct4p-hightpu-4t" is not supported."}"
>

The above exception was the direct cause of the following exception:

InvalidArgument                           Traceback (most recent call last)
/usr/local/lib/python3.10/dist-packages/google/api_core/grpc_helpers.py in error_remapped_callable(*args, **kwargs)
     72         return callable_(*args, **kwargs)
     73     except grpc.RpcError as exc:
---> 74         raise exceptions.from_grpc_error(exc) from exc
     75 
     76     return error_remapped_callable

InvalidArgument: 400 Machine type "ct4p-hightpu-4t" is not supported.

Steps to Reproduce the Problem

Run this code:

# @title Deploy

@markdown This section uploads the model to Model Registry and deploys it on the Endpoint. It takes 15 minutes to 1 hour to finish.

@markdown Hex-LLM is a High-Efficiency Large Language Model (LLM) TPU serving solution built with XLA, which is being developed by Google Cloud. This notebook uses TPU v5e machines. Click Show code to see more details.

if LOAD_MODEL_FROM != "Kaggle":
    print("Skipped: Expect to load model from Kaggle, got", LOAD_MODEL_FROM)
else:
    if "2b" in KAGGLE_MODEL_ID:
        # Sets ct5lp-hightpu-1t (1 TPU chip) to deploy Gemma 2B models.
        machine_type = "ct5lp-hightpu-1t"
    else:
        # Sets ct5lp-hightpu-4t (4 TPU chips) to deploy Gemma 7B models.
        machine_type = "ct4p-hightpu-4t"

# Note that a larger max_num_batched_tokens will require more TPU memory.
max_num_batched_tokens = 11264
# Multiple of tokens for padding alignment. A higher value can reduce
# re-compilation but can also increase the waste in computation.
tokens_pad_multiple = 1024
# Multiple of sequences for padding alignment. A higher value can reduce
# re-compilation but can also increase the waste in computation.
seqs_pad_multiple = 32

print("Using model from: ", output_folder)
model, endpoint = deploy_model_hexllm(
    model_name=get_job_name_with_datetime(prefix="gemma-serve-hexllm"),
    base_model_id=f"google/{KAGGLE_MODEL_ID}",
    model_id=output_folder,
    service_account=SERVICE_ACCOUNT,
    machine_type=machine_type,
    max_num_batched_tokens=max_num_batched_tokens,
    tokens_pad_multiple=tokens_pad_multiple,
    seqs_pad_multiple=seqs_pad_multiple,
)
print("endpoint_name:", endpoint.name)
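Note that the error message quotes the exact literal assigned in the non-2B branch above: the in-code comment says `ct5lp-hightpu-4t` (TPU v5e, matching the notebook's stated TPU v5e machines), but the code assigns `ct4p-hightpu-4t`. A minimal sketch of the selection logic with the literal matching the comment is below; `pick_tpu_machine_type` is a hypothetical helper name, and whether `ct5lp-hightpu-4t` is accepted in your project and region is an assumption to verify, not a confirmed fix from the maintainers.

```python
def pick_tpu_machine_type(kaggle_model_id: str) -> str:
    """Map a Gemma Kaggle model ID to a TPU v5e machine type.

    Hypothetical helper mirroring the branch in the notebook cell, but
    returning "ct5lp-hightpu-4t" (4 TPU v5e chips) as the in-code comment
    describes, rather than the rejected "ct4p-hightpu-4t".
    """
    if "2b" in kaggle_model_id:
        # 1 TPU v5e chip for Gemma 2B models.
        return "ct5lp-hightpu-1t"
    # 4 TPU v5e chips for Gemma 7B models.
    return "ct5lp-hightpu-4t"

print(pick_tpu_machine_type("gemma-2b"))  # ct5lp-hightpu-1t
print(pick_tpu_machine_type("gemma-7b"))  # ct5lp-hightpu-4t
```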

Specifications

  • Version:
  • Platform: colab enterprise

ariji1 · Mar 08 '24 21:03