vertex-ai-samples
Issue with deploying via Hex-LLM, a TPU serving solution built with XLA that is being developed by Google Cloud.
Expected Behavior
Model Deployed Successfully
Actual Behavior
I am getting this error:
```
INFO:google.cloud.aiplatform.models:Creating Endpoint
INFO:google.cloud.aiplatform.models:Create Endpoint backing LRO: projects/81995035742/locations/us-central1/endpoints/6941658909824253952/operations/4744238776585289728
Using model from: gs://19865_finetuned_models/gemma-keras-lora-train_20240308_200536
INFO:google.cloud.aiplatform.models:Endpoint created. Resource name: projects/81995035742/locations/us-central1/endpoints/6941658909824253952
INFO:google.cloud.aiplatform.models:To use this Endpoint in another session:
INFO:google.cloud.aiplatform.models:endpoint = aiplatform.Endpoint('projects/81995035742/locations/us-central1/endpoints/6941658909824253952')
INFO:google.cloud.aiplatform.models:Creating Model
INFO:google.cloud.aiplatform.models:Create Model backing LRO: projects/81995035742/locations/us-central1/models/6359646723312189440/operations/7818789947195785216
INFO:google.cloud.aiplatform.models:Model created. Resource name: projects/81995035742/locations/us-central1/models/6359646723312189440@1
INFO:google.cloud.aiplatform.models:To use this Model in another session:
INFO:google.cloud.aiplatform.models:model = aiplatform.Model('projects/81995035742/locations/us-central1/models/6359646723312189440@1')
INFO:google.cloud.aiplatform.models:Deploying model to Endpoint : projects/81995035742/locations/us-central1/endpoints/6941658909824253952
```
```
---------------------------------------------------------------------------
_InactiveRpcError                         Traceback (most recent call last)
/usr/local/lib/python3.10/dist-packages/google/api_core/grpc_helpers.py in error_remapped_callable(*args, **kwargs)
     71     try:
---> 72         return callable_(*args, **kwargs)
     73     except grpc.RpcError as exc:

11 frames
_InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
	status = StatusCode.INVALID_ARGUMENT
	details = "Machine type "ct4p-hightpu-4t" is not supported."
	debug_error_string = "UNKNOWN:Error received from peer ipv4:173.194.196.95:443 {created_time:"2024-03-08T21:11:42.821027279+00:00", grpc_status:3, grpc_message:"Machine type "ct4p-hightpu-4t" is not supported."}"
>

The above exception was the direct cause of the following exception:

InvalidArgument                           Traceback (most recent call last)
/usr/local/lib/python3.10/dist-packages/google/api_core/grpc_helpers.py in error_remapped_callable(*args, **kwargs)
     72         return callable_(*args, **kwargs)
     73     except grpc.RpcError as exc:
---> 74         raise exceptions.from_grpc_error(exc) from exc
     75
     76     return error_remapped_callable

InvalidArgument: 400 Machine type "ct4p-hightpu-4t" is not supported.
```
Steps to Reproduce the Problem
Run this code:

```python
# @title Deploy

# @markdown This section uploads the model to Model Registry and deploys it on the Endpoint. It takes 15 minutes to 1 hour to finish.

# @markdown Hex-LLM is a High-Efficiency Large Language Model (LLM) TPU serving solution built with XLA, which is being developed by Google Cloud. This notebook uses TPU v5e machines. Click "Show code" to see more details.

if LOAD_MODEL_FROM != "Kaggle":
    print("Skipped: Expect to load model from Kaggle, got", LOAD_MODEL_FROM)
else:
    if "2b" in KAGGLE_MODEL_ID:
        # Sets ct5lp-hightpu-1t (1 TPU chip) to deploy Gemma 2B models.
        machine_type = "ct5lp-hightpu-1t"
    else:
        # Sets ct5lp-hightpu-4t (4 TPU chips) to deploy Gemma 7B models.
        machine_type = "ct4p-hightpu-4t"

    # Note that a larger max_num_batched_tokens will require more TPU memory.
    max_num_batched_tokens = 11264
    # Multiple of tokens for padding alignment. A higher value can reduce
    # re-compilation but can also increase the waste in computation.
    tokens_pad_multiple = 1024
    # Multiple of sequences for padding alignment. A higher value can reduce
    # re-compilation but can also increase the waste in computation.
    seqs_pad_multiple = 32

    print("Using model from: ", output_folder)
    model, endpoint = deploy_model_hexllm(
        model_name=get_job_name_with_datetime(prefix="gemma-serve-hexllm"),
        base_model_id=f"google/{KAGGLE_MODEL_ID}",
        model_id=output_folder,
        service_account=SERVICE_ACCOUNT,
        machine_type=machine_type,
        max_num_batched_tokens=max_num_batched_tokens,
        tokens_pad_multiple=tokens_pad_multiple,
        seqs_pad_multiple=seqs_pad_multiple,
    )
    print("endpoint_name:", endpoint.name)
```
Specifications
- Version:
- Platform: Colab Enterprise
@KCFindstr: Can you please take a look at this issue? Thanks.
From the original notebook, the correct machine type is ct5lp-hightpu-4t. You might have accidentally modified the machine type. Please let me know if ct5lp-hightpu-4t works.
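For reference, the corrected branch of the deploy cell would read as follows (reconstructed from the comment above; only the machine_type string changes):

```python
    else:
        # Sets ct5lp-hightpu-4t (4 TPU chips) to deploy Gemma 7B models.
        machine_type = "ct5lp-hightpu-4t"
```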
I had originally tried with ct5lp-hightpu-4t, and it gives the same error; ct5lp-hightpu-1t gives the same error as well. I have also raised a support request, and there is no issue with permissions.
Still getting the same error.
Hi @kathyyu-google, would you please take a look at this Hex-LLM deployment failure?
Based on the endpoint ID from the logs (projects/81995035742/locations/us-central1/endpoints/6941658909824253952), this endpoint was created in the us-central1 region. TPU deployment is supported only in the us-west1 region. Please update the variable REGION and re-attempt the deployment. Please also verify that there is available TPU quota (see the "Request for TPU quota" section of the notebook for more details).
ct5lp-hightpu-*t is the expected machine type family here, e.g. ct5lp-hightpu-1t and ct5lp-hightpu-4t.
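For illustration, a minimal sketch of the change to make before re-running the deploy cell (PROJECT_ID and the other notebook variables are assumed to be defined as in the notebook):

```python
from google.cloud import aiplatform

# TPU deployment is supported only in us-west1; the failing endpoint was in us-central1.
REGION = "us-west1"

# Re-initialize the SDK so the new Model and Endpoint are created in the TPU region.
aiplatform.init(project=PROJECT_ID, location=REGION)  # PROJECT_ID as set earlier in the notebook

# ct5lp-hightpu-*t (TPU v5e) is the expected machine type family.
machine_type = "ct5lp-hightpu-4t"  # 4 chips for Gemma 7B; ct5lp-hightpu-1t for 2B
```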
Is Hex-LLM open source or private, and can it be used with the Cloud TPU API?
Hex-LLM is a closed-source serving solution. It can't be used directly with Cloud TPU; it can be used with deployments on Vertex Online Prediction. Example notebooks include https://github.com/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/community/model_garden/model_garden_gemma2_deployment_on_vertex.ipynb.
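As a sketch of that pattern: once a Hex-LLM model is deployed to a Vertex endpoint (for example, via the notebook above), it is queried through the standard Vertex SDK. The instance schema below ({"prompt": ..., "max_tokens": ...}) is an assumption based on common Model Garden text-generation notebooks, not a documented Hex-LLM contract:

```python
from google.cloud import aiplatform

# Hypothetical resource name of an existing Hex-LLM endpoint (fill in your own).
endpoint = aiplatform.Endpoint(
    "projects/PROJECT_NUMBER/locations/us-west1/endpoints/ENDPOINT_ID"
)

# Send a text-generation request; the instance field names are an assumed schema.
response = endpoint.predict(instances=[{"prompt": "What is a TPU?", "max_tokens": 128}])
print(response.predictions)
```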
Closing this issue as the original question about TPU machine type and region has been answered. Please reopen if needed, thanks!
Very good solution, but I think it still doesn't support most models. I got an error with the DeepSeek 7B model, which is based on the Llama 2 architecture, and it also doesn't support Qwen2 models, which are also very good. Are you considering adding support for more models in the future?
Thank you @vanshi3214 for the suggestion! Currently it supports Gemma, Llama, Mistral, and Mixtral models. We will support more models in the future and will take your suggestion into consideration.