vertex-ai-samples
Internal Server Error When Sending Predict Requests to Vertex AI Endpoint with Mixtral 8x7B Instruct Model
Issue Description
When sending prediction requests to a Vertex AI endpoint serving the Mixtral model, I encounter an InternalServerError with details hinting at resource constraints or async execution issues, specifically mentioning OutOfMemoryError and errors in async_llm_engine.py, model_runner.py, and mixtral.py.
Expected Behavior
The model should process the provided prompt and return a generated response without internal server errors.
Actual Behavior
Requests to the model via the Vertex AI endpoint result in an _InactiveRpcError and a 500 Internal Server Error, indicating potential memory allocation or async execution failures.
In fact, for a "small" request of under roughly 100 tokens the model responds, but beyond roughly 100 tokens this error occurs.
Steps to Reproduce
- Deploy the Mixtral model to Vertex AI following the steps in this notebook.
- Configure the endpoint with the specified parameters.
- Send a prediction request to the endpoint with a detailed prompt (see the sketch after this list).
- Observe the returned internal server error.
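For reference, the request is sent roughly like this. This is a minimal sketch using the google-cloud-aiplatform SDK; the project ID, endpoint ID, and the instance fields are placeholders based on the notebook's vLLM serving format, not the exact code I ran:

```python
# Minimal sketch of the prediction request (placeholder project and endpoint IDs).
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")
endpoint = aiplatform.Endpoint("1234567890")  # placeholder endpoint ID

# Instance fields follow the vLLM serving format used by the notebook (assumed schema).
instances = [
    {
        "prompt": "A detailed prompt of a few hundred tokens ...",
        "max_tokens": 512,
        "temperature": 0.7,
        "top_p": 0.95,
    }
]

response = endpoint.predict(instances=instances)
print(response.predictions)
```

Short prompts in this format return generations; long prompts trigger the error described above.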
Specifications
- Model Version: Mixtral-8x7B-Instruct-v0.1
- Platform: Colab Enterprise in Vertex AI
Logs and Error Messages
The error logs indicate issues such as OutOfMemoryError and failures in async execution paths within the model's implementation.
vllm.engine.async_llm_engine.AsyncEngineDeadError: Task finished unexpectedly. This should never happen! Please open an issue on Github. See stack trace above for the actual cause.
Above these logs there is this one:
"message": "torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 5 has a total capacty of 21.96 GiB of which 14.88 MiB is free. Including non-PyTorch memory, this process has 0 bytes memory in use. Of the allocated memory 19.00 GiB is allocated by PyTorch, and 115.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF"
Maybe it's a memory issue, but I am following the exact notebook specification with this config and with a context window of max_model_len = 4096 as specified:
```python
# Sets 8 L4s to deploy Mixtral 8x7B.
machine_type = "g2-standard-96"
accelerator_type = "NVIDIA_L4"
accelerator_count = 8
```
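For context, my understanding is that inside the notebook's deployment helper these settings end up as vLLM server arguments roughly like the following. The flag names are standard vLLM options, but the exact list the notebook builds may differ, so treat this as an assumption:

```python
# Rough sketch of the vLLM server arguments implied by this configuration
# (standard vLLM flags; the exact list built by the notebook may differ).
vllm_args = [
    "--host=0.0.0.0",
    "--port=7080",
    "--tensor-parallel-size=8",      # one shard per L4 (accelerator_count = 8)
    "--max-model-len=4096",          # context window from the notebook
    "--gpu-memory-utilization=0.9",  # notebook default
]
```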
Additional Context
This issue appears to be related to how the model processes large inputs or manages resources during execution.
@kathyyu-google can you please help or refer to someone? Thanks
Thank you @MangoHiller for bringing this to our attention. We were able to reproduce your error with long prompts, which led to high memory usage. We found that the error could be avoided by decreasing the model server argument --gpu-memory-utilization to 0.85. This argument is defined in the function deploy_model_vllm and was originally set to 0.9.
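For anyone hitting the same error, the change looks roughly like this. The keyword argument names below are an assumption based on the notebook's deploy_model_vllm helper and may differ between notebook versions, so adjust them to match the version you are using:

```python
# Sketch of the fix: lower GPU memory utilization when deploying
# (the exact deploy_model_vllm signature may differ between notebook versions).
model, endpoint = deploy_model_vllm(
    model_name="mixtral-8x7b-instruct",
    model_id="mistralai/Mixtral-8x7B-Instruct-v0.1",
    machine_type="g2-standard-96",
    accelerator_type="NVIDIA_L4",
    accelerator_count=8,
    max_model_len=4096,
    gpu_memory_utilization=0.85,  # decreased from the default 0.9
)
```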
Marking this issue as closed. Please feel free to reopen if there are further comments or prediction errors. Thank you again!