vertex-ai-samples
Internal Server Error When Sending Predict Requests to Vertex AI Endpoint with Mixtral 8x7B Instruct Model
Issue Description
When sending prediction requests to a Vertex AI endpoint serving the Mixtral model, I encounter an InternalServerError with details hinting at resource constraints or async execution issues, specifically mentioning OutOfMemoryError and errors in async_llm_engine.py, model_runner.py, and mixtral.py.
Expected Behavior
The model should process the provided prompt and return a generated response without internal server errors.
Actual Behavior
Requests to the model via the Vertex AI endpoint result in an _InactiveRpcError and a 500 Internal Server Error, indicating potential memory allocation or async execution failures.
In fact, for a "small" request of under roughly 100 tokens the model responds, but beyond roughly 100 tokens this error occurs.
Steps to Reproduce
- Deploy the Mixtral model to Vertex AI following the steps in this notebook.
- Configure the endpoint with the specified parameters.
- Send a prediction request to the endpoint with a detailed prompt (see the sketch after this list).
- Observe the returned internal server error.
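For reference, the request is sent roughly like this. This is a minimal sketch using the google-cloud-aiplatform SDK; the project ID, endpoint ID, and the instance fields are placeholders based on the notebook's vLLM serving format, not the exact code I ran:

```python
# Minimal sketch of the prediction request (placeholder project and endpoint IDs).
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")
endpoint = aiplatform.Endpoint("1234567890")  # placeholder endpoint ID

# Instance fields follow the vLLM serving format used by the notebook (assumed schema).
instances = [
    {
        "prompt": "A detailed prompt of a few hundred tokens ...",
        "max_tokens": 512,
        "temperature": 0.7,
        "top_p": 0.95,
    }
]

response = endpoint.predict(instances=instances)
print(response.predictions)
```

Short prompts in this format return generations; long prompts trigger the error described above.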
Specifications
- Model Version: Mixtral-8x7B-Instruct-v0.1
- Platform: Colab Enterprise in Vertex AI
Logs and Error Messages
The error logs indicate issues such as OutOfMemoryError and failures in async execution paths within the model's implementation.
vllm.engine.async_llm_engine.AsyncEngineDeadError: Task finished unexpectedly. This should never happen! Please open an issue on Github. See stack trace above for the actual cause.
Above these logs there is this one:
"message": "torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 5 has a total capacty of 21.96 GiB of which 14.88 MiB is free. Including non-PyTorch memory, this process has 0 bytes memory in use. Of the allocated memory 19.00 GiB is allocated by PyTorch, and 115.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF"
Maybe it's a memory issue, but I am following the exact notebook specification with this config and with a context window of max_model_len = 4096 as specified:
```python
# Sets 8 L4s to deploy Mixtral 8x7B.
machine_type = "g2-standard-96"
accelerator_type = "NVIDIA_L4"
accelerator_count = 8
```
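For context, my understanding is that inside the notebook's deployment helper these settings end up as vLLM server arguments roughly like the following. The flag names are standard vLLM options, but the exact list the notebook builds may differ, so treat this as an assumption:

```python
# Rough sketch of the vLLM server arguments implied by this configuration
# (standard vLLM flags; the exact list built by the notebook may differ).
vllm_args = [
    "--host=0.0.0.0",
    "--port=7080",
    "--tensor-parallel-size=8",      # one shard per L4 (accelerator_count = 8)
    "--max-model-len=4096",          # context window from the notebook
    "--gpu-memory-utilization=0.9",  # notebook default
]
```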
Additional Context
This issue appears to be related to how the model processes large inputs or manages resources during execution.
@kathyyu-google can you please help or refer to someone? Thanks
Thank you @MangoHiller for bringing this to our attention. We were able to reproduce your error with long prompts, which led to high memory usage. We found that the error could be avoided by decreasing the model server argument --gpu-memory-utilization to 0.85. This argument is defined in the function deploy_model_vllm and was originally set to 0.9.
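For anyone hitting the same error, the change looks roughly like this. The keyword argument names below are an assumption based on the notebook's deploy_model_vllm helper and may differ between notebook versions, so adjust them to match the version you are using:

```python
# Sketch of the fix: lower GPU memory utilization when deploying
# (the exact deploy_model_vllm signature may differ between notebook versions).
model, endpoint = deploy_model_vllm(
    model_name="mixtral-8x7b-instruct",
    model_id="mistralai/Mixtral-8x7B-Instruct-v0.1",
    machine_type="g2-standard-96",
    accelerator_type="NVIDIA_L4",
    accelerator_count=8,
    max_model_len=4096,
    gpu_memory_utilization=0.85,  # decreased from the default 0.9
)
```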
Marking this issue as closed. Please feel free to reopen if there are further comments or prediction errors. Thank you again!