djl-serving
Endpoint Timeout after 60 Seconds for Longer Generation
Description
I have deployed the Qwen 2.5 32B model on a g5.48xlarge instance with the following configuration: "OPTION_GPU_MEMORY_UTILIZATION": "0.9", "OPTION_MAX_MODEL_LEN": "16000", "OPTION_ROLLING_BATCH": "vllm", "TENSOR_PARALLEL_SIZE": "8", "PREDICT_TIMEOUT": "600"
I'm also loading the model from an S3 location. The model deploys fine and is able to generate results, but when the generation is long and takes more than 1 minute, we get a timeout.
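For illustration, here is a minimal sketch of how settings like these are usually passed as the container environment of a SageMaker model. The key names mirror the report above (the batching backend key is written `OPTION_ROLLING_BATCH`, which is the spelling DJL-LMI expects); the actual `sagemaker.Model(...).deploy(...)` call is only mentioned in a comment because it needs AWS credentials and a container image URI:

```python
# Hypothetical sketch: container environment for a DJL-LMI (vLLM) deployment.
# Values are strings because SageMaker passes them to the container as
# environment variables.
env = {
    "OPTION_GPU_MEMORY_UTILIZATION": "0.9",  # fraction of GPU memory vLLM may use
    "OPTION_MAX_MODEL_LEN": "16000",         # maximum context length in tokens
    "OPTION_ROLLING_BATCH": "vllm",          # continuous-batching backend
    "TENSOR_PARALLEL_SIZE": "8",             # shard across the 8 GPUs of a g5.48xlarge
    "PREDICT_TIMEOUT": "600",                # prediction timeout, in seconds
}

# This dict would typically be handed to sagemaker.Model(..., env=env)
# before calling model.deploy(...); that part is omitted here.
print(env)
```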
I saw that this seems to be a SageMaker endpoint issue: https://github.com/aws/sagemaker-python-sdk/issues/1119. Are there any arguments we can pass to increase the timeout, and is this even supported by DJL LMI serving? I tried creating an Async Endpoint, but it showed the following error:
AsyncInferenceModelError: Model returned error: b'Connection reset by peer for the
my-lmi-async-endpoint-2024-12-27-09-01-34-305 endpoint. Please retry.'
Logs
2024-12-27T13:35:07.118Z [INFO ] PyProcess - W-167-06992c5fb70498b-stdout: [1,0]<stdout>:INFO::[RequestId=752a507f-2bc4-40db-93b4-b50ce3b87aaf] parsed and scheduled for inference
2024-12-27T13:36:02.993Z [WARN ] InferenceRequestHandler - Chunk reading interrupted
2024-12-27T13:36:02.993Z java.lang.IllegalStateException: Read chunk timeout.
2024-12-27T13:36:02.993Z     at ai.djl.inference.streaming.ChunkedBytesSupplier.next(ChunkedBytesSupplier.java:79) ~[api-0.31.0.jar:?]
2024-12-27T13:36:02.993Z     at ai.djl.inference.streaming.ChunkedBytesSupplier.nextChunk(ChunkedBytesSupplier.java:93) ~[api-0.31.0.jar:?]
2024-12-27T13:36:02.993Z     at ai.djl.serving.http.InferenceRequestHandler.sendOutput(InferenceRequestHandler.java:418) ~[serving-0.31.0.jar:?]
2024-12-27T13:36:02.993Z     at ai.djl.serving.http.InferenceRequestHandler.lambda$runJob$5(InferenceRequestHandler.java:313) ~[serving-0.31.0.jar:?]
2024-12-27T13:36:02.993Z     at java.base/java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:863) [?:?]
2024-12-27T13:36:02.993Z     at java.base/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:841) [?:?]
2024-12-27T13:36:02.993Z     at java.base/java.util.concurrent.CompletableFuture$Completion.exec(CompletableFuture.java:483) [?:?]
2024-12-27T13:36:02.993Z     at java.base/java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:373) [?:?]
2024-12-27T13:36:02.993Z     at java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1182) [?:?]
2024-12-27T13:36:02.993Z     at java.base/java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1655) [?:?]
2024-12-27T13:36:02.993Z     at java.base/java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1622) [?:?]
2024-12-27T13:36:07.117Z     at java.base/java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:165) [?:?]
Hi @prashantsolanki975 - sorry for the delay here.
We had some reports of similar issues to what you are facing, and the resolution requires setting another configuration:
SERVING_CHUNKED_READ_TIMEOUT=<time in seconds>.
Depending on the maximum request time you expect, set this configuration accordingly. It is needed for both real-time endpoints and async endpoints (the latter are only supported from DJL 0.31.0 onwards).
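As a sketch of the fix: assuming requests may run for up to ten minutes, SERVING_CHUNKED_READ_TIMEOUT can be added to the same environment dict that already configures the container (the other keys shown are placeholders for whatever model options you use; the deployment call itself is omitted):

```python
# Sketch: extend the container environment with the chunked-read timeout.
# Pick a value at least as large as the longest generation you expect.
max_expected_request_seconds = 600  # assumption: up to 10-minute generations

env = {
    "OPTION_ROLLING_BATCH": "vllm",
    # ... other model options as before ...
    # Raise DJL Serving's chunk-read timeout; per this thread, non-streaming
    # responses otherwise time out after about 60 seconds.
    "SERVING_CHUNKED_READ_TIMEOUT": str(max_expected_request_seconds),
}

print(env["SERVING_CHUNKED_READ_TIMEOUT"])
```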
As for why this configuration is needed: we made a change to how we handle non-streaming requests, to support async endpoints and proper HTTP status codes. That change reuses much of the streaming code path, but it had an unforeseen impact on non-streaming inference requests that take longer than 60 seconds.
If you are planning to use this with real-time endpoints, I also recommend validating that your SageMaker endpoint has been set up with a longer timeout (the default is 60 seconds).
I hope this helps. I will be updating our docs, and including a test that will exercise this behavior.
@siddvenk sorry, but where can I add/change this default config? I can't find any related documentation about this.
You can set it as an environment variable. I will need to update our docs to reflect this configuration.
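For example, something like the following should work when launching DJL Serving yourself; the docker image tag in the comment is a placeholder, and on SageMaker the same key goes into the model's Environment map instead:

```shell
# Set the chunked-read timeout (in seconds) before starting DJL Serving.
export SERVING_CHUNKED_READ_TIMEOUT=600

# When launching the LMI container directly, pass it through docker instead, e.g.:
#   docker run -e SERVING_CHUNKED_READ_TIMEOUT=600 ... deepjavalibrary/djl-serving:0.31.0
# On a SageMaker endpoint, add the same key to the model's Environment map.

echo "SERVING_CHUNKED_READ_TIMEOUT=$SERVING_CHUNKED_READ_TIMEOUT"
```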
@siddvenk could you please share the docs for SERVING_CHUNKED_READ_TIMEOUT? I would like to understand its default value and behaviour.
This issue is stale because it has been open for 30 days with no activity.
@Saketh-nakkina did you manage to solve the issue? We are facing exactly the same issue