djl-serving
rolling batch does not work
## Description

We deployed the Salesforce codegen-2b-multi model on NVIDIA GPU infrastructure with the following serving.properties:

    engine=MPI
    option.rolling_batch=lmi-dist # tested with both lmi-dist and auto
    option.max_rolling_batch_size=8
    option.max_rolling_batch_prefill_tokens=1088
    option.paged_attention=false
    option.model_loading_timeout = 3600
    option.entryPoint=djl_python.deepspeed
    chunked_read_timeout= 3
    option.tensor_parallel_degree=1
    option.task=text-generation
    option.dtype=fp16
    gpu.minWorkers=1
    gpu.maxWorkers=1
    log_model_metric=true
    metrics_aggregation=10

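
To make the effective settings easier to check, here is a minimal Python sketch that reads the configuration above into a dictionary. It is a simplified key=value reader written for illustration, not DJL Serving's own loader, and the names `SERVING_PROPERTIES` and `parse_properties` are hypothetical. The inline note on `option.rolling_batch` is moved to its own comment line here, because this reader (like Java properties syntax) only recognizes comments that start a line.

```python
# Simplified key=value reader for illustration; not DJL Serving's loader.
SERVING_PROPERTIES = """\
engine=MPI
# tested with both lmi-dist and auto
option.rolling_batch=lmi-dist
option.max_rolling_batch_size=8
option.max_rolling_batch_prefill_tokens=1088
option.paged_attention=false
option.model_loading_timeout = 3600
option.entryPoint=djl_python.deepspeed
chunked_read_timeout= 3
option.tensor_parallel_degree=1
option.task=text-generation
option.dtype=fp16
gpu.minWorkers=1
gpu.maxWorkers=1
log_model_metric=true
metrics_aggregation=10
"""


def parse_properties(text):
    """Parse key=value lines, skipping blank lines and full-line comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        # Only lines that START with '#' or '!' are treated as comments.
        if not line or line.startswith(("#", "!")):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            continue  # ignore lines without '='
        # Whitespace around the key and value is dropped.
        props[key.strip()] = value.strip()
    return props


props = parse_properties(SERVING_PROPERTIES)
for key in ("option.rolling_batch",
            "option.max_rolling_batch_size",
            "chunked_read_timeout"):
    print(f"{key} = {props[key]}")
```

The uneven spacing around `=` in `option.model_loading_timeout` and `chunked_read_timeout` is harmless under this reading, since surrounding whitespace is stripped.
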
## Expected Behavior

Rolling batch should work in DJL Serving with this configuration.
## Error Message

INFO ModelServer BOTH API bind to: http://0.0.0.0:8080
WARN PyProcess W-88-models-stderr: [1,0]pad_token_id to eos_token_id:50256 for open-end generation.
WARN InferenceRequestHandler Chunk reading interrupted
java.lang.IllegalStateException: Read chunk timeout.
at ai.djl.inference.streaming.ChunkedBytesSupplier.next(ChunkedBytesSupplier.java:79) ~[api-0.23.0.jar:?]
at ai.djl.inference.streaming.ChunkedBytesSupplier.nextChunk(ChunkedBytesSupplier.java:93) ~[api-0.23.0.jar:?]
at ai.djl.serving.http.InferenceRequestHandler.sendOutput(InferenceRequestHandler.java:380) ~[serving-0.23.0.jar:?]
at ai.djl.serving.http.InferenceRequestHandler.lambda$runJob$5(InferenceRequestHandler.java:286) ~[serving-0.23.0.jar:?]
at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:859) [?:?]
at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:837) [?:?]
at java.util.concurrent.CompletableFuture$Completion.exec(CompletableFuture.java:479) [?:?]
at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:290) [?:?]
at java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1020) [?:?]
at java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1656) [?:?]
at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1594) [?:?]
at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:183) [?:?]