djl-serving

rolling batch does not work

Open • prgawade opened this issue 8 months ago • 1 comment

## Description

We deployed the Salesforce codegen-2b-multi model on NVIDIA GPU infrastructure with the following serving.properties:

engine=MPI
option.rolling_batch=lmi-dist # tested with both lmi-dist and auto
option.max_rolling_batch_size=8
option.max_rolling_batch_prefill_tokens=1088
option.paged_attention=false
option.model_loading_timeout = 3600
option.entryPoint=djl_python.deepspeed
chunked_read_timeout= 3
option.tensor_parallel_degree=1
option.task=text-generation
option.dtype=fp16
gpu.minWorkers=1
gpu.maxWorkers=1
log_model_metric=true
metrics_aggregation=10
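As a sanity check on the configuration above, here is a minimal sketch of how these lines would be read under `java.util.Properties`-style rules. That the server parses the file this way is an assumption, and `parse_properties` is a hypothetical helper, not part of DJL Serving. It simplifies the format (no line continuations or escape sequences). Under these rules a `#` only starts a comment at the beginning of a line, so a trailing `# ...` would become part of the value.

```python
def parse_properties(text):
    """Simplified java.util.Properties-style parser (hypothetical helper).

    Assumptions: no line continuations, no escape sequences, and the key
    ends at the first '=' or ':' on the line.
    """
    props = {}
    for raw in text.splitlines():
        line = raw.lstrip()
        # Blank lines and lines starting with '#' or '!' are skipped.
        if not line or line[0] in "#!":
            continue
        for i, ch in enumerate(line):
            if ch in "=:":
                key, value = line[:i], line[i + 1:]
                break
        else:
            key, value = line, ""
        # Whitespace around the separator is dropped; the rest of the
        # line, including any inline '#', stays in the value.
        props[key.strip()] = value.lstrip()
    return props


# A few lines taken from the configuration in this issue.
SAMPLE = """\
engine=MPI
option.rolling_batch=lmi-dist # tested with both lmi-dist and auto
option.model_loading_timeout = 3600
chunked_read_timeout= 3
"""

props = parse_properties(SAMPLE)
for key, value in props.items():
    print(f"{key!r} -> {value!r}")
```

If the inline comment is present in the deployed file and not just in this report, it may be worth moving it to its own line before retesting.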

## Expected Behavior

Rolling batch should work in DJL Serving with this configuration.

## Error Message

INFO ModelServer BOTH API bind to: http://0.0.0.0:8080
WARN PyProcess W-88-models-stderr: [1,0]:Setting pad_token_id to eos_token_id:50256 for open-end generation.
WARN InferenceRequestHandler Chunk reading interrupted
java.lang.IllegalStateException: Read chunk timeout.
    at ai.djl.inference.streaming.ChunkedBytesSupplier.next(ChunkedBytesSupplier.java:79) ~[api-0.23.0.jar:?]
    at ai.djl.inference.streaming.ChunkedBytesSupplier.nextChunk(ChunkedBytesSupplier.java:93) ~[api-0.23.0.jar:?]
    at ai.djl.serving.http.InferenceRequestHandler.sendOutput(InferenceRequestHandler.java:380) ~[serving-0.23.0.jar:?]
    at ai.djl.serving.http.InferenceRequestHandler.lambda$runJob$5(InferenceRequestHandler.java:286) ~[serving-0.23.0.jar:?]
    at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:859) [?:?]
    at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:837) [?:?]
    at java.util.concurrent.CompletableFuture$Completion.exec(CompletableFuture.java:479) [?:?]
    at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:290) [?:?]
    at java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1020) [?:?]
    at java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1656) [?:?]
    at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1594) [?:?]
    at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:183) [?:?]

prgawade, Oct 10 '23 10:10