
[RFC]: Batch API for inference job

Open xinchen384 opened this issue 1 year ago • 2 comments

Summary

This RFC aims to expose a batch API to users so that they can submit a batch job and retrieve the job's status and results at any time after submission. However, current inference engines such as vLLM do not support such a batch feature; this design fills that gap.
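For context, the OpenAI batch API that this proposal targets accepts a JSONL input file, one request per line. A compatible batch service would ingest lines shaped roughly like the following (the model name and message payload are placeholders, not committed values):

```json
{"custom_id": "request-1", "method": "POST", "url": "/v1/chat/completions", "body": {"model": "your-model-name", "messages": [{"role": "user", "content": "Hello"}]}}
```

The `custom_id` lets the user correlate each line of the output file with its request, which is why the storage layer below must persist both input and output durably.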

Motivation

To support a batch API for users running batch inference jobs, our inference system needs to handle batch jobs' input and output and perform time-based scheduling, neither of which falls within the scope of the inference engine. Below, we divide the motivation into two parts: the first covers fundamental capabilities, and the second covers optimizations for better performance.

The first part lists the essential components to make E2E batch inference work. It is motivated by the need to:

  • Enable storage for users' input and output.
  • Manage batch jobs' due time and other status information.
  • Schedule jobs FIFO, with a time-based sliding window.
  • Guarantee at-least-once execution.

With all basic capabilities in place, the second part focuses on performance improvements:

  • Fault tolerance and consistency.
  • Scalability to support a large number of batch jobs.
  • Job scheduling. To maximize the number of jobs meeting their due times, we can schedule jobs more efficiently than plain FIFO. This is transparent to the inference engine.
  • More fine-grained request scheduling, compatible with inference-engine features such as pipeline parallelism.
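For instance, one candidate policy beyond FIFO is earliest-deadline-first (EDF), which orders jobs by due time. A minimal sketch (the function name and `(job_id, deadline)` representation are illustrative assumptions, not a committed interface):

```python
import heapq


def edf_order(jobs):
    """Return job ids in earliest-deadline-first order.

    `jobs` is a list of (job_id, deadline) pairs, where smaller
    deadline means more urgent. Purely illustrative.
    """
    heap = [(deadline, job_id) for job_id, deadline in jobs]
    heapq.heapify(heap)  # min-heap keyed on deadline
    return [heapq.heappop(heap)[1] for _ in range(len(heap))]
```

Because the policy only reorders jobs before their requests are issued, the inference engine never needs to know which policy is in use.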

Proposed Change

For the first part, this proposal builds several fundamental components to support the OpenAI batch API.

  1. Persistent storage.
     (a) Store each job's input and output; this serves users' retrieval requests durably.
     (b) Provide read/write interfaces for request input/output and job metadata.
  2. Job metadata management.
     (a) Handle jobs' state transitions, with a clearly outlined transition diagram among the different states.
     (b) Manage job status, including creation time, current state, scheduled resources, and so on.
     (c) Persist metadata to storage as checkpoints, so that users can retrieve job status consistently.
  3. Job scheduling.
     (a) Maintain a time-based sliding window of jobs; based on job creation time, the window slides every minute.
     (b) Schedule jobs FIFO and issue request queries to the inference engine, preparing all necessary input for it.
     (c) Sync job status: when a response is received from the inference engine, propagate it to the job window and to metadata management.
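To make components 2 and 3 concrete, here is a minimal, hypothetical sketch of a job state machine and a sliding-window FIFO scheduler. All names (`JobState`, `BatchJob`, `JobScheduler`) are illustrative assumptions, not existing aibrix code; the states loosely mirror the OpenAI batch API lifecycle.

```python
import collections
import enum
from dataclasses import dataclass


class JobState(enum.Enum):
    """Job lifecycle states, loosely mirroring the OpenAI batch API."""
    VALIDATING = "validating"
    IN_PROGRESS = "in_progress"
    COMPLETED = "completed"
    FAILED = "failed"
    EXPIRED = "expired"


# Allowed state transitions; the metadata manager rejects anything else.
_TRANSITIONS = {
    JobState.VALIDATING: {JobState.IN_PROGRESS, JobState.FAILED, JobState.EXPIRED},
    JobState.IN_PROGRESS: {JobState.COMPLETED, JobState.FAILED, JobState.EXPIRED},
}


@dataclass
class BatchJob:
    job_id: str
    created_at: float            # unix seconds at submission
    completion_window_s: float   # due time relative to creation, e.g. 24 h
    state: JobState = JobState.VALIDATING

    def transition(self, new_state: JobState) -> None:
        if new_state not in _TRANSITIONS.get(self.state, set()):
            raise ValueError(f"illegal transition {self.state} -> {new_state}")
        self.state = new_state


class JobScheduler:
    """FIFO scheduler over a time-based sliding window of jobs."""

    def __init__(self) -> None:
        self._queue = collections.deque()  # FIFO: append right, pop left

    def submit(self, job: BatchJob) -> None:
        self._queue.append(job)

    def slide(self, now: float) -> list:
        """Called periodically (e.g. every minute): expire overdue jobs
        and return, in FIFO order, the jobs still eligible to run."""
        runnable = []
        kept = collections.deque()
        while self._queue:
            job = self._queue.popleft()
            if now - job.created_at > job.completion_window_s:
                job.transition(JobState.EXPIRED)  # past due: dropped from window
            else:
                runnable.append(job)
                kept.append(job)
        self._queue = kept
        return runnable
```

In a real deployment, each `transition` call would also write a checkpoint to the persistent storage from component 1, so that a restarted scheduler can recover job states consistently.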

The changes listed in the second part of the motivation are paused for now. Once we have a clear outline of the fundamentals, we will have a better understanding of the optimization tasks.

Alternatives Considered

No response

xinchen384 avatar Sep 16 '24 16:09 xinchen384

Cool! Great RFC overall.

xieus avatar Sep 19 '24 05:09 xieus

Remove from v0.2.0 release and move to v0.3.0.

Jeffwan avatar Nov 19 '24 18:11 Jeffwan

I will consider rewriting this issue. The originally proposed idea seems not very close to what we need at the moment.

Jeffwan avatar Jun 27 '25 01:06 Jeffwan