[POC] Encoder Disaggregation
This PR is not ready to be merged or fully reviewed yet.
However, since the essential building blocks are already implemented and pass my naive single-request test, I am opening it as a draft PR for anyone interested.
## Motivation
InternVL3.5 proposed Decoupled Vision Deployment (DvD), known in many other papers as Encode-Prefill-Decode (EPD) disaggregation; we use the term EPD in the rest of this description. This paradigm has the potential to improve both Time-to-First-Token (TTFT) and throughput.
## Design
In this section, we explain the design logic of EPD in lmdeploy.
- Entire workflow
<details><summary>Sequence Diagram (click to expand)</summary>

```mermaid
sequenceDiagram
    participant Client
    participant Proxy as API Server (Proxy)
    participant AsyncEngine
    participant Encoder as Encoder Engine
    participant LLM as LLM Engine
    participant P2P as P2P Connection (ZMQ)

    %% Phase 1: Encoder Processing
    Client->>+Proxy: POST /v1/chat/completions (prompt, image_data)
    Proxy->>+AsyncEngine: generate(prompt, image_data)
    AsyncEngine->>+Encoder: step(ADD_MESSAGE, image_data)
    Encoder->>Encoder: Process image, generate feature embeddings
    Note right of Encoder: Features stored in its local cache blocks
    Encoder-->>-AsyncEngine: InferOutput (encoder_result)
    Note over AsyncEngine, Encoder: encoder_result contains feature block_ids

    %% Phase 2: Feature Cache Migration
    AsyncEngine->>+LLM: step(ADD_MESSAGE, prompt, encoder_result)
    LLM->>LLM: _on_add_message() -> _add_message()
    Note right of LLM: Sequence created, status = WAITING_EPD_MIGRATION
    LLM->>LLM: _async_loop_epd_migration() -> schedule_epd_migration()
    Note right of LLM: Sequence status -> RUNNING_EPD_MIGRATION
    LLM->>LLM: executor.migrate(encoder_blocks -> llm_blocks)
    Note right of LLM: Physical copy of feature embeddings
    LLM->>+P2P: zmq_send(ack)
    P2P->>+Encoder: Receive ACK
    Encoder->>Encoder: Free its copy of the feature cache blocks
    deactivate P2P
    deactivate Encoder
    Note right of LLM: After migration, sequence status -> WAITING
    deactivate LLM

    %% Phase 3: LLM Prefill & Decode
    AsyncEngine->>+LLM: step(empty request to trigger prefill)
    LLM->>LLM: _async_loop_main() -> schedule(is_prefill=True)
    Note right of LLM: Sequence status -> RUNNING
    LLM->>LLM: executor.forward(prompt + migrated_features)
    Note right of LLM: GPU performs prefill, generates 1st token
    LLM-->>-AsyncEngine: InferOutput (1st token)
    AsyncEngine-->>Proxy: Stream 1st token
    Proxy-->>Client: Stream 1st token

    loop Decode Loop
        AsyncEngine->>+LLM: step(empty request to trigger decode)
        LLM->>LLM: _async_loop_main() -> schedule(is_prefill=False)
        LLM->>LLM: executor.forward(last_token)
        LLM-->>-AsyncEngine: InferOutput (next_token)
        AsyncEngine-->>Proxy: Stream next_token
        Proxy-->>Client: Stream next_token
    end

    %% Phase 4: Finish
    Note over LLM, AsyncEngine: Generation finishes (EOS/max_tokens)
    LLM-->>AsyncEngine: InferOutput (finish=True)
    AsyncEngine->>+LLM: step(END_SESSION)
    LLM->>LLM: _on_end_session() -> scheduler.end_session()
    Note right of LLM: Frees all resources for the session
    deactivate LLM
    AsyncEngine-->>Proxy: Close stream
    Proxy-->>Client: Close connection
    deactivate Proxy
    deactivate AsyncEngine
```

</details>
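Phase 2 above (feature cache migration followed by an ACK that lets the encoder free its blocks) can be sketched in miniature as follows. All names here are illustrative, not the actual implementation, and in the real design the ACK travels over the ZMQ P2P link:

```python
# Toy model of Phase 2: the LLM engine copies the encoder's feature
# blocks into its own cache, then acks so the encoder can free them.
from dataclasses import dataclass, field


@dataclass
class FeatureCache:
    """Stand-in for an engine's feature cache: block_id -> payload."""
    blocks: dict = field(default_factory=dict)

    def allocate(self, block_id, payload):
        self.blocks[block_id] = payload

    def free(self, block_ids):
        for bid in block_ids:
            self.blocks.pop(bid, None)


def migrate(encoder_cache, llm_cache, block_ids, send_ack):
    """Copy feature blocks E -> PD, then ack so the encoder frees its copy."""
    for bid in block_ids:
        # Physical copy of the feature embeddings into the LLM-side cache.
        llm_cache.allocate(bid, encoder_cache.blocks[bid])
    # In the real design this ACK is sent over the ZMQ P2P connection;
    # here we just invoke the callback directly.
    send_ack(block_ids)
```

After the ACK, the encoder's copy is gone and only the LLM-side cache holds the features, which is exactly the hand-off the sequence diagram describes.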
PD disaggregation (distserve) attaches a `migration_request` to the P instance's response, which is then routed to the D instance. Similarly, we propose a new attribute, `encoder_result`, attached to the E instance's response and routed to the PD instance.
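As a sketch, the `encoder_result` payload might look like the following; the field names are illustrative assumptions, not the actual implementation:

```python
# Hypothetical shape of the encoder_result attached to the E instance's
# response, mirroring how PD disaggregation attaches migration_request.
from dataclasses import dataclass, asdict


@dataclass
class EncoderResult:
    session_id: int
    block_ids: list        # feature cache blocks still held on the E instance
    remote_engine_id: str  # which E instance holds the features
    token_offsets: list    # where the image features slot into the prompt


def attach_encoder_result(response: dict, result: EncoderResult) -> dict:
    """Proxy-side helper: attach encoder_result before routing to PD."""
    response["encoder_result"] = asdict(result)
    return response
```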
- State transitions
<details><summary>State Diagram (click to expand)</summary>

```mermaid
stateDiagram-v2
    direction LR
    [*] --> WAITING_EPD_MIGRATION: Request with encoder_result
    [*] --> WAITING_MIGRATION: Request with migration_request
    [*] --> WAITING: Standard Request

    state "Encoder-Prefill-Decode Path" as EPD_Path {
        WAITING_EPD_MIGRATION --> RUNNING_EPD_MIGRATION: Scheduler._schedule_epd_migration()
        RUNNING_EPD_MIGRATION --> EPD_MIGRATION_LOCKED: Engine locks after migration
        EPD_MIGRATION_LOCKED --> WAITING: Engine unlocks, ready for prefill
    }

    state "Prefill-Decode Path" as PD_Path {
        WAITING_MIGRATION --> RUNNING_MIGRATION: Scheduler._schedule_migration()
        RUNNING_MIGRATION --> MIGRATION_LOCKED: Engine locks after migration
        MIGRATION_LOCKED --> MIGRATION_DONE: Engine unlocks
        MIGRATION_DONE --> RUNNING: Scheduler.collect_migration_done()
    }

    state "Standard Inference Path" as Standard_Path {
        WAITING --> RUNNING: Scheduler._schedule_prefill()
        RUNNING --> LOCKED: Engine locks for forward pass
        LOCKED --> RUNNING: Engine unlocks after forward pass
        RUNNING --> WAITING: Evicted during decode scheduling
    }

    RUNNING --> ENDED: Generation finished (EOS/max_tokens)
    RUNNING --> STOPPED: User cancelled
    RUNNING --> ABORTED: Error (e.g., OOM)
    STOPPED --> ENDED
    ABORTED --> ENDED
    ENDED --> [*]
```

</details>
To migrate features from the E instance to the PD instance, we add the relevant scheduling logic inside the PyTorch engine. Specifically, we treat the E -> PD scheduling and migration as an extension of the current PD disaggregation, adding extra states such as `WAITING_EPD_MIGRATION`, `RUNNING_EPD_MIGRATION`, and `EPD_MIGRATION_LOCKED`.
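The added states and their transitions can be sketched as a toy model; the `MessageStatus` member names follow the state diagram above, while the helper functions are illustrative, not the engine's actual scheduling code:

```python
# Sketch of the extra EPD sequence states and how a sequence moves
# through them, analogous to the existing PD disaggregation states.
from enum import Enum, auto


class MessageStatus(Enum):
    WAITING = auto()
    RUNNING = auto()
    WAITING_EPD_MIGRATION = auto()
    RUNNING_EPD_MIGRATION = auto()
    EPD_MIGRATION_LOCKED = auto()


def schedule_epd_migration(status):
    """WAITING_EPD_MIGRATION -> RUNNING_EPD_MIGRATION, as done by
    Scheduler._schedule_epd_migration(); other states are untouched."""
    if status is MessageStatus.WAITING_EPD_MIGRATION:
        return MessageStatus.RUNNING_EPD_MIGRATION
    return status


def finish_epd_migration(status):
    """After the physical copy: lock, then release to WAITING so the
    sequence re-enters the standard prefill path."""
    if status is MessageStatus.RUNNING_EPD_MIGRATION:
        return MessageStatus.EPD_MIGRATION_LOCKED
    if status is MessageStatus.EPD_MIGRATION_LOCKED:
        return MessageStatus.WAITING
    return status
```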
## Modifications
Modifications are threefold:
- Proxy Router / EPD Connections (credit to @FirwoodLin)
  - New engine role `Encoder`.
  - Proxy routing.
  - P2P connections/initializations.
- Multimodal Engine
  - A separate engine for the encoder.
  - A multimodal cache engine (credit to @FirwoodLin).
- LLM Engine
  - Accept results from the encoder side.
  - Schedule multimodal cache migration.
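For illustration, the proxy-side routing decision could look like the following minimal sketch; the function and parameter names are hypothetical, not lmdeploy's actual proxy API:

```python
# Hypothetical proxy routing for EPD: multimodal requests visit an
# Encoder instance first, and the encoder's response (carrying
# encoder_result) is then routed to a Prefill-Decode instance;
# text-only requests skip the encoder entirely.
def route(request: dict, encoder_urls: list, pd_urls: list) -> list:
    """Return the ordered list of instance URLs this request must visit."""
    has_images = bool(request.get("image_data"))
    if has_images and encoder_urls:
        return [encoder_urls[0], pd_urls[0]]  # E -> PD path
    return [pd_urls[0]]                       # plain PD path
```

A real router would also load-balance across instances and keep the E -> PD pairing consistent with the established P2P connections; the sketch only shows the role-selection logic.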
## Performance
TODO
## Tasks
- [ ] Extensive refactoring and fixing
  - Multimodal engine
  - Minimal modifications to the LLM engine
  - Proxy routing logic
- [ ] Accuracy and performance tests
  - Multi-batch
  - Metrics implementation
  - Performance tests and optimizations
- [ ] Compatible with turbomind
- [ ] Preserve existing usage for VL models (non-EPD mode)
- [x] Automatic p2p connection warmup
- [x] Disable LLM weight loading for MM engine
- [ ] Extend EPD for QwenVL series and more
## Related
- https://github.com/InternLM/lmdeploy/issues/3905