Optimize multimodal resource allocation with concurrency and improved batch RPC
Summary
This PR introduces a comprehensive performance overhaul of the multimodal resource allocation pipeline. It refactors both `httpserver.manager` and the server (`CacheServer`) to replace sequential, "chatty" operations with a concurrent, batched approach. This significantly reduces latency and improves throughput, especially for requests with a large number of multimodal items.
Bottleneck Problem
The original implementation was inefficient due to two primary bottlenecks:
- Sequential Client-Side Processing: In `httpserver.manager`, I/O operations (reading files) and CPU-bound tasks (calculating MD5s, `create_shm`) for each multimodal item were executed one after another.
- RPC Overhead: The communication protocol itself was the main contributor. The original `exposed_alloc` signature was `alloc(self, md5sum_list: list[str], token_num_list: list[int])`. Although it handled lists, rpyc serializes each argument (`md5sum_list` and `token_num_list`) independently, which incurs significant overhead (see the sketch after this list for a before/after comparison):
  - rpyc has to traverse and serialize the entire structure of each list argument separately.
  - This per-argument serialization is computationally expensive, and the cost grows with the number of items in the lists.
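To make the difference concrete, here is a minimal client-side sketch contrasting the two call styles; the use of pickle and the payload layout are illustrative assumptions, not the PR's exact wire format:

```python
import pickle

# Original style: rpyc traverses and serializes each list argument on its
# own, so the serialization cost grows with the number of multimodal items.
def alloc_original(cache_client, md5sum_list, token_num_list):
    return cache_client.root.alloc(md5sum_list, token_num_list)

# Batched style: the client packs the whole batch into one bytes payload,
# so rpyc only ships a single opaque argument regardless of item count.
def alloc_batched(cache_client, items):
    payload = pickle.dumps([(it["md5sum"], it["token_num"]) for it in items])
    return pickle.loads(cache_client.root.alloc_v2(payload))
```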
Solution (v2 Implementation)
✅ Concurrent Processing
- The `_alloc_multimodal_resources_v2` function now uses a `ThreadPoolExecutor` to concurrently read item data and calculate MD5 sums, fully leveraging available CPU cores.
- Shared memory (SHM) creation is also parallelized using `asyncio.gather` to expedite resource setup (see the sketch after this list).
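A minimal sketch of the concurrent flow described above; `read_fn` and `create_shm_fn` stand in for the real read and shared-memory helpers, and the overall shape is an assumption rather than the PR's exact code:

```python
import asyncio
import hashlib
from concurrent.futures import ThreadPoolExecutor


async def alloc_resources_sketch(items, read_fn, create_shm_fn, workers=4):
    # read_fn: blocking callable returning an item's raw bytes.
    # create_shm_fn: async callable that sets up shared memory for one item.
    loop = asyncio.get_running_loop()

    def read_and_hash(item):
        # I/O-bound read plus CPU-bound MD5, offloaded to a worker thread.
        data = read_fn(item)
        return item, data, hashlib.md5(data).hexdigest()

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Read item data and compute MD5 sums concurrently.
        hashed = await asyncio.gather(
            *(loop.run_in_executor(pool, read_and_hash, it) for it in items)
        )

    # Shared-memory creation is parallelized as well.
    await asyncio.gather(*(create_shm_fn(it, d, md5) for it, d, md5 in hashed))
    return [md5 for _, _, md5 in hashed]
```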
✅ New Batched RPC Interface
- To eliminate RPC chattiness, new `exposed_*_v2` methods have been added to the CacheServer. These include `alloc_v2`, `release_v2`, `set_items_data_v2`, `get_items_data_v2`, `set_items_embed_v2`, and `get_items_embed_v2`.
✅ Server-Side Batch Handling
- The `CacheServer`'s new v2 endpoints deserialize the request blob, process the batch of items internally, and return a single serialized response. This makes the server-side logic more efficient and cohesive (a sketch follows below).
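As a rough illustration of this server-side shape (a minimal sketch, not the PR's actual code: the payload layout, the use of pickle, and the `_alloc_one` helper are assumptions):

```python
import pickle

import rpyc


class CacheServerSketch(rpyc.Service):
    def exposed_alloc_v2(self, request_blob: bytes) -> bytes:
        # One RPC argument: the whole batch arrives as a single opaque blob.
        items = pickle.loads(request_blob)  # e.g. [(md5sum, token_num), ...]
        results = [self._alloc_one(md5sum, token_num) for md5sum, token_num in items]
        # One serialized response covers the entire batch.
        return pickle.dumps(results)

    def _alloc_one(self, md5sum: str, token_num: int):
        # Placeholder for the server's real per-item allocation logic.
        return (md5sum, token_num, True)
```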
✅ Feature Toggle
- Added `--enable_concurrent_alloc` and `--concurrent_alloc_workers` parameters to control the new concurrent allocation behavior. This allows for a gradual rollout and an easy fallback to the original implementation if needed (see the sketch below).
- In `audioserver`/`visualserver.manager`, `get_items_embed` now defaults to the v2 implementation to reduce latency.
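For reference, the two flags can be wired up roughly as follows (a sketch; the default value shown is an assumption, not necessarily the PR's):

```python
import argparse

parser = argparse.ArgumentParser()
# Opt-in switch: when absent, the original sequential path is used.
parser.add_argument("--enable_concurrent_alloc", action="store_true")
# Thread-pool size for reading items and computing MD5 sums concurrently.
parser.add_argument("--concurrent_alloc_workers", type=int, default=4)
```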
Performance Evaluation
I evaluated the performance using images of the same size (644×364) during inference with our internal LLaVA-like model.
Testing Environment:
- GPU: Single NVIDIA A100
- CPU: Two Intel Xeon Platinum 8358P @ 2.60GHz (128 logical cores)
- Parameter: `concurrent_alloc_workers=4`
The reported values are averages and may fluctuate slightly, but not significantly.
| Image num | `cache_client.root.alloc` time, origin (ms) | `cache_client.root.alloc` time, optimized (ms) | `_alloc_multimodal_resources` func time, origin (ms) | `_alloc_multimodal_resources` func time, optimized (ms) |
|---|---|---|---|---|
| 4 | 43.4 | 1.60 | 129.8 | 5.95 |
| 8 | 43.2 | 1.49 | 129.8 | 5.90 |
| 16 | 42.2 | 1.00 | 128.0 | 5.96 |
| 32 | 42.4 | 1.16 | 130.6 | 10.0 |
| 64 | 44.0 | 1.58 | 141.2 | 15.9 |
| 128 | 44.6 | 2.17 | 142.6 | 26.4 |