Optimize multimodal resource allocation with concurrency and improved batch RPC
Summary
This PR introduces a comprehensive performance overhaul of the multimodal resource allocation pipeline. It refactors both `httpserver.manager` and the server (`CacheServer`) to replace sequential, "chatty" operations with a concurrent, batched approach. This significantly reduces latency and improves throughput, especially for requests with a large number of multimodal items.
Bottleneck Problem
The original implementation was inefficient due to two primary bottlenecks:
- Sequential Client-Side Processing: In `httpserver.manager`, I/O operations (reading files) and CPU-bound tasks (calculating MD5s, `create_shm`) for each multimodal item were executed one after another.
- RPC Overhead: The communication protocol itself was the main contributor. The original `exposed_alloc` signature was `alloc(self, md5sum_list: list[str], token_num_list: list[int])`. Although it handled lists, rpyc serializes each argument (`md5sum_list` and `token_num_list`) independently, which incurs significant overhead (see the sketch after this list for a before/after comparison):
  - rpyc has to traverse and serialize the entire structure of each list argument separately.
  - This per-argument serialization is computationally expensive, and the cost grows with the number of items in the lists.
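To make the difference concrete, here is a minimal client-side sketch contrasting the two call styles; the use of pickle and the payload layout are illustrative assumptions, not the PR's exact wire format:

```python
import pickle

# Original style: rpyc traverses and serializes each list argument on its
# own, so the serialization cost grows with the number of multimodal items.
def alloc_original(cache_client, md5sum_list, token_num_list):
    return cache_client.root.alloc(md5sum_list, token_num_list)

# Batched style: the client packs the whole batch into one bytes payload,
# so rpyc only ships a single opaque argument regardless of item count.
def alloc_batched(cache_client, items):
    payload = pickle.dumps([(it["md5sum"], it["token_num"]) for it in items])
    return pickle.loads(cache_client.root.alloc_v2(payload))
```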
Solution (v2 Implementation)
✅ Concurrent Processing
- The `_alloc_multimodal_resources_v2` function now uses a `ThreadPoolExecutor` to concurrently read item data and calculate MD5 sums, fully leveraging available CPU cores.
- Shared memory (SHM) creation is also parallelized using `asyncio.gather` to expedite resource setup (see the sketch after this list).
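A minimal sketch of the concurrent flow described above; `read_fn` and `create_shm_fn` stand in for the real read and shared-memory helpers, and the overall shape is an assumption rather than the PR's exact code:

```python
import asyncio
import hashlib
from concurrent.futures import ThreadPoolExecutor


async def alloc_resources_sketch(items, read_fn, create_shm_fn, workers=4):
    # read_fn: blocking callable returning an item's raw bytes.
    # create_shm_fn: async callable that sets up shared memory for one item.
    loop = asyncio.get_running_loop()

    def read_and_hash(item):
        # I/O-bound read plus CPU-bound MD5, offloaded to a worker thread.
        data = read_fn(item)
        return item, data, hashlib.md5(data).hexdigest()

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Read item data and compute MD5 sums concurrently.
        hashed = await asyncio.gather(
            *(loop.run_in_executor(pool, read_and_hash, it) for it in items)
        )

    # Shared-memory creation is parallelized as well.
    await asyncio.gather(*(create_shm_fn(it, d, md5) for it, d, md5 in hashed))
    return [md5 for _, _, md5 in hashed]
```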
✅ New Batched RPC Interface
- To eliminate RPC chattiness, new `exposed_*_v2` methods have been added to the CacheServer. These include `alloc_v2`, `release_v2`, `set_items_data_v2`, `get_items_data_v2`, `set_items_embed_v2`, and `get_items_embed_v2`.
✅ Server-Side Batch Handling
- The `CacheServer`'s new v2 endpoints deserialize the request blob, process the batch of items internally, and return a single serialized response. This makes the server-side logic more efficient and cohesive (a sketch follows below).
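As a rough illustration of this server-side shape (a minimal sketch, not the PR's actual code: the payload layout, the use of pickle, and the `_alloc_one` helper are assumptions):

```python
import pickle

import rpyc


class CacheServerSketch(rpyc.Service):
    def exposed_alloc_v2(self, request_blob: bytes) -> bytes:
        # One RPC argument: the whole batch arrives as a single opaque blob.
        items = pickle.loads(request_blob)  # e.g. [(md5sum, token_num), ...]
        results = [self._alloc_one(md5sum, token_num) for md5sum, token_num in items]
        # One serialized response covers the entire batch.
        return pickle.dumps(results)

    def _alloc_one(self, md5sum: str, token_num: int):
        # Placeholder for the server's real per-item allocation logic.
        return (md5sum, token_num, True)
```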
✅ Feature Toggle
- Added `--enable_concurrent_alloc` and `--concurrent_alloc_workers` parameters to control the new concurrent allocation behavior. This allows for a gradual rollout and an easy fallback to the original implementation if needed (see the sketch below).
- In `audioserver`/`visualserver.manager`, `get_items_embed` now defaults to the v2 implementation to reduce latency.
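For reference, the two flags can be wired up roughly as follows (a sketch; the default value shown is an assumption, not necessarily the PR's):

```python
import argparse

parser = argparse.ArgumentParser()
# Opt-in switch: when absent, the original sequential path is used.
parser.add_argument("--enable_concurrent_alloc", action="store_true")
# Thread-pool size for reading items and computing MD5 sums concurrently.
parser.add_argument("--concurrent_alloc_workers", type=int, default=4)
```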
Performance Evaluation
I evaluated the performance using images of the same size (644×364) during inference with our internal LLaVA-like model.
Testing Environment:
- GPU: Single NVIDIA A100
- CPU: Two Intel Xeon Platinum 8358P @ 2.60GHz (128 logical cores)
- Parameter: `concurrent_alloc_workers=4`
The reported values are averages and may fluctuate slightly, but not significantly.
| Image num | `cache_client.root.alloc` time, origin (ms) | `cache_client.root.alloc` time, optimized (ms) | `_alloc_multimodal_resources` func time, origin (ms) | `_alloc_multimodal_resources` func time, optimized (ms) |
|---|---|---|---|---|
| 4 | 43.4 | 1.60 | 129.8 | 5.95 |
| 8 | 43.2 | 1.49 | 129.8 | 5.90 |
| 16 | 42.2 | 1.00 | 128.0 | 5.96 |
| 32 | 42.4 | 1.16 | 130.6 | 10.0 |
| 64 | 44.0 | 1.58 | 141.2 | 15.9 |
| 128 | 44.6 | 2.17 | 142.6 | 26.4 |