
Optimize multimodal resource allocation with concurrency and improved batch RPC


Summary

This PR introduces a comprehensive performance overhaul of the multimodal resource allocation pipeline. It refactors both httpserver.manager and the cache server (CacheServer) to replace sequential, "chatty" operations with a concurrent, batched approach. This significantly reduces latency and improves throughput, especially for requests with a large number of multimodal items.

Bottleneck Problem

The original implementation was inefficient due to two primary bottlenecks:

  1. Sequential Client-Side Processing: In httpserver.manager, the I/O operations (reading files) and CPU-bound tasks (computing MD5 sums, create_shm) for each multimodal item were executed one after another.
  2. RPC Overhead: The communication protocol itself was the main source of inefficiency. The original exposed_alloc function signature was alloc(self, md5sum_list: list[str], token_num_list: list[int]). Although it accepted lists, rpyc serializes each argument (md5sum_list and token_num_list) independently, which introduces significant overhead (see the sketch after this list):
    • rpyc has to traverse and serialize the entire structure of each list argument separately.
    • This per-argument serialization is computationally expensive, and the cost grows with the number of items in the lists.
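To make the contrast concrete, here is a minimal sketch of the two call shapes. The batched variant, the payload layout, and the use of pickle are illustrative assumptions, not the exact lightllm code:

```python
import pickle

# Placeholder metadata for two multimodal items.
md5sum_list = ["md5-of-item-0", "md5-of-item-1"]
token_num_list = [576, 576]

# Original call shape: two separate list arguments. rpyc traverses and
# serializes each argument independently on every call, so the per-call
# cost grows with the number of items:
#   cache_client.root.alloc(md5sum_list, token_num_list)

# Batched alternative: serialize the whole batch once on the client into a
# single bytes payload that the RPC layer transfers as one opaque value:
request_blob = pickle.dumps(list(zip(md5sum_list, token_num_list)))
#   cache_client.root.alloc_v2(request_blob)
```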

Solution (v2 Implementation)

✅ Concurrent Processing

  • The _alloc_multimodal_resources_v2 function now uses a ThreadPoolExecutor to concurrently read item data and calculate MD5 sums, fully leveraging available CPU cores.
  • Shared memory (SHM) creation is also parallelized using asyncio.gather to expedite resource setup (sketched below).
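A minimal sketch of this concurrent path, as a simplified stand-in for _alloc_multimodal_resources_v2. The item objects with read_data() and create_shm() methods are hypothetical; only the ThreadPoolExecutor / asyncio.gather structure is taken from this PR's description:

```python
import asyncio
import hashlib
from concurrent.futures import ThreadPoolExecutor

async def alloc_multimodal_resources_sketch(items, workers=4):
    """Illustrative outline of the concurrent allocation path."""
    loop = asyncio.get_running_loop()
    pool = ThreadPoolExecutor(max_workers=workers)

    def read_and_hash(item):
        # I/O-bound read plus CPU-bound MD5, both pushed into the thread pool.
        data = item.read_data()  # hypothetical accessor
        return data, hashlib.md5(data).hexdigest()

    # Read item data and compute MD5 sums for all items concurrently.
    results = await asyncio.gather(
        *(loop.run_in_executor(pool, read_and_hash, item) for item in items)
    )

    # Create the shared-memory segments for all items in parallel as well.
    await asyncio.gather(
        *(loop.run_in_executor(pool, item.create_shm, data)  # hypothetical method
          for item, (data, _md5) in zip(items, results))
    )

    pool.shutdown(wait=False)
    return [md5 for _data, md5 in results]
```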

✅ New Batched RPC Interface

  • To eliminate RPC chattiness, new exposed_*_v2 methods have been added to the CacheServer. These include alloc_v2, release_v2, set_items_data_v2, get_items_data_v2, set_items_embed_v2, and get_items_embed_v2 (client-side usage sketched below).
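From the client's perspective, the whole batch now travels in one call. A minimal usage sketch, assuming pickle for the payload and a plain rpyc connection; the host, port, and payload layout are placeholders:

```python
import pickle
import rpyc

# Connect to the cache server (host and port are placeholders).
conn = rpyc.connect("localhost", 18861)

# Pack metadata for every multimodal item into one payload and issue a single
# alloc_v2 call instead of a chatty per-argument / per-item exchange.
request_blob = pickle.dumps([("md5-of-item-0", 576), ("md5-of-item-1", 576)])
response_blob = conn.root.alloc_v2(request_blob)
alloc_results = pickle.loads(response_blob)
```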

✅ Server-Side Batch Handling

  • The CacheServer's new v2 endpoints deserialize the request blob, process the batch of items internally, and return a single serialized response, which makes the server-side logic more efficient and cohesive (sketched below).
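A sketch of what such a v2 endpoint might look like on the server. Only the one-blob-in, one-blob-out shape comes from this PR's description; the use of pickle and the internal helper are illustrative assumptions:

```python
import pickle
import rpyc

class CacheServerSketch(rpyc.Service):
    def exposed_alloc_v2(self, request_blob: bytes) -> bytes:
        # One deserialization pass over the whole batch.
        items = pickle.loads(request_blob)  # e.g. [(md5sum, token_num), ...]

        # Process the batch internally; the allocation logic is elided here.
        results = [self._alloc_one(md5sum, token_num) for md5sum, token_num in items]

        # One serialized response instead of per-item return values.
        return pickle.dumps(results)

    def _alloc_one(self, md5sum, token_num):
        ...  # hypothetical internal allocation helper
```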

✅ Feature Toggle

  • Added --enable_concurrent_alloc and --concurrent_alloc_workers parameters to control the new concurrent allocation behavior. This allows for a gradual rollout and an easy fallback to the original implementation if needed (see the sketch after this list).
  • In audioserver.manager and visualserver.manager, get_items_embed now defaults to the v2 implementation to reduce time.
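A minimal sketch of how the toggle could be wired with argparse. The flag names come from this PR; the parser setup, default value, and dispatch logic are assumptions:

```python
import argparse

parser = argparse.ArgumentParser()
# Opt-in switch for the new concurrent allocation path; leaving it off keeps
# the original sequential implementation as the fallback.
parser.add_argument("--enable_concurrent_alloc", action="store_true")
# Number of worker threads used for concurrent reads and MD5 computation.
parser.add_argument("--concurrent_alloc_workers", type=int, default=4)
args = parser.parse_args()

# Dispatch (illustrative): pick the v2 path only when the toggle is set.
# if args.enable_concurrent_alloc:
#     resources = await _alloc_multimodal_resources_v2(items, args.concurrent_alloc_workers)
# else:
#     resources = await _alloc_multimodal_resources(items)
```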

Performance Evaluation

I evaluated performance during inference with our internal LLaVA-like model, using images of the same size (644×364).
Testing Environment:

  • GPU: Single NVIDIA A100
  • CPU: Two Intel Xeon Platinum 8358P @ 2.60GHz (128 logical cores)
  • Parameters: concurrent_alloc_workers=4

The reported values are averages and may fluctuate slightly, but not significantly.

| Image num | cache_client.root.alloc time, origin (ms) | cache_client.root.alloc time, optimized (ms) | _alloc_multimodal_resources time, origin (ms) | _alloc_multimodal_resources time, optimized (ms) |
| --- | --- | --- | --- | --- |
| 4 | 43.4 | 1.60 | 129.8 | 5.95 |
| 8 | 43.2 | 1.49 | 129.8 | 5.90 |
| 16 | 42.2 | 1.00 | 128.0 | 5.96 |
| 32 | 42.4 | 1.16 | 130.6 | 10.0 |
| 64 | 44.0 | 1.58 | 141.2 | 15.9 |
| 128 | 44.6 | 2.17 | 142.6 | 26.4 |

dyyoungg · Aug 21 '25 04:08