Implement parallel model preloading
@AlexCheema
Implement Parallel Model Preloading
Description
This PR introduces parallel model preloading to significantly reduce startup times for large models distributed across multiple nodes. By leveraging asyncio, we now preload model shards into memory concurrently, followed by a sequential initialization step.
Changes
- Added `preload_model` method to the `InferenceEngine` abstract class
- Implemented `preload_model` in `MLXDynamicShardInferenceEngine`
- Updated `ensure_shard` method to work with preloaded models
- Modified `main.py` to use parallel preloading
Implementation Details
- `InferenceEngine` now has an abstract `preload_model` method
- `MLXDynamicShardInferenceEngine.preload_model` loads model config and weights without full initialization
- `ensure_shard` completes initialization using preloaded data
- Main script uses `asyncio.gather` for parallel preloading
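A simplified sketch of this flow follows. The `Shard` fields, loader helpers, and model-building step here are illustrative placeholders rather than exo's exact API:

```python
import asyncio
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass(frozen=True)
class Shard:
    model_id: str
    start_layer: int
    end_layer: int


async def load_config(model_id: str) -> dict:
    await asyncio.sleep(0.1)  # stand-in for reading the model config
    return {"model_id": model_id}


async def load_weights(model_id: str, start_layer: int, end_layer: int) -> dict:
    await asyncio.sleep(0.5)  # stand-in for loading weight files for this layer range
    return {"layers": (start_layer, end_layer)}


class InferenceEngine(ABC):
    @abstractmethod
    async def preload_model(self, shard: Shard) -> None:
        """Load config and weights into memory without full initialization."""

    @abstractmethod
    async def ensure_shard(self, shard: Shard) -> None:
        """Finish initialization, reusing preloaded data when available."""


class MLXDynamicShardInferenceEngine(InferenceEngine):
    def __init__(self) -> None:
        self.preloaded: dict[Shard, tuple[dict, dict]] = {}
        self.model = None

    async def preload_model(self, shard: Shard) -> None:
        config = await load_config(shard.model_id)
        weights = await load_weights(shard.model_id, shard.start_layer, shard.end_layer)
        self.preloaded[shard] = (config, weights)

    async def ensure_shard(self, shard: Shard) -> None:
        config, weights = self.preloaded.pop(shard, (None, None))
        if config is None:  # not preloaded, fall back to loading on demand
            config = await load_config(shard.model_id)
            weights = await load_weights(shard.model_id, shard.start_layer, shard.end_layer)
        # Sequential initialization step: build the model from the loaded pieces
        self.model = {"config": config, "weights": weights}


async def startup(engine: InferenceEngine, shards: list[Shard]) -> None:
    # Preload every shard concurrently, then finish initialization in order
    await asyncio.gather(*(engine.preload_model(s) for s in shards))
    for s in shards:
        await engine.ensure_shard(s)


if __name__ == "__main__":
    shards = [Shard("example-model", 0, 15), Shard("example-model", 16, 31)]
    asyncio.run(startup(MLXDynamicShardInferenceEngine(), shards))
```

The point of the pattern is that `asyncio.gather` overlaps the I/O-heavy loading across shards, while `ensure_shard` keeps the final initialization sequential.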
Performance Improvements
- Startup time for multi-shard models is expected to decrease significantly
- Resource utilization during startup is more efficient
How to Test
- Run the main script with a multi-shard model
- Observe logs for parallel preloading and sequential initialization
- Compare startup times with the previous sequential loading approach
Future Work
- Fine-tune the balance between parallel preloading and sequential initialization
- Implement similar optimizations for other inference engines (e.g., TinyGrad)
If you feel like supporting me:
https://buymeacoffee.com/aybanda
Hey, is this AI generated?
We don't accept AI-generated PRs.
This doesn't really achieve its intended purpose: calling `preload_model` in `main.py` doesn't really make sense since exo doesn't know up front which shards you are going to use.
Hey @AlexCheema, I got your point, and yes, I generated this using AI.
Instead of preloading in `main.py`, we could modify the `ensure_shard` method to implement a more efficient loading process. Here's an approach that might work better with your design: modifying `ensure_shard` in `MLXDynamicShardInferenceEngine`. I think this would be more suitable because it:
- Loads config and weights concurrently
- Doesn't require changes to `main.py` or other parts of exo
- Keeps the loading process within the `ensure_shard` method, maintaining your existing architecture
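A rough, self-contained sketch of what I mean; `load_config` and `load_weights` are placeholder names for whatever exo actually uses to read the config and weight files, not its real helpers:

```python
import asyncio

# Placeholder loaders: stand-ins for however the engine actually reads
# the config and weight files for a shard.
async def load_config(model_id):
    await asyncio.sleep(0.1)
    return {"model_id": model_id}

async def load_weights(model_id, start_layer, end_layer):
    await asyncio.sleep(0.5)
    return {"layers": (start_layer, end_layer)}


class MLXDynamicShardInferenceEngine:
    def __init__(self):
        self.shard = None
        self.model = None

    async def ensure_shard(self, shard):
        if self.shard == shard:
            return  # already initialized for this shard

        # Fetch config and weights concurrently instead of one after the other
        config, weights = await asyncio.gather(
            load_config(shard.model_id),
            load_weights(shard.model_id, shard.start_layer, shard.end_layer),
        )

        # Sequential part stays as-is: build the model from the loaded pieces
        self.model = {"config": config, "weights": weights}
        self.shard = shard
```

Only `ensure_shard` changes; the concurrency is limited to the config/weights fetch for the shard that is actually requested.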
If you are interested in this let me know, I will change the code accordingly.