feat: implement bandwidth-aware shard assignment for pipeline parallelism
This PR implements bandwidth-aware shard assignment for pipeline parallelism, minimizing total inference time by assigning more layers to devices with higher memory bandwidth. This aligns with Issue #957.
Changes
- Profiling: Added `memory_bandwidth` to `NodePerformanceProfile` and Apple Silicon bandwidth data.
- Assignment: Implemented greedy assignment algorithm in `placement_utils.py`.
- Verification: Added `test_get_shard_assignments_bandwidth_aware` to confirm optimal distribution.
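
For reviewers who want the intuition, here is a minimal sketch of how layers could be split in proportion to memory bandwidth. It is not the code in `placement_utils.py`: the function name `assign_layers_by_bandwidth`, the node names, and the bandwidth figures are all hypothetical, and the sketch ignores memory capacity limits that a real placement would need to respect. It gives each node its proportional share rounded down, then greedily hands the leftover layers to the nodes with the largest fractional remainder.

```python
from typing import Dict


def assign_layers_by_bandwidth(
    total_layers: int, bandwidth_gbps: Dict[str, float]
) -> Dict[str, int]:
    """Split layers across nodes roughly in proportion to memory bandwidth.

    Hypothetical helper for illustration only; not the function added in this PR.
    """
    if total_layers < 0:
        raise ValueError("total_layers must be non-negative")
    if not bandwidth_gbps or any(bw <= 0 for bw in bandwidth_gbps.values()):
        raise ValueError("every node needs a positive bandwidth")

    total_bw = sum(bandwidth_gbps.values())

    # Ideal fractional share of layers for each node.
    ideal = {
        node: total_layers * bw / total_bw for node, bw in bandwidth_gbps.items()
    }

    # Start from the rounded-down share.
    counts = {node: int(share) for node, share in ideal.items()}
    leftover = total_layers - sum(counts.values())

    # Greedy step: give the remaining layers to the nodes with the largest
    # fractional remainder, breaking ties in favor of higher bandwidth.
    order = sorted(
        ideal,
        key=lambda node: (ideal[node] - counts[node], bandwidth_gbps[node]),
        reverse=True,
    )
    for node in order[:leftover]:
        counts[node] += 1

    return counts


if __name__ == "__main__":
    # Illustrative numbers only: one node with 4x the bandwidth of the other.
    print(assign_layers_by_bandwidth(32, {"node_a": 400.0, "node_b": 100.0}))
```

With these illustrative inputs, the higher-bandwidth node receives 26 of the 32 layers and the slower node receives 6.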