feat: implement bandwidth-aware shard assignment for pipeline parallelism

Open abendrothj opened this issue 2 months ago • 0 comments

This PR implements bandwidth-aware shard assignment for pipeline parallelism, minimizing total inference time by assigning more layers to devices with higher memory bandwidth. This aligns with Issue #957.

Changes

Profiling: Added memory_bandwidth to NodePerformanceProfile and Apple Silicon bandwidth data.
Assignment: Implemented greedy assignment algorithm in placement_utils.py.
Verification: Added test_get_shard_assignments_bandwidth_aware to confirm optimal distribution.

Jan 03 '26 13:01 abendrothj