Adding a knob to control the limit on async-compute resources.
📝 Summary of Changes
- Adding a knob to control the limit on async-compute resources. This switch provides flexibility for control, enabling more asynchronous computations to execute concurrently. In host-offloading experiments, increasing this value effectively overlaps device-to-host (D2H) transfers with other computations, resulting in improved performance.
🎯 Justification This allows users to control how many in-flight async computations the LHS (latency-hiding scheduler) permits.
🚀 Kind of Contribution ✨ New Feature
Trying to understand the purpose of this CL:
> increasing this value effectively overlaps device-to-host (D2H) transfers with other computations

Can I assume both compute and D2H are now treated as async-compute, so this helps the overlap?
Besides, some general asks for all PRs (that also apply to this PR):
- How can we measure and track the performance delta for this PR? Can you provide speedups you measure on one of the HLO benchmarks in compiler/xla/tools/benchmarks/hlo/?
- Can you add or point to an execution test that exercises this code path?
> Can I assume both compute and D2H are now treated as async-compute, so this helps the overlap?

IIUC, only DUS (dynamic-update-slice) fused with D2H would be treated as async-compute. But we have many such kernels, and the default value (2) would block the scheduler from starting more of them.
> How can we measure and track the performance delta for this PR? Can you provide speedups you measure on one of the HLO benchmarks in compiler/xla/tools/benchmarks/hlo/?

We measured on DeepSeekV3-671B implemented with MaxText and are seeing a ~10% end-to-end speedup.
> Can you add or point to an execution test that exercises this code path?

Could you advise what tests we should have for adding a flag?
> We measured on DeepSeekV3-671B implemented with MaxText and are seeing a ~10% end-to-end speedup.

Similar to the other CL, we would like to add this benchmark to our suite, to guard the optimizations from this PR and related PRs against future changes.
> Can you add or point to an execution test that exercises this code path?

> Could you advise what tests we should have for adding a flag?

My understanding is that this PR enables users to set parallel compute on more than 2 streams. If so, can you add a test for this scenario, e.g. 3 or 4 streams?

Regarding the location, I think it depends on the use case you are trying to support here. For example, in @Tixxx's https://github.com/openxla/xla/pull/7854, a test is added in /transforms/stream_attribute_annotator_test.cc for the 2-stream case.

My intention is to understand the use case here more concretely so it can benefit users from the wider community.
> Similar to the other CL, we would like to add this benchmark to our suite, to guard the optimizations from this PR and related PRs against future changes.

I added an HLO in pull/33240.
> My understanding is that this PR enables users to set parallel compute on more than 2 streams. If so, can you add a test for this scenario, e.g. 3 or 4 streams?

Could we wait until pull/33240 is merged and then extend its tests to cover more than 2 streams?