Felix Wang

Results 13 comments of Felix Wang

Trying to understand the purpose of this CL: for `, increasing this value effectively overlaps device-to-host (D2H) transfers with other computations`, can I assume both compute and D2H are treated...

> We measured in DeepSeekV3-671B implemented with Maxtext and are seeing ~10% speedup end-2-end.
>
> > similar to the other CL, we would like adding this benchmark into our...

> > Can you add or point to an execution test that exercises this code path?
>
> Could you advise what tests should we have for adding a flag?...

Some general asks for all PRs (that also apply to this PR):
- What workload is this change motivated by and why is it important?
- How can we measure...

> > What workload is this change motivated by and why is it important?
>
> We observed in maxtext model-training with async-host-offloading, XLA scheduler would lean to schedule this...

Do you have an HLO for host offloading that demonstrates this speed-up? Adding it to the benchmark suite could guard host offloading features against regressions in future development.

Thanks @giordano for the quick verification. I'm just trying to understand the effect of the above XLA commit, to help us better prioritize the optimization direction. Can I assume, without the commit, [link](https://github.com/EnzymeAD/Enzyme-JAX/pull/1243#issuecomment-3146860015),...

Ah, thank you for this suggestion! Will use it going forward. Apologies for overriding your commits; I'm new to the GitHub review process, so any suggestions are helpful and welcome!

I only want to update the XLA commit to `1ac176a9b8b4800bc2753d944eec62a39e6189b8` to verify that the HLO dump looks as intended. No need to trigger OOM anymore. The new commit failed with ```...

Yeah, a list of zero-bcast are coalesced as expected in https://github.com/EnzymeAD/Enzyme-JAX/pull/1307#issue-3324447702, e.g. from the new dump, `%broadcast.764` is the coalesced bcast. ``` %broadcast.764 = f64[760,1527]{1,0} broadcast(%constant.304), dimensions={}, metadata={op_name="pad.5725"} %collective-permute.1 =...