torchrec

Fix None process group appended in _fuse_input_dist_splits

Open YongzhongYang opened this issue 1 month ago • 1 comment

Summary: We identified this potential bug while debugging the issue reported in https://fb.workplace.com/groups/755371733754414/permalink/833999072850393/

Fixed a bug in _fuse_input_dist_splits where names with no valid process group (pg=None) were being added to names_per_pg[None]. This would cause issues downstream when trying to create FusedKJTListSplitsAwaitable with a None process group.

The issue occurred when:

  1. A request is of type KJTListSplitsAwaitable
  2. None of its awaitables are of type KJTSplitsAllToAllMeta
  3. This leaves pg = None (line 207)
  4. The name was still appended to names_per_pg[None] (line 213)

The fix adds a check to only append names when pg is not None, ensuring that only requests with valid process groups are included in the fused operations.
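
As a rough illustration of the guard described above, here is a minimal, self-contained sketch. It does not use torchrec: `FakeProcessGroup`, `SplitsMeta`, `OtherAwaitable`, and `group_names_by_pg` are hypothetical stand-ins invented for this example, and the real `_fuse_input_dist_splits` may be structured differently. Only the `pg is not None` check mirrors the change described in this PR.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class FakeProcessGroup:
    """Hypothetical stand-in for a distributed process group."""
    name: str


@dataclass
class SplitsMeta:
    """Hypothetical stand-in for an awaitable that carries a process group
    (the role KJTSplitsAllToAllMeta plays in the description)."""
    pg: FakeProcessGroup


@dataclass
class OtherAwaitable:
    """Hypothetical stand-in for an awaitable with no process group."""
    value: int = 0


def group_names_by_pg(
    requests: Dict[str, List[object]],
) -> Dict[FakeProcessGroup, List[str]]:
    """Group request names by process group, skipping requests without one."""
    names_per_pg: Dict[FakeProcessGroup, List[str]] = defaultdict(list)
    for name, awaitables in requests.items():
        pg = None
        for awaitable in awaitables:
            if isinstance(awaitable, SplitsMeta):
                pg = awaitable.pg
                break
        # The guard: without it, names with pg=None would land in
        # names_per_pg[None] and later be fused under a None process group.
        if pg is not None:
            names_per_pg[pg].append(name)
    return dict(names_per_pg)
```

With this guard, a request whose awaitables carry no process group is left out of the grouping entirely, so no `None` key ever reaches the code that builds the fused awaitable.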

Why this matters:

  • Prevents passing pg=None to FusedKJTListSplitsAwaitable (line 232)
  • Ensures only valid distributed operations are fused together
  • Avoids potential runtime errors or undefined behavior

Differential Revision: D87110878

YongzhongYang, Nov 14 '25 23:11

@YongzhongYang has exported this pull request. If you are a Meta employee, you can view the originating Diff in D87110878.

meta-codesync[bot], Nov 14 '25 23:11