
Fix error on HYBRID_SHARD misconfiguration

Open · abhi-mosaic opened this issue 1 year ago • 1 comment

What does this PR do?

If you use HYBRID_SHARD but specify only a single-element list for device_mesh, execution reaches this ValueError, but building the message inside the ValueError itself fails, because apparently when the DeviceMesh is 1-D, device_mesh.get_group() returns a single ProcessGroup object rather than a List[ProcessGroup].

Using device_mesh.ndim for the error reporting sidesteps the problem.
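A minimal sketch of the failure mode, for illustration only. It does not import torch or Composer: FakeDeviceMesh and FakeProcessGroup are hypothetical stand-ins that mimic the behavior described above (a single group object for a 1-D mesh, a list otherwise), and the message wording is an assumption, not Composer's actual text.

```python
class FakeProcessGroup:
    """Hypothetical stand-in for a process group; it has no __len__."""


class FakeDeviceMesh:
    """Hypothetical stand-in that mimics the behavior described in this PR."""

    def __init__(self, shape):
        self.shape = tuple(shape)

    @property
    def ndim(self):
        return len(self.shape)

    def get_group(self):
        # Assumption taken from the PR description: a 1-D mesh yields a
        # single group object, a higher-dimensional mesh yields a list.
        if self.ndim == 1:
            return FakeProcessGroup()
        return [FakeProcessGroup() for _ in self.shape]


def message_via_get_group(mesh):
    # Old-style reporting: len() fails when get_group() is not a list.
    return f"HYBRID_SHARD needs a 2-D device mesh, got {len(mesh.get_group())} group(s)"


def message_via_ndim(mesh):
    # Reporting through ndim works for any number of mesh dimensions.
    return f"HYBRID_SHARD needs a 2-D device mesh, got {mesh.ndim} dimension(s)"


def check_hybrid_shard_mesh(mesh):
    # Illustrative validation only; the real check lives in Composer.
    if mesh.ndim != 2:
        raise ValueError(message_via_ndim(mesh))
```

With a 1-D mesh such as FakeDeviceMesh([8]), message_via_get_group raises a TypeError before the intended ValueError can be constructed, while check_hybrid_shard_mesh raises the ValueError with a readable message.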

abhi-mosaic, Mar 12 '24 18:03

Is it not caught earlier? It should be checked here: https://github.com/mosaicml/composer/blob/f65bb27800810240479c1d4802f124a9b60b2096/composer/trainer/dist_strategy.py#L319-L342

mvpatel2000, Mar 13 '24 18:03