composer
Fix error on HYBRID_SHARD misconfiguration
What does this PR do?
If you use HYBRID_SHARD but specify only a single-element list for device_mesh, execution reaches this ValueError, but building the message inside the ValueError then fails itself: apparently, when the DeviceMesh is 1-D, device_mesh.get_group() returns a single ProcessGroup object rather than a List[ProcessGroup].
Using device_mesh.ndim for the error reporting sidesteps the problem.
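To illustrate the failure mode and the fix, here is a minimal sketch that does not touch torch at all. `FakeProcessGroup`, `FakeDeviceMesh`, `broken_message`, and `check_hybrid_shard_mesh` are hypothetical stand-ins written for this example, and the error wording is made up; the only behavior assumed is what is described above, namely that `get_group()` on a 1-D mesh returns a single group object instead of a list.

```python
from typing import List, Union


class FakeProcessGroup:
    """Hypothetical stand-in for a ProcessGroup; it deliberately has no __len__."""


class FakeDeviceMesh:
    """Hypothetical stand-in that mimics the DeviceMesh behavior described in this PR."""

    def __init__(self, shape: List[int]):
        self.shape = shape

    @property
    def ndim(self) -> int:
        return len(self.shape)

    def get_group(self) -> Union[FakeProcessGroup, List[FakeProcessGroup]]:
        # Assumption taken from the PR description: a 1-D mesh yields a single
        # group object, while a multi-dimensional mesh yields a list of groups.
        if self.ndim == 1:
            return FakeProcessGroup()
        return [FakeProcessGroup() for _ in self.shape]


def broken_message(device_mesh: FakeDeviceMesh) -> str:
    # Old-style reporting: len() fails with TypeError on a single group object,
    # so the intended ValueError is never raised.
    return f'Expected 2 process groups, got {len(device_mesh.get_group())}.'


def check_hybrid_shard_mesh(device_mesh: FakeDeviceMesh) -> None:
    # Fixed reporting: ndim is always an int, so the message can always be built.
    if device_mesh.ndim != 2:
        raise ValueError(
            f'HYBRID_SHARD requires a 2-D device mesh, but got {device_mesh.ndim} dimension(s).'
        )
```

With a single-element mesh such as `FakeDeviceMesh([8])`, `broken_message` raises a `TypeError` while formatting, whereas `check_hybrid_shard_mesh` raises the intended `ValueError`.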
Is it not caught earlier? It should be checked here: https://github.com/mosaicml/composer/blob/f65bb27800810240479c1d4802f124a9b60b2096/composer/trainer/dist_strategy.py#L319-L342