tnt wraps DDP models with DSD

Open LucasLLC opened this issue 7 months ago • 2 comments

Summary: Distributed State Dict (DSD) is the way PyTorch currently recommends to ensure that the state dicts of parallelized models are compatible with saving and loading in single-process or resharding scenarios.

This diff updates dcp_saver to use DSD for DDP models. A good follow-up would be to wrap all models in TNT with DSD, as this could replace some of the wrapper logic for FSDP and would guarantee future compatibility.

N5551629 also contains a workaround for existing DDP model checkpoints saved before this diff: manually removing the "module." prefix from the keys in the checkpoint.
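
For readers without access to N5551629, here is a minimal sketch of what such a prefix removal could look like. It is not the notebook's code. It assumes the old checkpoint can be loaded as a flat dict keyed by parameter name, and `strip_ddp_prefix` and the sample keys are hypothetical names used only for illustration.

```python
from typing import Any, Dict

# DDP stores the wrapped model under the attribute `module`, so every key in
# its state dict starts with this prefix.
DDP_PREFIX = "module."


def strip_ddp_prefix(
    state_dict: Dict[str, Any], prefix: str = DDP_PREFIX
) -> Dict[str, Any]:
    """Return a copy of `state_dict` with a leading `prefix` removed from each key.

    Keys that do not start with `prefix` are copied unchanged.
    Hypothetical helper, not part of TNT or PyTorch.
    """
    return {
        (key[len(prefix):] if key.startswith(prefix) else key): value
        for key, value in state_dict.items()
    }


# Illustrative stand-in for a checkpoint saved from a DDP-wrapped model.
old_checkpoint = {
    "module.linear.weight": [1.0, 2.0],
    "module.linear.bias": [0.5],
    "step": 10,
}
fixed_checkpoint = strip_ddp_prefix(old_checkpoint)
```

The helper returns a new dict rather than mutating its input, so the original checkpoint stays available if the conversion has to be retried. It does not guard against key collisions after stripping, which a real migration may want to check.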

Differential Revision: D59234083

Jul 02 '24 14:07 LucasLLC
