Layer freezing does not support resuming from checkpoints

Open hanlint opened this issue 2 years ago • 0 comments

Layer freezing currently doesn't support resuming from checkpoints because of a pair of circular requirements:

  1. Similar to model surgery, layer freezing modifies the optimizer param groups, so it must be applied before load_checkpoint runs; otherwise, loading the optimizer param groups fails because they no longer match the checkpoint (see the sketch after this list).
  2. However, layer freezing needs to know the current epoch, which comes from the checkpoint, to determine which layers should be frozen.
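
To illustrate requirement 1, here is a minimal sketch in plain PyTorch (not Composer's actual code) of the param group mismatch, assuming layer freezing drops frozen parameters from the optimizer param group:

    import torch

    model = torch.nn.Linear(4, 2)

    # At save time, layer freezing had already dropped the bias from the param
    # group, so the checkpointed optimizer state covers only the weight.
    checkpoint = torch.optim.SGD([model.weight], lr=0.1).state_dict()

    # On resume, a freshly constructed optimizer still holds every parameter.
    # If layer freezing has not been applied yet, the group sizes disagree and
    # loading fails with a ValueError about a parameter group that doesn't
    # match the size of the optimizer's group.
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    opt.load_state_dict(checkpoint)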

Separately, saving a model from the Trainer while layer freezing is enabled currently fails with a KeyError, because layer freezing does not also update the optimizer state to match the modified param groups:

    packed_state = {(param_mappings[id(k)] if isinstance(k, torch.Tensor) else k): v for k, v in self.state.items()}
    _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

    .0 = <dict_itemiterator object at 0x147a432c0>

    >   packed_state = {(param_mappings[id(k)] if isinstance(k, torch.Tensor) else k): v for k, v in self.state.items()}
    E   KeyError: 5171377232
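
The KeyError can be reproduced outside of Composer with a minimal sketch, assuming (based on the traceback above) that frozen parameters are removed from the param groups while their entries remain in the optimizer state:

    import torch

    model = torch.nn.Linear(4, 2)
    opt = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

    # One optimizer step populates opt.state with a momentum buffer per parameter.
    model(torch.randn(8, 4)).sum().backward()
    opt.step()

    # "Freeze" the bias by dropping it from the param group, leaving its stale
    # entry behind in opt.state.
    frozen = model.bias
    opt.param_groups[0]["params"] = [
        p for p in opt.param_groups[0]["params"] if p is not frozen
    ]

    # state_dict() builds param_mappings only from the param groups, so the
    # state entry keyed by the frozen tensor has no mapping and raises
    # KeyError: id(frozen), matching the traceback above.
    opt.state_dict()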

hanlint avatar May 05 '22 17:05 hanlint