Layer freezing does not support resuming training from checkpoints
Layer freezing currently doesn't support resuming from checkpoints because of circular requirements:

- Similar to model surgery, layer freezing modifies the optimizer param groups, and therefore needs to be applied before `load_checkpoint` runs; otherwise, loading the optimizer param groups will fail due to mismatches (see the sketch below).
- However, layer freezing needs to know the current epoch from the checkpoint in order to determine which layers need to be frozen.
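To make the first bullet concrete, here is a minimal sketch in plain PyTorch (not Composer's `load_checkpoint` or layer-freezing code; the "freeze" here is just dropping a param group) showing the mismatch that occurs when freezing changes the param groups before the checkpointed optimizer state is restored:

```python
# A hedged sketch in plain PyTorch, not Composer's load_checkpoint or layer
# freezing: it only illustrates why the optimizer state must be loaded against
# matching param groups.
import torch

model = torch.nn.Sequential(torch.nn.Linear(4, 4), torch.nn.Linear(4, 2))
opt = torch.optim.SGD([{"params": layer.parameters()} for layer in model], lr=0.1)

ckpt = opt.state_dict()  # checkpoint saved while both param groups exist

# Hypothetical freeze applied *after* the checkpoint was written but *before*
# restoring it: the first layer's param group is dropped.
opt.param_groups = opt.param_groups[1:]

try:
    opt.load_state_dict(ckpt)
except ValueError as err:
    # "loaded state dict has a different number of parameter groups"
    print(err)
```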
Separately, saving a checkpoint from the Trainer while layer freezing is in use currently fails with a `KeyError`, because layer freezing does not also update the optimizer state accordingly:
```
packed_state = {(param_mappings[id(k)] if isinstance(k, torch.Tensor) else k): v for k, v in self.state.items()}
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

.0 = <dict_itemiterator object at 0x147a432c0>

>       packed_state = {(param_mappings[id(k)] if isinstance(k, torch.Tensor) else k): v for k, v in self.state.items()}
E       KeyError: 5171377232
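```

The failure mode appears to be that the frozen parameters are removed from `optimizer.param_groups` while their entries remain in `optimizer.state`; `Optimizer.state_dict()` builds `param_mappings` only from the param groups, so the orphaned state keys are missing from the mapping. A minimal sketch in plain PyTorch (not Composer's layer-freezing implementation; the freezing step here is hypothetical) that reproduces the same `KeyError`:

```python
# A hedged sketch in plain PyTorch, not Composer's layer-freezing code: the
# "freeze" step below is hypothetical and only meant to reproduce the KeyError.
import torch

model = torch.nn.Sequential(torch.nn.Linear(4, 4), torch.nn.Linear(4, 2))
opt = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

# One step so optimizer.state holds momentum buffers keyed by parameter tensors.
model(torch.randn(1, 4)).sum().backward()
opt.step()

# Hypothetical freeze: drop the first layer's params from the param group,
# but leave their momentum buffers behind in opt.state.
frozen_ids = {id(p) for p in model[0].parameters()}
group = opt.param_groups[0]
group["params"] = [p for p in group["params"] if id(p) not in frozen_ids]

# state_dict() builds param_mappings only from param_groups, so the orphaned
# state keys trigger KeyError: id(<frozen param>) as in the traceback above.
opt.state_dict()
```

In this toy example, also removing the frozen parameters' entries from `optimizer.state` (or leaving them in the param group and only setting `requires_grad=False`) avoids the error, which suggests layer freezing needs to keep the optimizer state in sync with the param groups it edits.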