Partial training of the system
Freezing some of the modules would allow training adapters as actual adapters: the base model stays fixed and only the adapter parameters are updated.
Ideally, this would entail introducing some mechanism to mark specific layerstacks/adapters in the config as not requiring gradient.
To be confirmed, but we can probably just do a combination of the following to get the desired behavior:
- leave the marked modules out of all communication groups
- not apply the forward has_grad_hook to these modules
- remove them from gradient computations with module.requires_grad_(False)
Basically, you need to set param.requires_grad to False for the modules that should be frozen. If you do this in model_builder.py between creating the NMTModel and the call to create_adapters, the former will be frozen and only the latter will be trained. In any case, you want to do this before registering the has_grad_hook, which happens a few lines later.
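As a sketch of that freezing step, assuming a hypothetical opts.frozen_modules option listing module-name prefixes (the helper name and the commented call sites are illustrative, not the actual model_builder.py code):

```python
# Hypothetical placement in model_builder.py, between building the base
# NMTModel and the call to create_adapters. The option name
# opts.frozen_modules and the helper below are illustrative assumptions.
def freeze_modules(model, frozen_prefixes):
    """Set requires_grad=False on every parameter of the submodules
    whose qualified name starts with one of the given prefixes."""
    for name, module in model.named_modules():
        if any(name.startswith(prefix) for prefix in frozen_prefixes):
            module.requires_grad_(False)

# model = build_base_model(model_opts, ...)   # NMTModel built here
# freeze_modules(model, opts.frozen_modules)  # e.g. ['encoder', 'decoder']
# create_adapters(model, opts)                # new adapter params stay trainable
# ...register has_grad_hook only after this point, as noted above.
```

Because create_adapters runs after the freeze, the adapter parameters are created with the default requires_grad=True, which gives exactly the frozen-base/trainable-adapter split.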
If empty communication groups are an issue (and they probably are), then you need to add a flag to TaskQueueManager.get_distributed_groups that makes sure that the keys encoder, decoder, src_emb, and tgt_emb are empty before returning my_distributed_groups. Either don't populate them, or clear them before returning.
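A minimal sketch of that clearing step, to be called (or inlined) at the end of get_distributed_groups when the new flag is set; the keys follow the list above, but the layout of my_distributed_groups is an assumption:

```python
# Hypothetical helper for TaskQueueManager.get_distributed_groups.
FROZEN_KEYS = ('encoder', 'decoder', 'src_emb', 'tgt_emb')

def clear_frozen_groups(my_distributed_groups):
    """Empty the communication groups for frozen components so that no
    rank waits on gradient all-reduces that will never happen."""
    for key in FROZEN_KEYS:
        if key in my_distributed_groups:
            # Assumes dict-valued groups; use whatever empty container
            # the callers actually expect.
            my_distributed_groups[key] = {}
    return my_distributed_groups
```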
It may also be a good idea to prevent some of the sub-optimizers from being created (in attention_bridge_optimizer in utils/optimizers.py), but that may not even be necessary (I think the optimizers can handle being empty). You can try first without this and implement it if needed.
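If it does turn out to be necessary, one way to do it is to skip any component that has no trainable parameters left. The sketch below assumes a hypothetical per-component loop and the use of Adam; neither is the actual attention_bridge_optimizer code:

```python
import torch

# Hypothetical guard for sub-optimizer creation; the components dict and
# the choice of Adam are assumptions, not the real optimizer setup.
def build_sub_optimizers(components, learning_rate):
    sub_optimizers = {}
    for name, module in components.items():
        params = [p for p in module.parameters() if p.requires_grad]
        if not params:
            # Frozen component: skip it rather than creating an
            # optimizer over an empty parameter list.
            continue
        sub_optimizers[name] = torch.optim.Adam(params, lr=learning_rate)
    return sub_optimizers
```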