
partial training of system

Open · TimotheeMickus opened this issue 2 years ago · 1 comment

Freezing some of the modules would allow training the adapters as actual adapters, i.e. with the rest of the model held fixed.

Ideally, this would entail introducing some mechanism to mark specific layerstacks/adapters in the config as not requiring gradients.

To be confirmed, but we can probably just do a combination of the following to get the desired behavior (see the sketch after this list):

  • leave the marked modules out of all communication groups
  • not apply the forward has_grad hook to these modules
  • remove them from gradient computations with module.requires_grad_(False)
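
A minimal sketch of what the config-driven marking could look like, assuming a hypothetical set of frozen submodule names derived from the config (the actual Mammoth config schema and module naming are not specified here):

```python
import torch.nn as nn

# Hypothetical helper: `frozen_module_names` would come from the config entries
# that mark specific layerstacks/adapters as not requiring gradients.
def freeze_marked_modules(model: nn.Module, frozen_module_names: set) -> None:
    for name, module in model.named_modules():
        if name in frozen_module_names:
            # Recursively sets requires_grad=False on every parameter of the
            # submodule, removing it from gradient computation.
            module.requires_grad_(False)
```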

TimotheeMickus · Sep 29 '23 07:09

Basically you need to set param.requires_grad to False for the modules that should be frozen. If you do this in model_builder.py between creating the NMTModel and the call to create_adapters, then the former will be frozen and only the latter will be trained. In any case, you want to do this before registering the has_grad_hook, which happens a few lines later.
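
As a toy illustration of that ordering (not the actual model_builder.py code), freezing the base module before the adapter is created leaves only the adapter parameters trainable:

```python
import torch.nn as nn

base = nn.Linear(8, 8)            # stands in for the NMTModel built first
for p in base.parameters():
    p.requires_grad = False       # freeze before the adapters exist

adapter = nn.Linear(8, 8)         # stands in for what create_adapters adds
model = nn.Sequential(base, adapter)

# Only the adapter's parameters remain trainable, so a has_grad-style hook
# registered after this point would only see gradients for the adapter.
print([name for name, p in model.named_parameters() if p.requires_grad])
# ['1.weight', '1.bias']
```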

If empty communication groups are an issue (and they probably are), then you need to add a flag to TaskQueueManager.get_distributed_groups that makes sure that the keys encoder, decoder, src_emb, and tgt_emb are empty before returning my_distributed_groups. Either don't populate them, or clear them before returning.
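
A rough sketch of what that flag could do, assuming my_distributed_groups is a dict keyed by module type (the real data structure inside TaskQueueManager may differ):

```python
FROZEN_KEYS = ('encoder', 'decoder', 'src_emb', 'tgt_emb')

def clear_frozen_groups(my_distributed_groups: dict) -> dict:
    # Either never populate these keys, or clear them here before
    # get_distributed_groups returns, so the frozen module types are
    # excluded from gradient communication.
    for key in FROZEN_KEYS:
        if key in my_distributed_groups:
            my_distributed_groups[key] = {}
    return my_distributed_groups
```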

It may also be a good idea to prevent some of the sub-optimizers from being created (in utils/optimizers.py attention_bridge_optimizer), but that may not even be necessary (I think the optimizers can handle being empty). You can try first without this and implement it if needed.
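
If skipping sub-optimizer creation does turn out to be needed, one possible pattern (not the actual attention_bridge_optimizer code) is to only build an optimizer when a module still has trainable parameters:

```python
import torch

def maybe_build_optimizer(module: torch.nn.Module, lr: float = 1e-4):
    params = [p for p in module.parameters() if p.requires_grad]
    if not params:
        # Fully frozen module: no sub-optimizer needed.
        return None
    return torch.optim.Adam(params, lr=lr)
```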

Waino · Dec 18 '23 08:12