Partial training of the system
Freezing some of the modules would allow training adapters as actual adapters: the base model stays fixed and only the adapter parameters are updated.
Ideally, this would entail introducing some mechanism to mark specific layerstacks/adapters in the config as not requiring gradient.
To be confirmed, but we can probably just do a combination of the following to get the desired behavior:
- leave the marked modules out of all communication groups
- not apply the forward has_grad_hook to these modules
- remove them from gradient computations with module.requires_grad_(False)
Basically, you need to set param.requires_grad to False for the modules that should be frozen. If you do this in model_builder.py between creating the NMTModel and the call to create_adapters, the former will be frozen and only the latter will be trained. In any case, you want to do this before registering the has_grad_hook, which happens a few lines later.
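As a sketch of that freezing step, assuming a hypothetical opts.frozen_modules option listing module-name prefixes (the helper name and the commented call sites are illustrative, not the actual model_builder.py code):

```python
# Hypothetical placement in model_builder.py, between building the base
# NMTModel and the call to create_adapters. The option name
# opts.frozen_modules and the helper below are illustrative assumptions.
def freeze_modules(model, frozen_prefixes):
    """Set requires_grad=False on every parameter of the submodules
    whose qualified name starts with one of the given prefixes."""
    for name, module in model.named_modules():
        if any(name.startswith(prefix) for prefix in frozen_prefixes):
            module.requires_grad_(False)

# model = build_base_model(model_opts, ...)   # NMTModel built here
# freeze_modules(model, opts.frozen_modules)  # e.g. ['encoder', 'decoder']
# create_adapters(model, opts)                # new adapter params stay trainable
# ...register has_grad_hook only after this point, as noted above.
```

Because create_adapters runs after the freeze, the adapter parameters are created with the default requires_grad=True, which gives exactly the frozen-base/trainable-adapter split.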
If empty communication groups are an issue (and they probably are), then you need to add a flag to TaskQueueManager.get_distributed_groups that makes sure that the keys encoder, decoder, src_emb, and tgt_emb are empty before returning my_distributed_groups. Either don't populate them, or clear them before returning.
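A minimal sketch of that clearing step, to be called (or inlined) at the end of get_distributed_groups when the new flag is set; the keys follow the list above, but the layout of my_distributed_groups is an assumption:

```python
# Hypothetical helper for TaskQueueManager.get_distributed_groups.
FROZEN_KEYS = ('encoder', 'decoder', 'src_emb', 'tgt_emb')

def clear_frozen_groups(my_distributed_groups):
    """Empty the communication groups for frozen components so that no
    rank waits on gradient all-reduces that will never happen."""
    for key in FROZEN_KEYS:
        if key in my_distributed_groups:
            # Assumes dict-valued groups; use whatever empty container
            # the callers actually expect.
            my_distributed_groups[key] = {}
    return my_distributed_groups
```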
It may also be a good idea to prevent some of the sub-optimizers from being created (in attention_bridge_optimizer in utils/optimizers.py), but that may not even be necessary (I think the optimizers can handle being empty). You can try first without this and implement it if needed.
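If it does turn out to be necessary, one way to do it is to skip any component that has no trainable parameters left. The sketch below assumes a hypothetical per-component loop and the use of Adam; neither is the actual attention_bridge_optimizer code:

```python
import torch

# Hypothetical guard for sub-optimizer creation; the components dict and
# the choice of Adam are assumptions, not the real optimizer setup.
def build_sub_optimizers(components, learning_rate):
    sub_optimizers = {}
    for name, module in components.items():
        params = [p for p in module.parameters() if p.requires_grad]
        if not params:
            # Frozen component: skip it rather than creating an
            # optimizer over an empty parameter list.
            continue
        sub_optimizers[name] = torch.optim.Adam(params, lr=learning_rate)
    return sub_optimizers
```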