
cannot use scheduler for grad_factor

Open violenil opened this issue 4 years ago • 1 comment

In my model implementation, I would like to freeze the transformer (roberta-base in a Tok2VecTransformer.v1) for the first 2 epochs of training. From this spaCy documentation, it seems it should be possible to set grad_factor to 0 in order to disable gradients from one of the listeners. Setting this per epoch should then be possible, according to the same documentation, by using a scheduler. In my config, I have specified the constant_then.v1 schedule wrapping a constant.v1 schedule in the following way:

[components.seq2labels.model.tok2vec]
@architectures = "spacy-transformers.Tok2VecTransformer.v1"
name = "roberta-base"
tokenizer_config = {"use_fast": true}

[components.seq2labels.model.tok2vec.grad_factor]
@schedules = "constant_then.v1"
rate = 0.0
steps = 2000

[components.seq2labels.model.tok2vec.grad_factor.schedule]
@schedules = "constant.v1"
rate = 1.0

When initializing, I get the following error:

=========================== Initializing pipeline ===========================
✘ Config validation error
seq2labels.model.tok2vec -> grad_factor   value is not a valid float

It seems to me that the scheduler may be returning an iterator instead of a float that can be used as a value here. Have I overlooked some aspect that should still be implemented/amended?

Otherwise, if this scheduler does not work with grad_factor, is there another way to freeze the transformer only for the first 2 epochs of training?

Thanks for any help in advance :)

violenil avatar Aug 19 '21 09:08 violenil

This is basically because grad_factor isn't designed to take a sequence of values, like an iterator, as you note. That's not just a type-annotation oversight: the transformers model doesn't support a sequence there at the moment.
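To make the mismatch concrete, here's a rough Python sketch of what those two registered schedules produce, i.e. a stream of per-step values rather than a single float (a simplified stand-in, not thinc's actual implementation):

from itertools import chain, islice, repeat

def constant(rate):
    # like "constant.v1": the same value at every step
    return repeat(rate)

def constant_then(rate, steps, schedule):
    # like "constant_then.v1": `rate` for `steps` steps, then the wrapped schedule
    return chain(repeat(rate, steps), schedule)

# The config above builds something like this: an iterator of floats,
# which is why validating it against a plain float field fails.
factors = constant_then(0.0, 2000, constant(1.0))
print(list(islice(factors, 3)))  # [0.0, 0.0, 0.0]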

If you look at a place where the value can be a sequence or a float, like the learn_rate in Adam, you'll see that the type is annotated as FloatOrSeq. In contrast, grad_factor is annotated as just a float.
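For comparison, this is roughly how a schedule plugs into a FloatOrSeq parameter such as Adam's learn_rate in a spaCy config (the schedule and values here are only illustrative):

[training.optimizer]
@optimizers = "Adam.v1"

[training.optimizer.learn_rate]
@schedules = "warmup_linear.v1"
warmup_steps = 250
total_steps = 20000
initial_rate = 5e-5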

This also isn't just a type issue: the implementation of the Transformer architecture would need to be changed to work with non-constant values. Looking at it, though, I don't think that would be complicated.
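Purely as an illustration of what supporting a sequence would involve, here is a hypothetical sketch of consuming either a float or a schedule per update step; the names are made up and this is not the spacy-transformers code:

from typing import Iterable, Iterator, Union

FloatOrSeq = Union[float, Iterable[float]]  # same idea as thinc's alias

def make_grad_factor_stepper(grad_factor: FloatOrSeq):
    """Return a callable giving the gradient factor for the current step."""
    if isinstance(grad_factor, (int, float)):
        return lambda: float(grad_factor)
    it: Iterator[float] = iter(grad_factor)
    current = [0.0]

    def step() -> float:
        try:
            current[0] = next(it)
        except StopIteration:
            pass  # keep the last value once the schedule runs out
        return current[0]

    return step

# The backprop callback would then multiply its gradients by step()
# on every update instead of by a fixed float.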

I've wanted this feature myself when training models before, so I think we could certainly consider adding it.

polm avatar Aug 29 '21 12:08 polm