cannot use scheduler for grad_factor
In my model implementation, I would like to freeze the transformer (roberta-base used in a Tok2VecTransformer.v1) for the first 2 epochs of training. From this spaCy documentation, it seems it should be possible to set grad_factor to 0 in order to disable gradients from one of the listeners. Setting this up per epoch should then be possible, according to the same documentation, by using a schedule. In my config, I have specified the constant_then.v1 schedule with a nested constant.v1 schedule in the following way:
[components.seq2labels.model.tok2vec]
@architectures = "spacy-transformers.Tok2VecTransformer.v1"
name = "roberta-base"
tokenizer_config = {"use_fast": true}

[components.seq2labels.model.tok2vec.grad_factor]
@schedules = "constant_then.v1"
rate = 0.0
steps = 2000

[components.seq2labels.model.tok2vec.grad_factor.schedule]
@schedules = "constant.v1"
rate = 1.0
When initializing, I get the following error:
=========================== Initializing pipeline ===========================
✘ Config validation error
seq2labels.model.tok2vec -> grad_factor value is not a valid float
It seems to me that the scheduler may be returning an iterator instead of a float that can be used as a value here. Have I overlooked some aspect that still needs to be implemented or amended?
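A quick check outside the config seems to support that idea (a rough sketch, assuming the registered functions are the ones exported from thinc.api):

from thinc.api import constant, constant_then

# Roughly what the grad_factor block above resolves to: 0.0 for the first
# 2000 steps, then the nested constant schedule's 1.0 afterwards.
grad_factor = constant_then(0.0, 2000, constant(1.0))
print(type(grad_factor))  # a schedule/generator object, not a float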
Otherwise, if this scheduler does not work with grad_factor, is there another way to freeze the transformer only for the first 2 epochs of training?
Thanks for any help in advance :)
This is basically because grad_factor isn't designed to take a sequence of values, like an iterator, as you note. That's not just a validation oversight: the transformers model doesn't support a sequence there at the moment.
If you look at a place where the value can be a sequence or float, like the learn rate in Adam, you'll see that the type is annotated as FloatOrSeq. In contrast, grad_factor is just a float.
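For illustration, the contrast looks roughly like this (signatures paraphrased, not copied from thinc or spacy-transformers):

from typing import Generator, List, Union

# FloatOrSeq, as thinc defines it for optimizer hyperparameters (paraphrased):
FloatOrSeq = Union[float, List[float], Generator]

# Simplified stand-ins for the real factories, just to show the annotations:
def adam_like(learn_rate: FloatOrSeq = 0.001) -> None:
    """The learn rate may be a plain float or a schedule of per-step values."""

def tok2vec_transformer_like(grad_factor: float = 1.0) -> None:
    """grad_factor is validated as a single float, so a schedule is rejected."""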
This also isn't just a type issue: the implementation of the Transformer architecture would need to be changed to work with non-constant values. Looking at it, I don't think it would be complicated.
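Purely as a sketch of what I mean (a hypothetical helper, not actual spacy-transformers code), the wrapper could normalise whatever was configured into a per-step value:

from typing import Iterator, Union

def make_grad_factor_getter(grad_factor: Union[float, Iterator[float]]):
    # Hypothetical helper: accept either a plain float or a schedule and
    # return a callable that produces the current factor once per backward pass.
    if isinstance(grad_factor, (int, float)):
        return lambda: float(grad_factor)
    iterator = iter(grad_factor)
    last = [1.0]
    def next_value() -> float:
        try:
            last[0] = float(next(iterator))
        except StopIteration:
            pass  # keep using the last value once the schedule is exhausted
        return last[0]
    return next_value

# The listener's backprop would then multiply gradients by next_value()
# each step instead of by a fixed float.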
I've wanted this feature myself when training models before, so I think we could certainly consider adding it.