sparseml
BaseScheduled has start_epoch/end_epoch loaded as strings
Describe the bug
Saving a transformers model that was trained with a modifier whose start_epoch is, for example, 0.000001 fails with the following exception:
TypeError: '<' not supported between instances of 'str' and 'int'
This happens when the following line is reached: https://github.com/neuralmagic/sparseml/blob/7a5971c2ecd26c45fbd060b9eaf3fcfe3c9efd94/src/sparseml/transformers/sparsification/trainer.py#L459
The str call there converts a value like 0.000001 into scientific notation, 1e-06. When BaseManager.compose_staged is then invoked, the constructor of BaseScheduled receives the string 1e-06 instead of the float 0.000001, and the call that validates the loaded schedule fails because start_epoch is a string: https://github.com/neuralmagic/sparseml/blob/7a5971c2ecd26c45fbd060b9eaf3fcfe3c9efd94/src/sparseml/optim/modifier.py#L714
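The mechanism can be shown in plain Python without sparseml. The snippet below is only a sketch: it reproduces the str conversion and the failing comparison, and coerce_epoch is a hypothetical helper (not part of sparseml) showing one way the value could be converted back to a float before validation.

```python
# Illustration only: plain Python, no sparseml imports.
start_epoch = 0.000001

# str() switches to scientific notation for small floats.
as_text = str(start_epoch)
print(as_text)  # 1e-06

# If that text is later handed back as a string instead of a float,
# a range check against an int fails the same way as in this report.
try:
    as_text < 0
    error_message = None
except TypeError as err:
    error_message = str(err)
print(error_message)


def coerce_epoch(value):
    """Hypothetical helper: turn a str/int/float epoch into a float."""
    return float(value)


# Coercing before validation recovers the original value.
restored = coerce_epoch(as_text)
print(restored == start_epoch)  # True
```

Whether the fix belongs in the serialization step or in the BaseScheduled constructor is a design choice for the maintainers; the helper above only shows that the string form is still losslessly convertible.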
To Reproduce
Load a pretrained transformer model and try to fine-tune it with the following recipe:
training_modifiers:
  - !EpochRangeModifier
    start_epoch: 0
    end_epoch: 13
  - !TrainableParamsModifier
    params:
      - re:bert.encoder.layer.*.attention.self.query.weight
      - re:bert.encoder.layer.*.attention.self.key.weight
      - re:bert.encoder.layer.*.attention.self.value.weight
      - re:bert.encoder.layer.*.attention.output.dense.weight
      - re:bert.encoder.layer.*.intermediate.dense.weight
      - re:bert.encoder.layer.*.output.dense.weight
    trainable: False
    params_strict: True
    start_epoch: 0.00001
    end_epoch: 1
Then use something like --save_strategy steps and --save_steps 2 so that a checkpoint is saved early and the exception is reproduced quickly.