BaseScheduled has start_epoch/end_epoch loaded as strings

Open eldarkurtic opened this issue 3 years ago • 0 comments

Describe the bug

Saving a transformers model that was trained with a modifier that has, for example, start_epoch: 0.000001 fails with the following exception:

TypeError: '<' not supported between instances of 'str' and 'int'

This happens when the following line is reached: https://github.com/neuralmagic/sparseml/blob/7a5971c2ecd26c45fbd060b9eaf3fcfe3c9efd94/src/sparseml/transformers/sparsification/trainer.py#L459

The str call there converts values like 0.000001 into scientific notation (1e-06). When BaseManager.compose_staged is later invoked, the constructor of BaseScheduled receives the string 1e-06 instead of the float 0.000001. Validation of the loaded schedule then fails, because start_epoch is a string, at this line: https://github.com/neuralmagic/sparseml/blob/7a5971c2ecd26c45fbd060b9eaf3fcfe3c9efd94/src/sparseml/optim/modifier.py#L714
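The failure mode can be illustrated in isolation with plain Python. This is only a sketch, not the sparseml code: validate_start_epoch and coerce_epoch are hypothetical helpers standing in for the schedule validation and for one possible fix. A likely reason the value stays a string after the recipe is re-parsed is that YAML 1.1 loaders such as PyYAML only treat exponent notation as a float when the mantissa contains a dot, but that part is an assumption and is not exercised here.

```python
# Hypothetical stand-in for the schedule validation; not the sparseml code.
def validate_start_epoch(start_epoch):
    # A numeric range check like this raises TypeError when given a string.
    if start_epoch < 0:
        raise ValueError("start_epoch must be >= 0")
    return start_epoch


# str() switches to scientific notation for small floats.
naive_text = str(0.000001)
print(naive_text)  # 1e-06

# If the value is later read back as a string, the comparison fails.
try:
    validate_start_epoch(naive_text)
    error_message = ""
except TypeError as err:
    error_message = str(err)
print(error_message)


# One possible workaround (an assumption, not the project's actual fix):
# coerce the loaded value back to float before validating it.
def coerce_epoch(value):
    return float(value)


print(validate_start_epoch(coerce_epoch(naive_text)))
```

Coercing at load time keeps the recipe text unchanged. An alternative would be to avoid the str call when serializing, which is outside what this sketch covers.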

To Reproduce

Exact steps to reproduce the behavior: load a pretrained transformer model and try to finetune it with the following recipe:


training_modifiers:
  - !EpochRangeModifier
    start_epoch: 0
    end_epoch: 13

  - !TrainableParamsModifier
    params:
      - re:bert.encoder.layer.*.attention.self.query.weight
      - re:bert.encoder.layer.*.attention.self.key.weight
      - re:bert.encoder.layer.*.attention.self.value.weight
      - re:bert.encoder.layer.*.attention.output.dense.weight
      - re:bert.encoder.layer.*.intermediate.dense.weight
      - re:bert.encoder.layer.*.output.dense.weight
    trainable: False
    params_strict: True
    start_epoch: 0.00001
    end_epoch: 1

Then pass something like --save_strategy steps and --save_steps 2 to trigger a save early and reproduce the exception quickly.

eldarkurtic, Sep 04 '22 12:09