[QUESTION] Why does training speed go down?
I noticed that comet-train (after encoder fine-tuning) runs at ~12 it/s at e.g. 30% of the epoch, which drops to ~7 it/s at 60% and to ~6 it/s at 90% of the epoch.
- Is that something particular to only me or did anyone else observe this as well?
- If yes, is this expected behaviour?
I'm using NVIDIA A10G GPUs and the following software versions:
- Python - 3.10.9
- COMET - upstream
- torch - 2.0.1
- pytorch-lightning - 1.9.5
- transformers - 4.29.0
- numpy - 1.24.3
Hi zouharvi,
I noticed this behavior as well. I think it has something to do with "Encoder model fine-tuning": after that point, the speed gradually decreases for me from 13.98 it/s to 5.85 it/s by the end of the epoch.
Could someone comment on whether this is expected behavior?
Indeed, without encoder fine-tuning (nr_frozen_epochs=1) this does not happen. Shot in the dark: I wonder if there is some memory leak associated with the fine-tuning that leaves gradient-tracking objects on the GPU?
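One way to check this would be to log the allocated GPU memory every few hundred steps and see whether it keeps growing after the "Encoder model fine-tuning" message appears. Below is a minimal sketch of such a check as a pytorch-lightning callback; the GPUMemoryMonitor name is mine, it is not part of COMET, and you would have to wire it into the Trainer in your local install yourself.

```python
import torch
import pytorch_lightning as pl


class GPUMemoryMonitor(pl.Callback):
    """Print current and peak GPU memory every `every_n_steps` training steps."""

    def __init__(self, every_n_steps: int = 500):
        self.every_n_steps = every_n_steps

    def on_train_batch_end(self, trainer, pl_module, outputs, batch, batch_idx):
        if batch_idx % self.every_n_steps != 0:
            return
        allocated = torch.cuda.memory_allocated() / 2**20   # MiB currently held by live tensors
        peak = torch.cuda.max_memory_allocated() / 2**20     # peak MiB since the start of training
        print(f"step {trainer.global_step}: allocated={allocated:.0f} MiB, peak={peak:.0f} MiB")
```

If the allocated number climbs steadily from the point where the encoder is unfrozen, that would support the leak theory; if it stays flat, the slowdown is probably just the cost of backpropagating through the full encoder.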
hmmm and what happens on the second epoch? I actually never noticed this...
In the second and subsequent epochs it converges to ~5 it/s for me (A10G with batch size 6).
Hi, I trained two reference-free QE models on in-domain data with 300k segments: one with nr_frozen_epochs=0.3 (as proposed in the config in this repo) and the other with nr_frozen_epochs=1. The rest of the parameters stayed the same.
The True Positive Rate of the predictions is lower by about 10% when using nr_frozen_epochs=1, so the model where the encoder fine-tuning takes place later performs worse.
Training was indeed faster up to the end of the first epoch; after that, the "Encoder model fine-tuning" took place (as intended).
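For anyone wondering what the two settings actually change: nr_frozen_epochs controls how long the encoder stays frozen at the start of training, measured in epochs, so 0.3 unfreezes it 30% into the first epoch while 1 keeps it frozen for the whole first epoch. Here is a rough sketch of that schedule as a pytorch-lightning callback; this is only my illustration, not COMET's code, and the pl_module.encoder attribute name is an assumption.

```python
import pytorch_lightning as pl


class EncoderFreezeSchedule(pl.Callback):
    """Keep the encoder frozen for the first `nr_frozen_epochs` epochs, then unfreeze it."""

    def __init__(self, nr_frozen_epochs: float = 0.3):
        self.nr_frozen_epochs = nr_frozen_epochs
        self._frozen = False

    def _set_frozen(self, encoder, frozen: bool) -> None:
        for p in encoder.parameters():
            p.requires_grad = not frozen
        self._frozen = frozen

    def on_train_start(self, trainer, pl_module):
        # `pl_module.encoder` is an assumed attribute name for illustration.
        if self.nr_frozen_epochs > 0:
            self._set_frozen(pl_module.encoder, frozen=True)

    def on_train_batch_start(self, trainer, pl_module, batch, batch_idx):
        # Training progress measured in epochs, e.g. 0.3 means 30% into the first epoch.
        progress = trainer.current_epoch + batch_idx / trainer.num_training_batches
        if self._frozen and progress >= self.nr_frozen_epochs:
            print("Encoder model fine-tuning")  # the message seen in the training logs
            self._set_frozen(pl_module.encoder, frozen=False)
```

Once the encoder is unfrozen, every step also backpropagates through the full pretrained encoder, which would be consistent with the per-step slowdown reported above.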