[QUESTION] Why does training speed go down?
I noticed that comet-train (after encoder fine-tuning) runs at ~12 it/s at e.g. 30% of the epoch, which drops to ~7 it/s at 60% and to ~6 it/s at 90% of the epoch.
- Is that something particular to only me or did anyone else observe this as well?
- If yes, is this expected behaviour?
I'm using NVIDIA A10G GPUs and the following software versions:
- Python - 3.10.9
- COMET - upstream
- torch - 2.0.1
- pytorch-lightning - 1.9.5
- transformers - 4.29.0
- numpy - 1.24.3
Hi zouharvi,
I noticed this behavior as well. I think it has something to do with "Encoder model fine-tuning": after that point, the speed gradually decreases for me from 13.98 it/s to 5.85 it/s by the end of the epoch.
Could someone comment on whether this is expected behavior?
Indeed, without encoder fine-tuning (nr_frozen_epochs=1) this does not happen. Shot in the dark: I wonder if there is some memory leak associated with the fine-tuning that leaves gradient-tracking objects on the GPU?
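One way to check this would be to log the allocated GPU memory every few hundred steps and see whether it keeps growing after the "Encoder model fine-tuning" message appears. Below is a minimal sketch of such a check as a pytorch-lightning callback; the GPUMemoryMonitor name is mine, it is not part of COMET, and you would have to wire it into the Trainer in your local install yourself.

```python
import torch
import pytorch_lightning as pl


class GPUMemoryMonitor(pl.Callback):
    """Print current and peak GPU memory every `every_n_steps` training steps."""

    def __init__(self, every_n_steps: int = 500):
        self.every_n_steps = every_n_steps

    def on_train_batch_end(self, trainer, pl_module, outputs, batch, batch_idx):
        if batch_idx % self.every_n_steps != 0:
            return
        allocated = torch.cuda.memory_allocated() / 2**20   # MiB currently held by live tensors
        peak = torch.cuda.max_memory_allocated() / 2**20     # peak MiB since the start of training
        print(f"step {trainer.global_step}: allocated={allocated:.0f} MiB, peak={peak:.0f} MiB")
```

If the allocated number climbs steadily from the point where the encoder is unfrozen, that would support the leak theory; if it stays flat, the slowdown is probably just the cost of backpropagating through the full encoder.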
hmmm and what happens on the second epoch? I actually never noticed this...
In the second and subsequent epochs it converges to ~5 it/s for me (A10G with batch size 6).
Hi, I trained two reference-free QE models on in-domain data with 300k segments: one with nr_frozen_epochs=0.3 (as proposed in the config in this repo) and the other with nr_frozen_epochs=1. The rest of the parameters stayed the same.
The True Positive Rate of the predictions is lower by about 10% when using nr_frozen_epochs=1, so the model where the encoder fine-tuning takes place later performs worse.
Training was indeed faster up to the end of the first epoch; after that, the "Encoder model fine-tuning" took place (as intended).
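For anyone wondering what the two settings actually change: nr_frozen_epochs controls how long the encoder stays frozen at the start of training, measured in epochs, so 0.3 unfreezes it 30% into the first epoch while 1 keeps it frozen for the whole first epoch. Here is a rough sketch of that schedule as a pytorch-lightning callback; this is only my illustration, not COMET's code, and the pl_module.encoder attribute name is an assumption.

```python
import pytorch_lightning as pl


class EncoderFreezeSchedule(pl.Callback):
    """Keep the encoder frozen for the first `nr_frozen_epochs` epochs, then unfreeze it."""

    def __init__(self, nr_frozen_epochs: float = 0.3):
        self.nr_frozen_epochs = nr_frozen_epochs
        self._frozen = False

    def _set_frozen(self, encoder, frozen: bool) -> None:
        for p in encoder.parameters():
            p.requires_grad = not frozen
        self._frozen = frozen

    def on_train_start(self, trainer, pl_module):
        # `pl_module.encoder` is an assumed attribute name for illustration.
        if self.nr_frozen_epochs > 0:
            self._set_frozen(pl_module.encoder, frozen=True)

    def on_train_batch_start(self, trainer, pl_module, batch, batch_idx):
        # Training progress measured in epochs, e.g. 0.3 means 30% into the first epoch.
        progress = trainer.current_epoch + batch_idx / trainer.num_training_batches
        if self._frozen and progress >= self.nr_frozen_epochs:
            print("Encoder model fine-tuning")  # the message seen in the training logs
            self._set_frozen(pl_module.encoder, frozen=False)
```

Once the encoder is unfrozen, every step also backpropagates through the full pretrained encoder, which would be consistent with the per-step slowdown reported above.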