Which model metric should I focus on to stop training: validation top-N accuracy or validation loss?
While training my model, I am having trouble deciding the right time to stop training.
Image 1 (accuracy curves): green -- validation, pink -- training
Image 2 (loss curves): green -- validation, pink -- training
As you can see in Image 1, training and validation accuracy are increasing continuously; however, the loss graph (Image 2) seems strange to me. After epoch 20 the validation loss starts to rise even though the training loss is still dropping, which looks like overfitting.
My question is: should I keep training as long as validation accuracy increases, or stop as soon as validation loss starts to increase?
Thanks for your answers!
I'd suggest using loss for early stopping. These curves look like overfitting, but they also suggest that your train and val sets might not follow the same statistics. It could be that your val set is too small or that there are some differences introduced by how you split them.
I'd first confirm your validation set is representative of the whole dataset, then you can be certain of whether you have an overfitting problem.
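If you go with loss-based early stopping, here is a minimal Keras sketch (assuming a standard model.fit training loop; model, train_ds and val_ds are placeholders for your own objects):

```python
import tensorflow as tf

# Stop once validation loss has not improved for 5 epochs,
# and roll back to the best weights seen so far.
early_stopping = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss",          # or a val_* top-K metric with mode="max"
    patience=5,
    restore_best_weights=True,
)

model.fit(
    train_ds,
    validation_data=val_ds,
    epochs=100,
    callbacks=[early_stopping],
)
```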
@patrickorlando thanks for your answer. Validation/training was randomly split with a 20/80 ratio. What I am wondering is why the validation accuracy keeps increasing in parallel with the validation loss.
And this behaviour happens for different splits? It does look odd. It is possible for accuracy to increase whilst loss increases but I'm still not sure what is going on.
- What is your batch size and is it the same for train/val sets?
- How many candidates does your dataset have?
- Are you passing the candidate_ids to your loss calculation to remove accidental hits?
Assuming you split your data sensibly, I would always trust the top K metric first - it is most closely correlated with your desired objective.
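On the third bullet, here is a minimal sketch of passing candidate_ids so the retrieval task can discount accidental hits. It assumes a recent TFRS release (where tfrs.tasks.Retrieval accepts remove_accidental_hits and its call accepts candidate_ids); the query/candidate towers and the "user_id"/"item_id" feature names are placeholders:

```python
import tensorflow_recommenders as tfrs

class RetrievalModel(tfrs.Model):
    def __init__(self, query_model, candidate_model, candidates):
        super().__init__()
        self.query_model = query_model
        self.candidate_model = candidate_model
        self.task = tfrs.tasks.Retrieval(
            metrics=tfrs.metrics.FactorizedTopK(
                candidates=candidates.batch(128).map(candidate_model)
            ),
            # Don't treat an in-batch duplicate of the positive as a negative.
            remove_accidental_hits=True,
        )

    def compute_loss(self, features, training=False):
        query_embeddings = self.query_model(features["user_id"])
        candidate_embeddings = self.candidate_model(features["item_id"])
        return self.task(
            query_embeddings,
            candidate_embeddings,
            # Lets the task identify duplicate candidates within the batch.
            candidate_ids=features["item_id"],
        )
```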
I observed a similar effect and noticed others have as well. Here is how I overcame it:
- I made sure my splitting strategy was sensible, in that my dev (validation) and test (evaluation) sets follow similar distributions. Since my train set is sampled from a historical interaction window up to time T and my dev/test sets are sampled from a smaller future window after time T, there is some natural interaction drift, but not enough to pose issues (in my case).
- I made sure to combat overfitting by adding data, regularization, reducing model capacity, etc.
- I applied candidate_sampling_probability to reduce the popularity bias introduced by in-batch negative sampling. TFRS implements substantial functionality discussed in this research paper, which covers the topic both theoretically and practically.
- I used a multi-task network minimizing a weighted combination of retrieval and ranking loss.
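For the last point, a rough sketch of such a multi-task model in the spirit of the TFRS multitask tutorial (the towers, feature names, rating label, and loss weights are all placeholders/assumptions, not my exact setup):

```python
import tensorflow as tf
import tensorflow_recommenders as tfrs

class MultiTaskModel(tfrs.Model):
    def __init__(self, query_model, candidate_model, candidates,
                 retrieval_weight=1.0, ranking_weight=1.0):
        super().__init__()
        self.query_model = query_model
        self.candidate_model = candidate_model
        # Small regression head for the ranking objective.
        self.rating_model = tf.keras.Sequential([
            tf.keras.layers.Dense(64, activation="relu"),
            tf.keras.layers.Dense(1),
        ])
        self.retrieval_task = tfrs.tasks.Retrieval(
            metrics=tfrs.metrics.FactorizedTopK(
                candidates=candidates.batch(128).map(candidate_model)
            )
        )
        self.ranking_task = tfrs.tasks.Ranking(
            loss=tf.keras.losses.MeanSquaredError(),
            metrics=[tf.keras.metrics.RootMeanSquaredError()],
        )
        self.retrieval_weight = retrieval_weight
        self.ranking_weight = ranking_weight

    def compute_loss(self, features, training=False):
        query_embeddings = self.query_model(features["user_id"])
        candidate_embeddings = self.candidate_model(features["item_id"])
        rating_predictions = self.rating_model(
            tf.concat([query_embeddings, candidate_embeddings], axis=1)
        )
        retrieval_loss = self.retrieval_task(query_embeddings, candidate_embeddings)
        ranking_loss = self.ranking_task(
            labels=features["rating"], predictions=rating_predictions
        )
        # Weighted combination of the two objectives.
        return (self.retrieval_weight * retrieval_loss
                + self.ranking_weight * ranking_loss)
```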
@deychak Thanks for pointing us to candidate_sampling_probability. What is an example of how you populate the tensor for that parameter?
@rlcauvin I recommend (pun intended) reading the research above. TL;DR: I computed the distribution over candidates based on how often they occur in my interaction training set. For example:
| query | candidate | candidate sampling prob |
|-------|-----------|-------------------------|
| x     | a         | 0.5                     |
| y     | a         | 0.5                     |
| z     | b         | 0.25                    |
| y     | c         | 0.25                    |
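One way to precompute that column (a sketch using pandas; the interactions DataFrame and its column names are assumptions standing in for your training interactions):

```python
import pandas as pd

# Training interactions, one row per (query, candidate) pair.
interactions = pd.DataFrame({
    "query": ["x", "y", "z", "y"],
    "candidate": ["a", "a", "b", "c"],
})

# Empirical probability of each candidate appearing in the training interactions.
candidate_probs = (
    interactions["candidate"]
    .value_counts(normalize=True)
    .rename("candidate_sampling_prob")
)

# Attach the precomputed probability to every interaction row.
interactions = interactions.join(candidate_probs, on="candidate")
print(interactions)
#   query candidate  candidate_sampling_prob
# 0     x         a                     0.50
# 1     y         a                     0.50
# 2     z         b                     0.25
# 3     y         c                     0.25
```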
Thanks, @deychak. I did see in the research paper that we can use the candidate frequencies from the training data. My question was more around the structure of the tensor passed to the call method.
The tensor's shape is [batch_size, 1] or [batch_size], where each row in the training set is a query-candidate interaction pair and each candidate has a precomputed candidate sampling probability.
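In other words, inside compute_loss you pass that per-row probability alongside the embeddings (a sketch extending the retrieval model above; the "candidate_sampling_prob" feature name is a placeholder for the precomputed column):

```python
def compute_loss(self, features, training=False):
    query_embeddings = self.query_model(features["user_id"])
    candidate_embeddings = self.candidate_model(features["item_id"])
    return self.task(
        query_embeddings,
        candidate_embeddings,
        # Shape [batch_size]: the precomputed probability for this row's candidate.
        candidate_sampling_probability=features["candidate_sampling_prob"],
    )
```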
Thanks @deychak for your contribution (Y)