
different loss from GPU

Open benyaminjami opened this issue 4 years ago • 1 comments

I trained two ALS models on the same data, one on GPU and one on CPU. The GPU model reports a loss of 0.06, while the CPU model reports a loss of 0.0006. Their results are completely different.

benyaminjami, Jan 23 '21 10:01

We are having the same problem: when we set use_gpu=True, our music recommendations blow up and give us random noise, but when we train on the CPU (9 hours), our ALS model makes good predictions. Not sure what to do.

dougturnbull, Aug 03 '22 19:08

Hello, is there any news on this topic, or a possible workaround? I'm observing the same behaviour on v0.6.2. Thank you!

josumsc, Jan 09 '23 10:01

Hello @benfred, we are observing the same behavior as described above: different loss when training on GPU and CPU, very different score values, and differences in the top recommended items.

We see that you wrote in issue 367 that the GPU loss calculation might be buggy. Are there any updates on this, or advice going forward?

Thank you!

Details:

import implicit.als

# user_ratings is assumed to be a user-by-item scipy.sparse CSR matrix

# GPU
model_gpu = implicit.als.AlternatingLeastSquares(factors=128, alpha=0.003, regularization=100, iterations=15, use_gpu=True, calculate_training_loss=True)
model_gpu.fit(user_ratings)

# CPU (same hyperparameters, only use_gpu differs)
model_cpu = implicit.als.AlternatingLeastSquares(factors=128, alpha=0.003, regularization=100, iterations=15, use_gpu=False, calculate_training_loss=True)
model_cpu.fit(user_ratings)

# Top-50 recommendations for the same user from both models
userid = 94943
id_gpu, scores_gpu = model_gpu.recommend(userid, user_ratings[userid], N=50, filter_already_liked_items=False)
id_cpu, scores_cpu = model_cpu.recommend(userid, user_ratings[userid], N=50, filter_already_liked_items=False)

Giving scores: [screenshot: Skjermbilde 2023-01-13 kl 14 36 56]
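Since the scores above are only available as a screenshot, a numeric summary of how far the two top-N lists diverge may be easier to share. The following is a minimal sketch, not part of implicit: `compare_recommendations` is a hypothetical helper that assumes each model returned an array of item ids and a matching array of scores, as in the `recommend` calls above.

```python
import numpy as np


def compare_recommendations(ids_a, scores_a, ids_b, scores_b):
    """Summarise how much two top-N recommendation lists agree.

    Hypothetical helper for illustration; not part of the implicit library.
    """
    ids_a, ids_b = np.asarray(ids_a), np.asarray(ids_b)
    scores_a = np.asarray(scores_a, dtype=float)
    scores_b = np.asarray(scores_b, dtype=float)

    # Items recommended by both models, and by either model
    shared = np.intersect1d(ids_a, ids_b)
    union = np.union1d(ids_a, ids_b)
    jaccard = len(shared) / len(union) if len(union) else 1.0

    # Look up each shared item's score in both lists
    lookup_a = dict(zip(ids_a.tolist(), scores_a.tolist()))
    lookup_b = dict(zip(ids_b.tolist(), scores_b.tolist()))
    diffs = [abs(lookup_a[i] - lookup_b[i]) for i in shared.tolist()]

    return {
        "shared": len(shared),
        "jaccard": jaccard,
        "max_abs_score_diff": max(diffs) if diffs else float("nan"),
    }
```

Calling `compare_recommendations(id_gpu, scores_gpu, id_cpu, scores_cpu)` would then give the number of shared items, the Jaccard overlap of the two lists, and the largest score gap on shared items.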

linncecilie, Jan 13 '23 13:01

#663 should fix the problems with the calculate_training_loss flag showing incorrect results, as reported by @benyaminjami.

However, this won't fix the issue reported by @linncecilie =(. The training loss calculation is only a diagnostic metric and doesn't affect the learned parameters. @linncecilie, do you have any more information to help diagnose this (a sample dataset, a reproducer on a public dataset, etc.)? The movielens and lastfm examples included in this repo both give the same results for me, and I'm not seeing huge differences in scores on the model.recommend call like you have there.
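Because the reported loss is only a diagnostic, one way to take the calculate_training_loss flag out of the picture is to recompute a loss directly from the learned factors of each model. The sketch below uses the textbook implicit-feedback ALS objective (confidence `1 + alpha * r`, binary preference), which is an assumption and may not match the normalization implicit uses internally, so only compare values produced by this same function. `implicit_als_loss` is a hypothetical helper, and it densifies the matrix, so it is only suitable for a small subsample.

```python
import numpy as np


def implicit_als_loss(user_items, user_factors, item_factors,
                      alpha=1.0, regularization=0.0):
    """Textbook implicit-feedback ALS objective, computed densely.

    Hypothetical helper for illustration; assumes confidence = 1 + alpha * r
    and preference = 1 where r > 0. Not implicit's own loss routine.
    """
    # Accept either a scipy.sparse matrix or a dense array (small inputs only)
    if hasattr(user_items, "toarray"):
        R = user_items.toarray()
    else:
        R = np.asarray(user_items)
    R = R.astype(float)
    X = np.asarray(user_factors, dtype=float)
    Y = np.asarray(item_factors, dtype=float)

    P = (R > 0).astype(float)   # binary preference
    C = 1.0 + alpha * R         # confidence weights
    err = P - X @ Y.T           # prediction error for every (user, item)

    penalty = regularization * ((X ** 2).sum() + (Y ** 2).sum())
    return float((C * err ** 2).sum() + penalty)
```

If this function gives similar values for the CPU and GPU factors, the discrepancy is likely confined to the reporting; if the values differ widely, the learned parameters themselves differ. Note that a GPU model keeps its factors on the device, so they would need to be copied to NumPy arrays first (recent implicit versions appear to offer a `to_cpu()` method on GPU models for this, but check the version you have installed).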

benfred, Jun 06 '23 21:06