Including item features seems to reduce performance.

jemmott opened this issue 3 years ago • 12 comments

First, it is totally possible that I am misunderstanding something basic or have a bug in my code.

But I am consistently finding that adding item features actually reduces performance compared with collaborative filtering.

I first did the analysis on some internal data, but reproduced it with a public example to share here. Here is a notebook with an example on goodreads data: https://github.com/jemmott/lightfm-goodbooks-debug

The punch line: I looked at an implicit-feedback example - trying to predict whether a user will rate a book. I used mean reciprocal rank (MRR) as the metric, but results were similar for recall@k and precision@k. Performance is significantly reduced when I include authors as an item feature, compared with no item features (pure collaborative filtering). I did not explore user features.

On a hunch I tested something kind of strange: I shuffled the item features, permuting them so that they are randomly assigned to items. I then trained and cross-validated LightFM and measured the change in MRR. I repeated that 100 times and drew a histogram of the results, shown in blue below. The x axis is the percent change from CF; the red line is the result with the actual item assignments.

[Figure: histogram of the change in MRR (% vs. pure CF) over 100 random shufflings of the item features (blue), with the actual feature assignment marked by the red line.]
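
For concreteness, here is a rough sketch of that shuffling experiment. It assumes `train` / `test` interaction matrices and an `item_features` sparse matrix built as in the linked notebook; the hyperparameters below are placeholders, not the exact values used.

```python
import numpy as np
from lightfm import LightFM
from lightfm.evaluation import reciprocal_rank

def mean_rr(features):
    # Train with the given item feature matrix (or None for pure CF) and
    # return the mean reciprocal rank on the held-out interactions.
    model = LightFM(loss='warp', random_state=42)
    model.fit(train, item_features=features, epochs=30, num_threads=4)
    return reciprocal_rank(model, test, train_interactions=train,
                           item_features=features, num_threads=4).mean()

cf_mrr = mean_rr(None)               # pure collaborative filtering
actual_mrr = mean_rr(item_features)  # real author features

# Randomly reassign feature rows to items and re-evaluate, 100 times.
shuffled_mrrs = []
for _ in range(100):
    perm = np.random.permutation(item_features.shape[0])
    shuffled_mrrs.append(mean_rr(item_features[perm]))
```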

What this tells me is that not only are the item features actually reducing performance, but the actual (non-shuffled) item features are on average no better at predicting than randomly shuffled ones. This seems really bad.

I also tried an example where I included the item ids as item features, to add the identity matrix back in. Performance was still worse than pure CF (no features), but it did improve slightly over using the author features alone.
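
For reference, a minimal sketch of two ways to include the identity block (toy data; the names are placeholders):

```python
import scipy.sparse as sp
from lightfm.data import Dataset

# Toy example: 3 items described by 2 author features.
n_items = 3
author_features = sp.csr_matrix([[1, 0], [1, 0], [0, 1]], dtype='float32')

# Option 1: stack an identity block onto the metadata features so each item
# keeps its own free embedding in addition to the shared author embedding.
item_features_with_id = sp.hstack(
    [sp.identity(n_items, format='csr', dtype='float32'), author_features],
    format='csr',
)  # shape (3, 3 + 2)

# Option 2: the Dataset helper adds per-item identity features by default.
dataset = Dataset(item_identity_features=True)
```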

It seems like I am not alone with this - here are two other examples of people seeing worse performance when adding item features:

  • https://www.ethanrosenthal.com/2016/11/07/implicit-mf-part-2/
  • https://towardsdatascience.com/recommendation-system-part-1-use-of-collaborative-filtering-and-hybrid-collaborative-content-in-6137ba64ad58

Anyone know what is going on here?

jemmott avatar Aug 07 '20 19:08 jemmott

You might find something useful in these issues: #497 , #486 , #430

SimonCW avatar Sep 17 '20 09:09 SimonCW

@jemmott did you manage to solve this problem?

I too am struggling with drastic performance degradation (in P@K) when using item features. I can't say the mentioned issues helped me much, but that is likely because I'm new to the whole recommendation-system world.

pdavis156879 avatar Dec 06 '20 19:12 pdavis156879

No, no real progress.

I have tasked some students with doing a comparison of LightFM against some other baselines as a class project, but I am not sure if any of them are using user or item features. If so, I will update.

I also did some user interviews, and based on the results we will be using LightFM, though not with user or item features.

jemmott avatar Dec 07 '20 15:12 jemmott

Tackled the same problem. If someone finds a good answer, please share.

guyba-tr avatar Dec 10 '20 11:12 guyba-tr

Has anyone actually seen an improvement on real data using it? I.e., is the LightFM implementation possibly broken entirely?

ddofer avatar Dec 30 '20 15:12 ddofer

The implementation isn't broken.

It is, however, very simple: the model simply averages the embeddings of all the features it is given. Because of the averaging, the model is incapable of figuring out which features are uninformative and ignoring them.

Consequently, if you add lots of uninformative features they will degrade your model by diluting the information provided by your good features. To prevent this, you may have to adopt more sophisticated models whose implementations are not offered by LightFM.

Note also that metadata features are likely to improve performance only on very sparse datasets, or sparse (long tail, cold-start) subsets of your data.
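
A toy numpy illustration of that dilution effect (not LightFM's actual code; the numbers are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
informative = rng.normal(size=(2, 8))   # embeddings of 2 informative features
noise = rng.normal(size=(50, 8))        # embeddings of 50 uninformative features

signal = informative.mean(axis=0)                        # item built from good features only
diluted = np.vstack([informative, noise]).mean(axis=0)   # same item plus noise features

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Averaging in many uninformative embeddings pulls the representation
# towards random noise and away from the informative signal.
print(cosine(signal, signal), cosine(diluted, signal))  # 1.0 vs. something much smaller
```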

maciejkula avatar Dec 30 '20 18:12 maciejkula

Did you add the identity matrix to your features matrix? At least I missed that in the beginning and got worse performance when including features.

HenrikRehbinder avatar Apr 29 '21 09:04 HenrikRehbinder

Yes, I tested both with and without the identity matrix. Adding the identity matrix helped, but still gave worse performance than no features at all.

Based on the feedback from maciejkula above, I don't think I am seeing the problem of adding a ton of uninformative features - in the goodreads example I only included the author as a feature, which I would expect to be a very strong signal. So it must be his last point - that metadata features only improve performance on very sparse data.

For what it's worth, we ended up with a hybrid architecture: LightFM does the CF part, a separate feature-based recommender handles metadata, and the results are combined. We are also exploring the TensorFlow Recommenders library, which can also combine CF with features (and more).

jemmott avatar Apr 29 '21 13:04 jemmott

I too was having performance issues when I added features to the model. The model performed better for users who had (in my case) fewer than 10 interactions in the training set, but performed poorly as the number of interactions increased.

What helped was giving a weight to each feature. The weights were obtained by training a random forest (using sklearn) on the data and taking the model's feature_importances_. Also, discretising numerical features into bins gave better results than using the raw value as the feature weight - and once a feature is binned, its weight slot is free to carry the importance score, as above.
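
A rough sketch of that weighting idea, with toy data and hypothetical column names (`pages_bin` stands in for a numerical feature discretised into bins):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# One row per (user, item) pair: metadata columns as inputs,
# `positive` = 1 if the pair was an observed interaction, 0 otherwise.
pairs = pd.DataFrame({
    'author_id': [1, 1, 2, 3, 2, 3],
    'pages_bin': [0, 1, 2, 1, 0, 2],
    'positive':  [1, 1, 0, 0, 1, 0],
})

X, y = pairs[['author_id', 'pages_bin']], pairs['positive']
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
importance = dict(zip(X.columns, rf.feature_importances_))

# Each feature token then gets the importance of the column it came from when
# building the LightFM item features, e.g. by passing
# (item_id, {feature_token: importance_of_its_column}) pairs to
# Dataset.build_item_features.
```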

furnecol-flutterint avatar May 25 '21 14:05 furnecol-flutterint

We are also exploring the TensorFlow recommender library

Hello @jemmott ! Have you tested it already? How does it perform?

Thanks!

almirb avatar Oct 11 '21 18:10 almirb

@Furnec what was the dependent variable for your RandomForest model?

shivamtundele avatar Feb 25 '22 19:02 shivamtundele

@Furnec what was the dependent variable for your RandomForest model?

@shivamtundele The RandomForest was trained as a binary classifier, predicting whether or not a given user-item pair was ‘positive’. From memory, I think we omitted user IDs and item IDs from the input data as we only wanted the relative importance of the features. I suspect we also down-weighted / downsampled negative interactions. In hindsight, using SHAP values probably would’ve been better than using sklearn’s feature_importances_.

This approach is by no means perfect, but it worked sufficiently well for us.
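
A hypothetical sketch of the SHAP variant (toy data; note that the return shape of `shap_values` differs between shap versions):

```python
import numpy as np
import pandas as pd
import shap
from sklearn.ensemble import RandomForestClassifier

X = pd.DataFrame({'author_id': [1, 1, 2, 3, 2, 3],
                  'pages_bin': [0, 1, 2, 1, 0, 2]})
y = [1, 1, 0, 0, 1, 0]
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

sv = shap.TreeExplainer(rf).shap_values(X)
sv = sv[1] if isinstance(sv, list) else sv[..., 1]  # positive-class attributions
importance = dict(zip(X.columns, np.abs(sv).mean(axis=0)))
```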

furnecol-flutterint avatar Feb 26 '22 07:02 furnecol-flutterint