
Hybrid model has lower Precision@K compared to pure CF

Open kientt15vinid opened this issue 4 years ago • 13 comments

Hi Maciej,

I'm testing LightFM for my recommendation system for grocery e-commerce (everything you could buy in a convenience store). I've compared the LightFM hybrid model against pure collaborative filtering (also LightFM, just without user and item features) and got a smaller precision@10. I've read your paper, and it seems to say that a hybrid model should outperform a pure CF model, but my experiments give the opposite result.

Here are the descriptions of my approach:

Dataset

  • Interaction matrix: 9915 x 17199; 98% sparsity (or ~2% density)

    • Purchase data of 9915 users across 17199 items.
    • All users have at least 1 transaction during the sample period
  • User features matrix: 9915 x 9930

    • Including an identity matrix and 15 additional features on age, gender, geographic regions, etc
  • Item features matrix: 17199 x 21007

    • Including an identity matrix and 3808 features based on brand name, categories, product descriptions
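
For concreteness, the "identity matrix plus metadata columns" layout described above can be sketched like this (toy sizes and random values, not the actual data):

```python
import numpy as np
from scipy import sparse

n_users = 5   # toy stand-in for the 9915 users
n_extra = 3   # toy stand-in for the 15 demographic features

# one indicator column per user (the "identity matrix" part) ...
identity = sparse.identity(n_users, format="csr")
# ... followed by the shared metadata columns (age, gender, region, ...)
extra = sparse.random(n_users, n_extra, density=0.5, format="csr", random_state=0)

user_features = sparse.hstack([identity, extra], format="csr")
print(user_features.shape)  # (5, 8) -> n_users x (n_users + n_extra)
```

Keeping the identity block means each user still gets its own free embedding, with the metadata embeddings added on top.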

Implementation:

  1. Train-test data split by timestamp (because a user might re-purchase the same item); the interaction matrix was then built using lightfm.data.Dataset.build_interactions()
  2. Training and evaluation:
# Use a separate LightFM instance per model: LightFM.fit() returns self,
# so reusing one `model` object would make both variables alias the same model.
model_hybrid = LightFM(loss='warp',
                       no_components=80,
                       item_alpha=1e-7,
                       learning_rate=0.02,
                       max_sampled=50)
model_simple = LightFM(loss='warp',
                       no_components=80,
                       item_alpha=1e-7,
                       learning_rate=0.02,
                       max_sampled=50)
  • Hybrid model
model_hybrid.fit(train,
                 item_features=item_features,
                 user_features=user_features,
                 epochs=80,
                 num_threads=4)

test_precision = precision_at_k(model_hybrid, test,
                                item_features=item_features,
                                user_features=user_features,
                                num_threads=4, k=10).mean()
  • Pure CF model
model_simple.fit(train,
                 epochs=80,
                 num_threads=4)

test_precision = precision_at_k(model_simple, test,
                                num_threads=4, k=10).mean()
  3. Results: hybrid precision@10 = 0.057814, pure-CF precision@10 = 0.070189. I've tried several things to raise the hybrid model's test precision: using a weight matrix for training, optimizing hyperparameters with grid search, normalizing the user and item features, and weighting the user and item features with TF-IDF. But so far, pure CF always outperforms the hybrid model.
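
As a side note, the TF-IDF weighting of features mentioned above can be sketched like this (toy count matrix with made-up values; a smoothed idf in the style of scikit-learn's TfidfTransformer):

```python
import numpy as np
from scipy import sparse

# toy item-feature count matrix (rows: items, cols: features)
counts = sparse.csr_matrix(np.array([[2., 0., 1.],
                                     [0., 3., 0.],
                                     [1., 1., 0.]]))

# document frequency of each feature, then a smoothed idf
df = np.asarray((counts > 0).sum(axis=0)).ravel()
n_items = counts.shape[0]
idf = np.log((1 + n_items) / (1 + df)) + 1.0

# scale each feature column by its idf, down-weighting common features
item_features_tfidf = counts.multiply(idf).tocsr()
```

The resulting matrix can be passed to LightFM in place of the raw counts, so frequent but uninformative features contribute less to each item's summed embedding.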

Would appreciate any advice on this. Thank you!

kientt15vinid avatar Aug 20 '19 03:08 kientt15vinid

I got similar results to yours.

While AUC gives similar results between pure and hybrid models, precision@10 and recall@10 show much better performance on pure CF than Hybrid CF (up to +50%!).

In my case, I'm using just a few item_features gathered from the catalogue (category, item gender, item color, etc.).

The strange thing is that when I use the item representations to calculate item similarities, the hybrid model easily outperforms the pure CF one at that task (the similarities make more sense and produce a higher CTR when deployed online).

Given the latter result, I was also expecting much higher performance on the user-suggestion task.

FrancescoI avatar Nov 26 '19 09:11 FrancescoI

@FrancescoI may I ask what you use to evaluate the performance of the item-similarity model?

kientt15vinid avatar Nov 26 '19 10:11 kientt15vinid

@kientt15vinid, we've A/B tested both models online (a carousel of items "similar" to the main one on the item page), using CTR (click-through rate: number of clicks / number of impressions) as the primary KPI.

FrancescoI avatar Nov 26 '19 11:11 FrancescoI

@FrancescoI In that case, the CTR might be driven by users' desire to explore different variants of an item. I'd suggest using a pure content-based model as a baseline: if the hybrid model outperforms both pure CF and pure content-based, then it is something worth noting.

kientt15vinid avatar Nov 27 '19 02:11 kientt15vinid

@kientt15vinid , yeah, CB was the baseline even before switching to CF!

Switching from CB to CF gave us a tremendous uplift in CTR (up to +47%). I hadn't mentioned it because I was focused on the pure vs. hybrid CF comparison :)

FrancescoI avatar Nov 27 '19 06:11 FrancescoI

We got similar results in our initial experiments: adding item metadata (while keeping the identity matrix from pure CF) leads to worse MRR and P@10 than a pure CF model. We'll keep investigating and report back. It would also be interesting to hear back from the others @FrancescoI , @kientt15vinid

SimonCW avatar Feb 26 '20 15:02 SimonCW

The culprit here might be that the embeddings for all features of a given item are simply summed to get the final item embedding: the model does not seem to be great at learning which features are important and which are not.

In a more flexible formulation you may want to concatenate the embeddings of different features to get your embedding vector. This should make it more straightforward for the model to simply discard some features.

Experimenting with different weights for different types of features might give you a lever to optimize this.
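
To illustrate the sum-vs-concatenate difference in plain numpy (toy embeddings, not LightFM internals):

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 4
# toy embeddings for three features of one item: its own indicator,
# a brand feature, and a possibly-noisy category feature
feat_emb = rng.normal(size=(3, dim))

# summing (the LightFM formulation): a noisy feature shifts every
# dimension of the final item embedding
summed = feat_emb.sum(axis=0)        # shape (4,)

# concatenating keeps each feature in its own block, so a model (or
# regularizer) can zero out an unhelpful block without touching the rest
concatenated = feat_emb.reshape(-1)  # shape (12,)
```

With summation, the only way for the model to "ignore" a feature is to drive that feature's entire embedding toward zero.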

maciejkula avatar Mar 03 '20 04:03 maciejkula

@maciejkula, could you expand your thought on this?

So far in my experiments, I've been using the norm of a feature's vector in the embedded space as a signal that the model is actually learning something useful from that feature.

For instance, item gender (in the fashion industry) is arguably the most important item metadata: it has the greatest vector norm, and it really helps separate mixed-gender items in the final product embedding.

On the contrary, when a new feature's vector norm is small, its marginal contribution to the product embedding is negligible and the feature may be dropped.
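
A minimal sketch of that norm-based check, using a toy stand-in for the per-feature embedding matrix (in LightFM this would be `model.item_embeddings`, which has one row per item feature; the feature names here are made up):

```python
import numpy as np

rng = np.random.default_rng(1)
# stand-in for model.item_embeddings: one row per item feature
# (indicator columns first, then metadata columns)
feature_embeddings = rng.normal(size=(6, 4))
feature_names = ["item_a", "item_b", "item_c", "gender", "color", "brand"]

# L2 norm of each feature vector as a rough importance signal
norms = np.linalg.norm(feature_embeddings, axis=1)
ranked = sorted(zip(feature_names, norms), key=lambda t: -t[1])
```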

Does it make sense to you?

FrancescoI avatar Mar 03 '20 08:03 FrancescoI

> We got similar results in our initial experiments: adding item metadata (while keeping the identity matrix from pure CF) leads to worse MRR and P@10 than a pure CF model. We'll keep investigating and report back. It would also be interesting to hear back from the others @FrancescoI , @kientt15vinid

I'm still struggling to find robust evidence: in my experiments, model performance was very sensitive to small changes in the hyperparameters, and pure CF vs. CF+metadata needed quite different configurations to reach their best performance.

In the end we decided to keep using the metadata: while we haven't proved it to be the best model among all the configurations we tested, we found it the best at producing reliable item similarities over time.

FrancescoI avatar Mar 03 '20 08:03 FrancescoI

@FrancescoI the norm sounds like a very good approximation to feature importance. My hope was that this would work reliably, but I think in practice it isn't always the case. It's possible that L2 regularization is really important for pushing rare features to zero norm so that they don't introduce noise.
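
That intuition can be sketched with a bare weight-decay update: a feature embedding that receives no gradient signal from the data only feels the L2 penalty, so its norm shrinks toward zero (toy numbers, not the actual LightFM update rule):

```python
import numpy as np

emb = np.ones(4)          # embedding of a rare, uninformative feature
lr, alpha = 0.05, 0.1     # learning rate and L2 strength (cf. item_alpha)

initial_norm = np.linalg.norm(emb)
for _ in range(100):
    emb -= lr * alpha * emb   # pure L2 shrinkage, no data gradient

final_norm = np.linalg.norm(emb)
```

A frequently-updated, informative feature would keep receiving data gradients that counteract this shrinkage, so only the rare/noisy features decay away.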

maciejkula avatar Mar 07 '20 20:03 maciejkula

@FrancescoI could you elaborate on how you compute item similarity from the CF + metadata model? I would be interested in doing the same, maybe you can give me some pointers?

riccardopinosio avatar Mar 24 '20 07:03 riccardopinosio

> @FrancescoI could you elaborate on how you compute item similarity from the CF + metadata model? I would be interested in doing the same, maybe you can give me some pointers?

Every item embedding is the sum of its own embedding plus its metadata embeddings. There's a method on the LightFM object instance that performs this calculation automatically.

Then you just need to pick an appropriate distance measure to retrieve the nearest items: if I'm not wrong, the package documentation has a nice example for this use case as well.
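
A sketch of that recipe with toy numpy arrays (in LightFM the combined representations would come from something like `model.get_item_representations(features=item_features)`; treat that call as my assumption and check the docs):

```python
import numpy as np

rng = np.random.default_rng(2)
n_items, dim = 6, 4
# toy combined item representations (item embedding + metadata embeddings)
item_repr = rng.normal(size=(n_items, dim))

# cosine similarity: L2-normalize rows, then take all pairwise dot products
normed = item_repr / np.linalg.norm(item_repr, axis=1, keepdims=True)
sim = normed @ normed.T

query = 0
neighbors = np.argsort(-sim[query])[1:4]  # top-3 items most similar to item 0
```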

FrancescoI avatar Mar 25 '20 09:03 FrancescoI

> @FrancescoI could you elaborate on how you compute item similarity from the CF + metadata model? I would be interested in doing the same, maybe you can give me some pointers?
>
> Every item embedding is the sum of its own embedding plus its metadata embeddings. There's a method on the LightFM object instance that performs this calculation automatically.
>
> Then you just need to pick an appropriate distance measure to retrieve the nearest items: if I'm not wrong, the package documentation has a nice example for this use case as well.

@FrancescoI I'm trying to find the method you're referring to that sums up the metadata embeddings. Could you point me to the function? It would really help me out. Thanks

simongiles1 avatar Apr 05 '21 16:04 simongiles1