
Correct way of creating Item/User features with Dataset class

Open bonobo opened this issue 5 years ago • 8 comments

Hi Maciej and everyone else :),

I am using LightFM in my school project with the Yelp Academic dataset. I've looked at some previous issues, but I don't think any of them describe exactly what I'm looking for (if I'm wrong, sorry for the duplicate).

I want to incorporate item/user features and create them with the Dataset class, but I don't know if I'm doing it right (I have created some and everything seems to work, but I don't know whether it is correct). The Yelp dataset contains various types of feature values: many are just True/False, some are in a given range or continuous (e.g. price range or opening hours), and some are categorical.

Currently I am preparing the features as a collection of (item id, [list of feature names]) tuples.

Let's say I want to create features from the columns price_range (range 1-5), accept_credit_cards (bool), smoking_allowed (bool), and category (str). The prepared collection of tuples would look like:

[
  (item1, [1, False, True, 'bar']),
  (item2, [4, True, False, 'restaurant']),
  (item3, [3, True, True, 'burgers']),
  ...
]

My questions:

  1. Is this way correct or not?
  2. Will the positions of the True/False values above be taken into account (i.e. will they be treated as values of two or more distinct features)? Because I think that when passing all possible feature values to the Dataset's fit method, they will not be.
  3. Should I use the second method described in the docs, (user id, {feature name: feature weight})? But what about categories then?
  4. This one isn't related to my Dataset issue, but what are "sane" parameters when tuning model performance (learning rate, components, epochs, ...)? One of my colleagues told me that he doesn't use more than 40 components or epochs.

Thanks and have a nice day!

bonobo avatar Nov 11 '18 22:11 bonobo

  1. This is partially, but not entirely correct.
  2. The position does not matter. Your features should look similar to the following:
[
    (item1, 'price:1', 'accept_credit_cards:False', 'smoking_allowed:True', 'category:bar'),
]
  4. Start with something that is easy and quick to fit, like 32 or 64 components and 10 epochs, and go from there. You will quickly get an intuition for what works and what does not; follow this up with a fuller hyperparameter search before deploying to production.
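
That fuller search can start as a plain grid. A minimal sketch in pure Python (the value ranges below are illustrative assumptions seeded from the "32 or 64 components, 10 epochs" starting point, and the actual LightFM fitting call is left out so the snippet stays self-contained):

```python
from itertools import product

# Illustrative search space; the specific values are assumptions,
# not recommendations from this thread.
grid = {
    "no_components": [32, 64],
    "learning_rate": [0.01, 0.05],
    "epochs": [10, 20],
}

def grid_candidates(grid):
    """Yield one hyperparameter dict per combination in the grid."""
    keys = list(grid)
    for values in product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))

candidates = list(grid_candidates(grid))
print(len(candidates))  # 8 — each dict would parameterize one model fit
```

Each candidate dict would then be unpacked into the model constructor and the fit call, keeping whichever combination scores best on a validation split.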

Hope this helps!

maciejkula avatar Nov 13 '18 04:11 maciejkula

Thank you for quick response!

Just to be sure, the correct "shape" of the features passed to build_item_features/build_user_features should be as follows (I think you forgot to put the features into a list in your response):

[
    (item1, ['price:1', 'accept_credit_cards:False', 'smoking_allowed:True', 'category:bar']),
    (item2, ['price:4', 'accept_credit_cards:True', 'smoking_allowed:False', 'category:restaurant']),
]
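
For later readers, the conversion from raw attribute records into those tuples can be sketched in plain Python (the column names mirror the example above, and the resulting list is what would be passed to build_item_features; no lightfm import is needed for the conversion itself):

```python
# Hypothetical raw attributes, shaped like the Yelp columns discussed above.
raw_items = {
    "item1": {"price": 1, "accept_credit_cards": False,
              "smoking_allowed": True, "category": "bar"},
    "item2": {"price": 4, "accept_credit_cards": True,
              "smoking_allowed": False, "category": "restaurant"},
}

def to_feature_tuples(items):
    """Turn {item_id: {name: value}} into (item_id, ['name:value', ...]) tuples."""
    return [
        (item_id, [f"{name}:{value}" for name, value in attrs.items()])
        for item_id, attrs in items.items()
    ]

feature_tuples = to_feature_tuples(raw_items)
print(feature_tuples[0])
# ('item1', ['price:1', 'accept_credit_cards:False', 'smoking_allowed:True', 'category:bar'])
```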

Have a nice day!

bonobo avatar Nov 13 '18 11:11 bonobo

@bonobo It's been a while, but I found myself in the exact same situation as yours. What eventually ended up working for you?

Viveckh avatar May 02 '20 21:05 Viveckh

@bonobo so I have to pass ['accept_credit_cards:True', 'smoking_allowed:False', 'category:restaurant'] as item_features to my dataset.fit method too?

ahmadalli avatar Jun 17 '20 21:06 ahmadalli

@bonobo so I have to pass ['accept_credit_cards:True', 'smoking_allowed:False', 'category:restaurant'] as item_features to my dataset.fit method too?

@ahmadalli Yup, I think that's correct (at least that's what works for me). All the combinations should be added to the item_features list. This is how my item_features look for 3 boolean features f1, f2, f3, and one additional location feature (which can take two values for now, UK and USA):

item_features = ['f1:True', 'f1:False', 'f2:True', 'f2:False', 'f3:True', 'f3:False', 'loc:UK', 'loc:USA']

I have explained it in more detail using a worked example in this article: https://towardsdatascience.com/how-i-would-explain-building-lightfm-hybrid-recommenders-to-a-5-year-old-b6ee18571309
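
Typing that vocabulary by hand gets tedious as features grow; it can be generated instead. A small sketch that enumerates every name:value combination (the helper name is made up for illustration):

```python
def all_feature_names(feature_values):
    """Build every 'name:value' string for the item_features vocabulary."""
    return [
        f"{name}:{value}"
        for name, values in feature_values.items()
        for value in values
    ]

# The three boolean features and the location feature from the example above.
item_feature_values = {
    "f1": [True, False],
    "f2": [True, False],
    "f3": [True, False],
    "loc": ["UK", "USA"],
}

item_features = all_feature_names(item_feature_values)
print(item_features)
# ['f1:True', 'f1:False', 'f2:True', 'f2:False', 'f3:True', 'f3:False', 'loc:UK', 'loc:USA']
```

The resulting list is what would be handed to the Dataset's fit method as the item_features vocabulary.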

V-Sher avatar Jul 04 '20 14:07 V-Sher

Ah, I spent 6 evenings trying to figure out what was wrong with the format I used 😁 The documentation says: (item id, {feature name: feature weight})

@maciejkula , could you kindly update the documentation? https://making.lyst.com/lightfm/docs/lightfm.data.html

Big thank you @maciejkula for developing this package!

Konstantin-Orlovskiy avatar Aug 13 '20 19:08 Konstantin-Orlovskiy

My item_feature has only one column, with 74 unique values.

 print(item_feature['product_category_name'].unique().astype(str))

>>> ['62' '3' '32' '9' '73' '45' '26' '54' '28' '12' '13' '25' '44' '11' '50'
 '40' '55' '8' '30' '34' '70' '59' '33' '61' '16' '66' '21' '63' '31' '0'
 '72' '57' '68' '19' '20' '48' '22' '39' '38' '53' '43' '71' '23' '49'
 '29' '5' '10' '51' '46' '24' '36' '14' '7' '2' '58' '1' '69' '47' '64'
 '35' '6' '37' '27' '4' '60' '56' '18' '42' '41' '15' '65' '67' '52' '17']

I thought the shape would be 32951x74

<32951x74 sparse matrix of type '<class 'numpy.int64'>'
	with 32951 stored elements in Compressed Sparse Row format>

But dataset.build_item_features produced a 32951x33025 matrix:

# Register all users, items, and the 74 distinct category values.
dataset.fit(ratings.customer_id, ratings.product_id,
            item_features=item_feature['product_category_name'].unique().astype(str))

# Pair each product_id with its (stringified) feature values.
feature_cols = item_feature.columns.difference(['product_id'])
a = list(zip(item_feature['product_id'].values,
             item_feature.loc[:, feature_cols].values))
a = [(product_id, [str(v) for v in values]) for product_id, values in a]

item_features = dataset.build_item_features(a)
print(repr(item_features))

>>> <32951x33025 sparse matrix of type '<class 'numpy.float32'>'
	with 65902 stored elements in Compressed Sparse Row format>
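
One likely explanation for the extra columns (not confirmed in this thread, but consistent with lightfm.data.Dataset's default of adding per-item identity features): the feature matrix width becomes n_items + n_distinct_features. The arithmetic matches the observed shape:

```python
# Observed numbers from the output above.
n_items = 32951
n_category_features = 74

# With item identity features on (the Dataset default), every item also
# gets its own indicator column, so the matrix is n_items + n_features wide.
width_with_identity = n_items + n_category_features
print(width_with_identity)  # 33025 — matches the observed column count

# Likewise, 2 stored entries per item (identity + category):
stored_elements = 2 * n_items
print(stored_elements)  # 65902 — matches the stored-element count
```

Constructing the dataset with item_identity_features=False should yield the expected 32951x74 matrix, though the model then loses per-item indicator features.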

eromoe avatar Feb 09 '22 14:02 eromoe

item_features = ['f1:True', 'f1:False', 'f2:True', 'f2:False', 'f3:True', 'f3:False', 'loc:UK', 'loc:USA']

Hi, in this case the location feature is associated with the item? For example, an item with ID 19392 and location UK, right?

I have explained it in more detail using a worked example in this article: https://towardsdatascience.com/how-i-would-explain-building-lightfm-hybrid-recommenders-to-a-5-year-old-b6ee18571309

In your article you talk about user_features; I have a question for you. I don't have user features, but I do have item features: you said that I can use the same process with item_features? My item features are price, category, brand, etc. What should I expect? A more precise model that considers the item_features during the training phase?

Michele971 avatar Aug 28 '22 10:08 Michele971