
Error in build_user_features

sbarbone opened this issue 5 years ago · 13 comments

Hello, I'm trying to create a user_features matrix but I can't get it to work.

data = Dataset()
data.fit(users.user_id.unique(), items.movie_id.unique())

ad_subset = users[['sex_F', 'age', 'occupation_administrator']]
ad_list = [list(x) for x in ad_subset.values]
ad_tuple = tuple(zip(users['user_id'], ad_list))

user_features = data.build_user_features(ad_tuple)

I execute the last line but I get this error:

File "C:\ProgramData\Anaconda3\lib\site-packages\lightfm\data.py", line 101, in _process_features "Feature {} not in eature mapping. " "Call fit first.".format(feature)

ValueError: Feature 0 not in eature mapping. Call fit first.

Can anybody help please? Thanks!

sbarbone avatar Mar 14 '19 15:03 sbarbone

Hi @sbarbone,

try adding your features to the Dataset.fit() call! See the method description for dataset.fit and this example.
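
For illustration, a minimal sketch of that fix applied to the original post. It assumes the users and items DataFrames from above; the "column:value" prefixing is my own convention (not required by LightFM) so that, say, a 1 in sex_F and a 1 in occupation_administrator stay distinct features:

from lightfm.data import Dataset

feature_cols = ['sex_F', 'age', 'occupation_administrator']

def row_features(row):
    # Build a "column:value" feature name for every cell, so every
    # feature build_user_features will see is known to fit() in advance.
    return ['{}:{}'.format(col, row[col]) for col in feature_cols]

all_features = set()
for _, row in users.iterrows():
    all_features.update(row_features(row))

data = Dataset()
data.fit(
    users.user_id.unique(),
    items.movie_id.unique(),
    user_features=all_features,
)

user_features = data.build_user_features(
    (row['user_id'], row_features(row)) for _, row in users.iterrows()
)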

SimonCW avatar Mar 18 '19 20:03 SimonCW

The user_features and item_features parameters aren't well defined in the documentation, and I can't find any example that uses them. I'm trying to fit with user_features and item_features, but it doesn't work.

What if I have a list of user_features and I want to use them all, not just one as in your example?

cdash04 avatar Oct 03 '19 17:10 cdash04

I think you should use a comprehension like this one: [(x['item_id'], [x['category']]) for x in df.to_dict('records')]

Note that the second value of each tuple above is a list, so the result looks like this:

[
  (1, ['horror']),
  (2, ['comedy']),
         ...
]

If your data is a DataFrame, the comprehension would look like this: [(x[0], [x[1]]) for x in df[['user_id', 'category_id']].values]
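
As a self-contained toy version of this pattern (the DataFrame values here are made up), note that the category names also have to be passed to fit() as item_features:

import pandas as pd
from lightfm.data import Dataset

# Made-up toy data so the pattern runs end to end.
df = pd.DataFrame({
    'item_id': [1, 2, 3],
    'category': ['horror', 'comedy', 'horror'],
})

dataset = Dataset()
dataset.fit(
    users=[10, 20],
    items=df['item_id'],
    item_features=df['category'].unique(),
)

# Each tuple is (item id, [list of feature names]).
item_features = dataset.build_item_features(
    (x[0], [x[1]]) for x in df[['item_id', 'category']].values
)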

igorkf avatar Nov 14 '20 18:11 igorkf

I solved @cdash04's question as follows:

item_features = ['Action', 'Adventure', 'Animation', ... snip ... ]
items_features = (
    (1, ('Animation', "Children's", 'Comedy')),
    (2, ('Adventure', "Children's", 'Fantasy')),
    ... snip ...
)
dataset = Dataset()
dataset.fit(
    ... snip ...
    items=tuple(df_movies['movieId']),
    item_features=item_features,
)
item_features_list = dataset.build_item_features(items_features)
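
If it helps, the sparse matrix returned by build_item_features is what later goes into the model; a sketch, assuming the (user, item) pairs below correspond to ids that were passed to fit():

from lightfm import LightFM

# build_interactions returns (interactions, weights); the interactions
# matrix plus the item-feature matrix go straight into the model.
(interactions, weights) = dataset.build_interactions(((1, 1), (2, 2)))

model = LightFM(loss='warp')
model.fit(interactions, item_features=item_features_list, epochs=10)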

Now, my question is whether the feature has to be a categorical variable.

I would like to add an 8-dimensional vector as an item feature. For example:

items_features = (
    (1, (0.0036, 0.0006, 0.050, 0.02, 0.009, 0.0022, 0.0955, 0.8179)),
    ... snip ...
)
item_features_list = dataset.build_item_features(items_features)

However, build_item_features raises an error. What can I do?

ValueError: Feature (0.0036, 0.0006, 0.050, 0.02, 0.009, 0.0022, 0.0955, 0.8179) not in eature mapping. Call fit first.

Each element of the vector is a real number, not a category. Can't I use such a feature? If it can be used, how would item_features need to be defined?

I would be happy if @SimonCW and @igorkf could answer these questions.

its-ogawa avatar Nov 17 '20 10:11 its-ogawa

> The user_features and item_features parameters aren't well defined in the documentation, and I can't find any example that uses them. I'm trying to fit with user_features and item_features, but it doesn't work.
>
> What if I have a list of user_features and I want to use them all, not just one as in your example?

Here is a link to the relevant documentation; in particular you will be interested in build_item_features: https://making.lyst.com/lightfm/docs/examples/dataset.html

MchlUh avatar Feb 15 '21 17:02 MchlUh

> Here is a link to the relevant documentation; in particular you will be interested in build_item_features: https://making.lyst.com/lightfm/docs/examples/dataset.html

The official documentation is very well organized. I refer to it as well.

By the way, could @MchlUh answer my question? I want to use a vector of real numbers as a feature.

I'm hoping for some good ideas.

its-ogawa avatar Feb 16 '21 01:02 its-ogawa

> > Here is a link to the relevant documentation; in particular you will be interested in build_item_features: https://making.lyst.com/lightfm/docs/examples/dataset.html
>
> The official documentation is very well organized. I refer to it as well.
>
> By the way, could @MchlUh answer my question? I want to use a vector of real numbers as a feature.
>
> I'm hoping for some good ideas.

If I understand correctly, in your example (0.0036, 0.0006, 0.050, 0.02, 0.009, 0.0022, 0.0955, 0.8179) is the item feature associated with item 1. You need to input it as a list, so I would suggest trying:

items_features = (
    (1, [0.0036, 0.0006, 0.050, 0.02, 0.009, 0.0022, 0.0955, 0.8179]),
    ... snip ...
)
item_features_list = dataset.build_item_features(items_features)

About your point that it is a real-valued vector of features: here the vector is 8-dimensional. The model will interpret it as 8 separate features, and I guess you will get the results you are hoping for.

MchlUh avatar Feb 16 '21 09:02 MchlUh

@MchlUh Thank you for your comments.

I tried changing from tuples to lists, but unfortunately I did not get the expected result (the outcome is the same).

However, I think my explanation was poor. My problem is with the feature mapping, which the original poster also mentioned.

items_features = (
    (1, [0.0036, 0.0006, 0.050, 0.02, 0.009, 0.0022, 0.0955, 0.8179]),
    ... snip ...
)
item_features_list = dataset.build_item_features(items_features)

In the same example as above, build_item_features produces the following error:

ValueError: Feature 0.0036 not in feature mapping. Call fit first.

This means that you should use the fit method of the dataset to set the appropriate item_features.

cf https://making.lyst.com/lightfm/docs/examples/dataset.html

If the item feature is a categorical variable, you can register it as an item_feature in the dataset. For example:

dataset.fit(
    ... snip ...
    items=tuple(df['id']),
    item_features=['Group1', 'Group2', 'Group3'],
)

However, what I want to specify is a quantitative variable. If it is a quantitative variable, how should I fit it into the dataset?
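
For reference, the build_item_features documentation also mentions an (item id, {feature name: feature weight}) input form, which may be one way to keep real-valued weights on named features. A minimal sketch under that reading, with made-up dimension names and placeholder ids:

from lightfm.data import Dataset

# Made-up names for the 8 dimensions; the real values become weights.
dim_names = ['dim_{}'.format(i) for i in range(8)]

dataset = Dataset()
dataset.fit(
    users=[1],
    items=[1],
    item_features=dim_names,
)

# Each item carries a weight per named dimension instead of the raw
# value being used as a feature name.
items_features = [
    (1, dict(zip(dim_names,
                 [0.0036, 0.0006, 0.050, 0.02, 0.009, 0.0022, 0.0955, 0.8179]))),
]
item_features_list = dataset.build_item_features(items_features)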

its-ogawa avatar Feb 17 '21 10:02 its-ogawa

@its-ogawa Thanks for the details, I understand what you are trying to do! I haven't tried it myself, but for now I can make this simple suggestion: could you bin your quantitative variable to turn it into a categorical one? For example with pandas.cut, as in the sketch below.
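
A tiny sketch of that idea; the column name and values are made up, and pd.qcut would give quartile-based bins instead of equal-width ones:

import pandas as pd

# Made-up scores; pd.cut gives equal-width bins, while
# pd.qcut(df['score'], q=4, labels=...) would bin by quartiles.
df = pd.DataFrame({'score': [0.1, 0.4, 0.35, 0.8, 0.95]})
df['score_bin'] = pd.cut(df['score'], bins=4, labels=['Q1', 'Q2', 'Q3', 'Q4'])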

MchlUh avatar Feb 17 '21 11:02 MchlUh

@MchlUh

Thank you for your comment.

I do believe it is a good idea!

In fact, I computed the quartiles and assigned each value to one of ['Q1', 'Q2', 'Q3', 'Q4'] depending on which range the quantitative variable falls into.

By doing so, I can fit the item_features into a dataset.

dataset = Dataset()
dataset.fit(
    ... snip ...
    items=tuple(df['id']),
    item_features=['Q1', 'Q2', 'Q3', 'Q4'],
)

items_features = (
    (1, ['Q1', 'Q1', 'Q3', 'Q2', 'Q1', 'Q1', 'Q3', 'Q4']),
    ... snip ...
)
item_features_list = dataset.build_item_features(items_features)

By doing this, I have solved the feature-mapping problem. Thank you very much.


This may sound very strange, but I feel that converting real values into four categorical variables reduces the expressive power of the features.

I was wondering whether it is possible to perform the analysis while maintaining that expressive power, that is, to use quantitative variables directly as feature values. Is this an impossible problem? For example, could it be solved by dividing the data into finer categories instead of quartiles?

its-ogawa avatar Feb 18 '21 02:02 its-ogawa

@its-ogawa I'm quite curious: what kind of feature is it? Could you provide a similar example to help me contextualize?

MchlUh avatar Feb 18 '21 10:02 MchlUh

@its-ogawa, I think there is a trade-off when binning quantitative variables. Sure, you lose some expressive power, but also think about it this way: if you take each original value as a feature, only very few users, or maybe only one user, will have interacted with each feature. In turn, there isn't much "collaborative information" from which the model can learn.

The other extreme, where almost all interactions fall into one or two categories, is also not very helpful. I would recommend experimenting with this trade-off a bit, along the lines of the sketch below.
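
One way to probe that trade-off; a sketch only, where build_binned_dataset is a hypothetical helper that rebuilds the train/test interactions and item-feature matrix with n_bins bins:

from lightfm import LightFM
from lightfm.evaluation import auc_score

# Hypothetical helper: bins the quantitative features into n_bins
# categories and returns train/test interactions plus item features.
for n_bins in (4, 8, 16, 32):
    train, test, item_features = build_binned_dataset(n_bins)
    model = LightFM(loss='warp')
    model.fit(train, item_features=item_features, epochs=10)
    print(n_bins, auc_score(model, test, item_features=item_features).mean())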

SimonCW avatar Feb 20 '21 12:02 SimonCW

@MchlUh @SimonCW Thank you for your interest in my question.

@MchlUh The 8-dimensional vector here is the percentage of a user's classification into 8 characteristics (assuming such characteristics exist). It is very interesting to me that each user's identity shows up, subtly or noticeably, across the 8 features.

I was thinking of using this 8-dimensional vector to measure the similarity between users, and came up with the idea of using it as a LightFM feature. The result is shown above.

My concern is that it might be less expressive, simply because it reduces real values to four categories.

@SimonCW Thanks for the advice.

I think what you say is very true.

When making recommendations using "collaborative information", the implication is to increase the number of users interacting with each feature. It makes sense to me that binning increases that number, and it is intuitive to assume that users who fall into the same category are similar.

However, this is not simple collaborative filtering. Can we draw on LightFM's power to learn features? Is this a good approach to the problem? That is the question I am asking myself.

its-ogawa avatar Feb 22 '21 01:02 its-ogawa