
Question: How best to prepare a dataset for sequential recommendation.

Open ydennisy opened this issue 4 years ago • 8 comments

Me again :)

I would like to understand how best to prepare data for a sequential recommender similar to GRU4Rec.

Our data initially can be viewed as just a bunch of sequences of variable length, one sequence per user:

seqs = [
  [a, b, a, c, d, e, d, g],
  ...
]

Now I assume we want to prepare a y or target, so the simplest approach (as I see it) is to:

  • slice off the last element of each item in seqs as the y
  • pad the seqs to a max length
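
A minimal sketch of these two steps in plain Python. It assumes items are already mapped to integer ids starting at 1, that 0 is free to use as the padding id, and that padding goes on the left; `last_item_split` and `max_len` are hypothetical names used only for illustration.

```python
PAD = 0  # assumed padding id; real item ids are assumed to start at 1


def last_item_split(seqs, max_len):
    """Use the last item of each sequence as y and left-pad the rest to max_len."""
    xs, ys = [], []
    for seq in seqs:
        if len(seq) < 2:
            continue  # need at least one input item and one target
        history, target = seq[:-1], seq[-1]
        history = history[-max_len:]  # keep only the most recent items
        xs.append([PAD] * (max_len - len(history)) + history)
        ys.append(target)
    return xs, ys


seqs = [[1, 2, 1, 3, 4, 5, 4, 7], [2, 3]]
xs, ys = last_item_split(seqs, max_len=5)
print(xs)  # [[1, 3, 4, 5, 4], [0, 0, 0, 0, 2]]
print(ys)  # [7, 3]
```

Truncating to the most recent `max_len` items is a design choice of this sketch; keeping the oldest items instead would be equally easy.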

But then there are many ways in which we could make different versions of this dataset, for example from seqs[0], we could actually make many rows by creating copies incrementally predicting the next item.
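
The incremental version could look roughly like this: every prefix of a sequence becomes one row, with the item that follows it as the target. `expand_prefixes` is a hypothetical helper, not something from TFRS.

```python
def expand_prefixes(seq):
    """Return one (prefix, next_item) row for every position after the first."""
    return [(seq[:i], seq[i]) for i in range(1, len(seq))]


rows = expand_prefixes(["a", "b", "a", "c"])
for prefix, target in rows:
    print(prefix, "->", target)
# ['a'] -> b
# ['a', 'b'] -> a
# ['a', 'b', 'a'] -> c
```

A sequence of length n yields n - 1 rows, so long sequences contribute many more training examples than short ones.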

So my question is:

  • is there any theoretical best practice on how to prep such a dataset?
  • are there any helpers in terms of implementation to make this easier / cleaner within TFRS or a TF data pipeline?
  • edit: is there also a specific way to prepare data and formulate the problem in order to predict not the next sequence value but further ahead?
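
For the "further ahead" case, one possible framing (an assumption on my part, not an established recipe) is to keep the same prefixes but take the target `horizon` steps after the end of the prefix instead of the very next item. `shifted_targets` and `horizon` are hypothetical names.

```python
def shifted_targets(seq, horizon=1):
    """Pair each prefix with the item `horizon` steps after its last element.

    horizon=1 reduces to ordinary next-item prediction.
    """
    if horizon < 1:
        raise ValueError("horizon must be >= 1")
    return [
        (seq[:i], seq[i + horizon - 1])
        for i in range(1, len(seq) - horizon + 1)
    ]


pairs = shifted_targets(["a", "b", "c", "d", "e"], horizon=2)
for prefix, target in pairs:
    print(prefix, "->", target)
# ['a'] -> c
# ['a', 'b'] -> d
# ['a', 'b', 'c'] -> e
```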

Thanks in advance!

ydennisy avatar Jul 28 '21 16:07 ydennisy

@maciejkula any feedback here? 😄

ydennisy avatar Aug 06 '21 09:08 ydennisy

These are good questions, I think all the suggestions you make are very sensible.

This is not really a TFRS-specific problem, though - perhaps there is something in the wider TF ecosystem that provides a guide or some handy functionality?

maciejkula avatar Aug 06 '21 16:08 maciejkula

Hmm, well, fair enough, but it does relate to how TFRS implements recommendation, or at least to how the tutorials do. For example, most other approaches on the net have a dense layer on top with a softmax to give a probability distribution over the possible next items in a sequence.

Here you use a two tower approach, with cos sim between embeddings - I think this could mean a slightly different approach as to how best to prep sequential data for this type of system.

WDYT @maciejkula I think even a few rough suggestions would be helpful for others in the community.

ydennisy avatar Aug 06 '21 18:08 ydennisy

@ydennisy You might find this helpful: ondevice_recommendation.ipynb (section "Try out data preparation"). Actually, I'm trying to implement this technique in a TFRS model myself.

govorec avatar Aug 21 '21 21:08 govorec

Thanks @govorec looking at this, they have simply taken the last movie rated as the predicted item for each user sequence.

Do you not think it could be a good idea to break a user sequence into chunks and use it multiple times?

How is your implementation coming along?

ydennisy avatar Sep 13 '21 16:09 ydennisy

We have implemented a similar logic by selecting a sliding window of (up to) N items of the sequence (left padded with 0s) as input to a GRU and the next item as y. So for N=5, and a sequence of [a,b,c,d,e,f,g,h] it would be:

[0,0,0,0,a] -> b
[0,0,0,a,b] -> c
[0,0,a,b,c] -> d
[0,a,b,c,d] -> e
[a,b,c,d,e] -> f
[b,c,d,e,f] -> g
[c,d,e,f,g] -> h
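
A small sketch that reproduces the rows above in plain Python. `sliding_windows` is a hypothetical helper written for illustration, not the code we actually run, and it assumes 0 is reserved as the padding value.

```python
def sliding_windows(seq, n, pad=0):
    """Build (window, next_item) rows: up to n previous items, left-padded to length n."""
    rows = []
    for i in range(1, len(seq)):
        window = seq[max(0, i - n):i]  # at most the n items before position i
        padded = [pad] * (n - len(window)) + window
        rows.append((padded, seq[i]))
    return rows


rows = sliding_windows(list("abcdefgh"), n=5)
for window, target in rows:
    print(window, "->", target)
```

With `n=5` this yields seven rows, from `[0, 0, 0, 0, 'a'] -> b` through `['c', 'd', 'e', 'f', 'g'] -> h`.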

We tried also keeping the variable length of the vectors:

[0,0,0,0,0,0,a] -> b
[0,0,0,0,0,a,b] -> c
[0,0,0,0,a,b,c] -> d
[0,0,0,a,b,c,d] -> e
[0,0,a,b,c,d,e] -> f
[0,a,b,c,d,e,f] -> g
[a,b,c,d,e,f,g] -> h

but some of the sequences were far too long, so most of the padded vectors ended up too sparse.

The selection of N depends on the use case. In our case, we also use users' ids, so we need the sequence to model the user's behaviour between the model's training intervals. My intuition tells me that their behaviour before the last model's training has been "captured" through their 'user_id'. (Please, anyone with the theoretical background, confirm or reject this assumption).

Also, I think that N may be tuned like a hyperparameter. I haven't tried it, though.

YannisPap avatar Sep 27 '21 15:09 YannisPap

Hi @ydennisy, I have the same doubt you mentioned in your comments. What is the best approach you found for dealing with sequential data? Did you choose the last item of the sequence as the label, or did you divide the sequence into chunks?

I would be happy to know the best approach to sequential recommendation.

I am looking to build a recommender that uses user interactions to predict the next item in a sequence, so it would help a lot if you could walk me through some data preparation steps.

Thanks

karndeepsingh avatar Sep 30 '23 13:09 karndeepsingh