
Data Splitting Strategies besides Random Split

Open kennethleungty opened this issue 3 years ago • 2 comments

Hi all,

There are numerous ways besides random split in which the interactions dataset can be split, such as temporal split, user split etc. (See https://arxiv.org/pdf/2007.13237.pdf)

So far, it seems only the random split is available as part of the lightfm.cross_validation module.

Are there any plans to add these different splitting methods into the package?
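For context, the random interaction split (the strategy lightfm currently offers) can be sketched with plain scipy, without the library; `random_split` below is a hypothetical helper, not lightfm API:

```python
import numpy as np
from scipy import sparse

def random_split(interactions, test_fraction=0.2, seed=0):
    """Randomly assign each stored interaction to train or test."""
    coo = interactions.tocoo()
    rng = np.random.default_rng(seed)
    # Each nonzero entry independently lands in the test set
    # with probability test_fraction.
    mask = rng.random(coo.nnz) < test_fraction
    train = sparse.coo_matrix(
        (coo.data[~mask], (coo.row[~mask], coo.col[~mask])), shape=coo.shape
    ).tocsr()
    test = sparse.coo_matrix(
        (coo.data[mask], (coo.row[mask], coo.col[mask])), shape=coo.shape
    ).tocsr()
    return train, test

interactions = sparse.random(50, 30, density=0.2, format="csr", random_state=42)
train, test = random_split(interactions, test_fraction=0.25)
assert train.multiply(test).nnz == 0           # disjoint sets
assert (train + test).nnz == interactions.nnz  # no interactions lost
```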

kennethleungty avatar Sep 17 '21 08:09 kennethleungty

I'm using a customized temporal user split, which is not provided by the library. It takes the last n interactions per user and moves them to the test set. As the paper you mentioned notes, this is not the best strategy, since it can leak popularity information, but it is still better than the random split currently provided.

One can simply change the function to perform a global user split instead, which is the recommended approach.

import numpy as np
from scipy import sparse

def train_test_split(interactions, split_count, fraction=None):
    """
    Perform a temporal user split.

    Params
    ------
    interactions : scipy.sparse matrix
        Interactions between users and items.
    split_count : int
        Number of user-item interactions per user to move
        from the training to the test set.
    fraction : float, optional
        Fraction of users whose last interactions are split
        off into the test set. If None, all users are
        considered.

    Returns
    -------
    train, test : csr matrices.
    """
    # Note: likely not the fastest way to do things below.
    interactions = interactions.tocsr()
    train = interactions.tolil()
    test = sparse.lil_matrix(interactions.shape)

    if fraction:
        # Only users with enough interactions are eligible.
        counts = np.bincount(interactions.tocoo().row,
                             minlength=interactions.shape[0])
        eligible = np.where(counts >= split_count * 2)[0]
        try:
            user_index = np.random.choice(
                eligible,
                replace=False,
                size=int(np.floor(fraction * interactions.shape[0]))
            ).tolist()
        except ValueError:
            print(('Not enough users with >= {} '
                   'interactions for fraction of {}')
                  .format(split_count * 2, fraction))
            raise
    else:
        user_index = range(interactions.shape[0])

    for user in user_index:
        row_items = interactions.getrow(user).indices
        if row_items.size < split_count:
            # Skip users with too few interactions to split.
            continue
        # Note: this samples interactions at random per user; for a
        # strict "last n" temporal split, take the final split_count
        # interactions by timestamp instead.
        test_ratings = np.random.choice(row_items,
                                        size=split_count,
                                        replace=False)
        train[user, test_ratings] = 0.
        # These are just 1.0 right now
        test[user, test_ratings] = interactions[user, test_ratings]

    # Train and test are truly disjoint
    assert train.multiply(test).nnz == 0
    return train.tocsr(), test.tocsr()
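The global user split mentioned above can be sketched as follows: a fraction of users is held out entirely, and all of their interactions move to the test set. `global_user_split` is a hypothetical helper, not part of lightfm:

```python
import numpy as np
from scipy import sparse

def global_user_split(interactions, test_fraction=0.2, seed=0):
    """Hold out a fraction of users entirely: every interaction
    of a held-out user goes to the test set."""
    interactions = interactions.tocsr()
    n_users = interactions.shape[0]
    rng = np.random.default_rng(seed)
    test_users = rng.choice(n_users,
                            size=int(np.floor(test_fraction * n_users)),
                            replace=False)
    train = interactions.tolil()
    test = sparse.lil_matrix(interactions.shape)
    for user in test_users:
        cols = interactions.getrow(user).indices
        if cols.size == 0:
            continue  # user has no interactions to move
        test[user, cols] = interactions[user, cols]
        train[user, cols] = 0.
    return train.tocsr(), test.tocsr()

interactions = sparse.random(100, 40, density=0.1, format="csr", random_state=1)
train, test = global_user_split(interactions, test_fraction=0.2)
assert train.multiply(test).nnz == 0
assert train.nnz + test.nnz == interactions.nnz
```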

henriqueluzz avatar Jan 07 '22 13:01 henriqueluzz

I, for one, would really like a temporal train/test split, as this seems to me by far the best way to split if you have the data for it. To make one, would it simply be a case of:

  • test set is a sparse matrix of all the data
  • training set is a sparse matrix of all the data as of time t, with everything after t set to whatever value represents a non-interaction? So if my data is 1 for "customer bought product" and 0 for "customer has never bought product", then set all data after t to 0.
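A minimal sketch of that idea, assuming each interaction comes with an explicit timestamp (`temporal_split` is a hypothetical helper). Unlike the description above, it keeps train and test disjoint, which avoids re-scoring training interactions at evaluation time:

```python
import numpy as np
from scipy import sparse

def temporal_split(rows, cols, times, shape, t):
    """Split interactions at time t: everything up to and including
    t goes to train; interactions after t form the test set."""
    rows, cols, times = map(np.asarray, (rows, cols, times))
    before = times <= t
    train = sparse.coo_matrix(
        (np.ones(before.sum()), (rows[before], cols[before])), shape=shape
    ).tocsr()
    test = sparse.coo_matrix(
        (np.ones((~before).sum()), (rows[~before], cols[~before])), shape=shape
    ).tocsr()
    return train, test

# Toy log of (user, item, timestamp) purchase events.
rows  = [0, 0, 1, 2, 2]
cols  = [1, 2, 0, 1, 3]
times = [10, 25, 12, 30, 5]
train, test = temporal_split(rows, cols, times, shape=(3, 4), t=20)
assert train.nnz == 3 and test.nnz == 2  # events at t=10, 12, 5 vs. 25, 30
assert train.multiply(test).nnz == 0
```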

Richie-Peak avatar Aug 15 '22 14:08 Richie-Peak