lightfm
Data Splitting Strategies besides Random Split
Hi all,
There are numerous ways besides a random split in which the interactions dataset can be split, such as a temporal split, a user split, etc. (see https://arxiv.org/pdf/2007.13237.pdf).
So far, it seems that only the random split is available as part of the lightfm.cross_validation module.
Are there any plans to add these different splitting methods into the package?
I'm using a customized temporal user split, which is not provided by the library. It takes the last n interactions per user and moves them to the test set. As the paper you mentioned points out, this is not the best strategy, since it can leak popularity information, but it is still better than the random split the library currently provides.
One can simply change the function to work as a global user split, which is the recommended approach.
```python
import numpy as np
from scipy import sparse


def train_test_split(interactions, split_count, fraction=None):
    """
    Perform a temporal user split.

    Params
    ------
    interactions : scipy.sparse matrix
        Interactions between users and items.
    split_count : int
        Number of user-item interactions per user to move
        from the training to the test set.
    fraction : float
        Fraction of users to split off some of their last
        interactions into the test set. If None, then all
        users are considered.

    returns: train, test csr matrices.
    """
    # Note: likely not the fastest way to do things below.
    train = interactions.copy().tocoo()
    test = sparse.lil_matrix(train.shape)

    if fraction:
        try:
            # Only users with at least 2 * split_count interactions
            # are eligible, so every split user keeps training data.
            user_index = np.random.choice(
                np.where(np.bincount(train.row) >= split_count * 2)[0],
                replace=False,
                size=int(np.floor(fraction * train.shape[0]))
            ).tolist()
        except ValueError:
            print(('Not enough users with >= {} '
                   'interactions for fraction of {}')
                  .format(split_count * 2, fraction))
            raise
    else:
        user_index = range(train.shape[0])

    train = train.tolil()

    for user in user_index:
        test_ratings = np.random.choice(interactions.getrow(user).indices,
                                        size=split_count,
                                        replace=False)
        train[user, test_ratings] = 0.
        # These are just 1.0 right now
        test[user, test_ratings] = interactions[user, test_ratings]

    # Test and training are truly disjoint
    assert train.multiply(test).nnz == 0
    return train.tocsr(), test.tocsr()
```
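For reference, the global user split mentioned above could be sketched like this. It moves *all* interactions of a random fraction of users into the test set, so held-out users are completely unseen during training. This is only a minimal illustration (the function name and toy demo at the bottom are my own, not part of lightfm):

```python
import numpy as np
from scipy import sparse


def global_user_split(interactions, test_fraction=0.2, seed=0):
    """Move all interactions of a random fraction of users to the test
    set; the remaining users' interactions form the training set."""
    rng = np.random.default_rng(seed)
    n_users = interactions.shape[0]
    n_test = int(np.floor(test_fraction * n_users))
    test_users = rng.choice(n_users, size=n_test, replace=False)

    # Filter the raw (row, col, data) triples by row membership.
    coo = interactions.tocoo()
    in_test = np.isin(coo.row, test_users)

    test = sparse.csr_matrix(
        (coo.data[in_test], (coo.row[in_test], coo.col[in_test])),
        shape=coo.shape)
    train = sparse.csr_matrix(
        (coo.data[~in_test], (coo.row[~in_test], coo.col[~in_test])),
        shape=coo.shape)

    # Train and test are disjoint by construction.
    assert train.multiply(test).nnz == 0
    return train, test


# Toy demo: 10 users, 8 items, roughly 40% density.
rng = np.random.default_rng(1)
interactions = sparse.csr_matrix((rng.random((10, 8)) < 0.4).astype(float))
train, test = global_user_split(interactions, test_fraction=0.3, seed=0)
```

Note that with this scheme the held-out users have no training rows at all, so it only makes sense for models that can score cold-start users (e.g. via user features).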
I, for one, would really like a temporal train/test split, as this seems to me to be by far the best way to split if you have the data for it. To make one, would it simply be a case of:
- the test set is a sparse matrix of all the data
- the training set is a sparse matrix of all the data as of time t, with all data after t set to x, where x is whatever a non-interaction is? So if my data is 1 for "customer bought product" and 0 for "customer has never bought product", then set data > t to 0.
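A minimal sketch of exactly that idea, assuming the raw interaction log is available as (user, item, timestamp) triples, since the timestamps cannot be recovered from the interactions matrix alone. The function name, argument names, and toy log below are illustrative, not a lightfm API:

```python
import numpy as np
from scipy import sparse


def temporal_split(users, items, timestamps, shape, t):
    """Sketch of the scheme described above: the test matrix holds all
    the data, while the training matrix keeps only interactions observed
    at or before time t (everything later is set to 0, i.e. treated as a
    non-interaction)."""
    users = np.asarray(users)
    items = np.asarray(items)
    timestamps = np.asarray(timestamps)

    before = timestamps <= t
    train = sparse.csr_matrix(
        (np.ones(before.sum()), (users[before], items[before])),
        shape=shape)
    test = sparse.csr_matrix(
        (np.ones(users.size), (users, items)),
        shape=shape)
    return train, test


# Toy log: three users, timestamps in arbitrary units.
users = [0, 0, 1, 1, 2]
items = [0, 1, 0, 2, 1]
stamps = [1, 5, 2, 9, 3]
train, test = temporal_split(users, items, stamps, shape=(3, 3), t=4)
```

Unlike the per-user split above, train and test overlap here (train is a subset of test), so when evaluating you would want to exclude training interactions from the ranking, e.g. via the train_interactions argument of lightfm's evaluation functions.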