NotImplementedError: Data has duplicate values

Open 99sbr opened this issue 3 years ago • 2 comments

data_model = ItemColdStartData(
    training_data,
    *training_data.columns,  # userid, itemid
    item_features=content_feature_df,
    seed=seed)

print(data_model)

Here I'm getting the error: NotImplementedError: Data has duplicate values

My dataframe has multiple entries per user, and I can't drop them. Any help here?


99sbr, Mar 25 '22 09:03

Hi!

The problem is not that your data contains multiple entries for a user, but that your data contains multiple entries of the same user-item pair. It's like having multiple ratings for the same movie from the same user. This is not a standard collaborative filtering scenario.

You need to deduplicate such entries, e.g., like this:

dedup_data = data.drop_duplicates(subset=['userid', 'movieid'])
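To make the difference concrete, here is a minimal sketch on a toy dataframe. The column names (`userid`, `movieid`, `rating`) and the values are assumptions for illustration only, not taken from the dataset in this issue. It counts repeated user-item pairs and shows two possible ways to collapse them; which one is appropriate depends on what the repeated entries mean in your data.

```python
import pandas as pd

# Toy interaction log; column names and values are illustrative assumptions.
data = pd.DataFrame({
    "userid":  [1, 1, 1, 2, 2],
    "movieid": [10, 10, 20, 10, 30],
    "rating":  [4.0, 5.0, 3.0, 2.0, 4.0],
})

pair = ["userid", "movieid"]

# Count rows that repeat an earlier (userid, movieid) pair.
n_repeats = int(data.duplicated(subset=pair).sum())

# Option 1: keep only the first occurrence of each pair.
dedup_first = data.drop_duplicates(subset=pair)

# Option 2: collapse repeats into one row by averaging the feedback value.
dedup_mean = data.groupby(pair, as_index=False)["rating"].mean()

print(n_repeats)
print(dedup_first)
print(dedup_mean)
```

Either way, user 1 still has two rows afterwards (one per distinct movie), so users with many interactions are kept; only the repeated pair is collapsed.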

evfro, Mar 26 '22 04:03

Understood, thanks for the help.

I'm facing one more blocker: data_model.prepare() takes a long time and seems to freeze when I run that step. Any idea why? I know my dataset is big, but is there any optimisation I can apply?

99sbr, Mar 26 '22 09:03