Predictions on my own data: extreme memory use, any solution?
Description
Hi, I'm trying to run predictions on my own data. I need to get the final recommendations, so I followed the steps published in the FAQ. However, it's not working for me: the script crashes partway through because it uses up all the available RAM. First I tried on my computer (8 GB of RAM), then on an AWS server (128 GB of RAM), but that didn't work either. I don't know what to do next to solve this problem. Note that I'm posting SVD code here, but I also tried item-based KNN and ran out of memory as well. It looks like the problem happens on this line: predictions = algo.test(testset)
My problem involves the following dimensions: number of users: 134,632; number of items: 4,629.
I already looked through previous issues but didn't find anything useful for my case. Thank you
Steps/Code to Reproduce
from collections import defaultdict
from datetime import datetime
import csv

import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

from surprise import accuracy, Dataset, Reader, SVD
from surprise.model_selection import (
    cross_validate, GridSearchCV, KFold, train_test_split)
from surprise.prediction_algorithms.knns import (
    KNNBasic, KNNBaseline, KNNWithMeans, KNNWithZScore)
from surprise.similarities import cosine, msd, pearson, pearson_baseline
start_time = datetime.now()
def get_top_n(predictions, n=10):
    """Return the top-N recommendations for each user from a set of predictions.

    Args:
        predictions(list of Prediction objects): The list of predictions, as
            returned by the test method of an algorithm.
        n(int): The number of recommendations to output for each user. Default
            is 10.

    Returns:
        A dict where keys are user (raw) ids and values are lists of tuples:
        [(raw item id, rating estimation), ...] of size n.
    """
    # First map the predictions to each user.
    top_n = defaultdict(list)
    for uid, iid, true_r, est, _ in predictions:
        top_n[uid].append((iid, est))

    # Then sort the predictions for each user and keep the n highest ones.
    for uid, user_ratings in top_n.items():
        user_ratings.sort(key=lambda x: x[1], reverse=True)  # reverse=True keeps the highest estimates
        top_n[uid] = user_ratings[:n]

    return top_n
# First train an SVD algorithm
postulaciones = 'c1_postulaciones.csv'
df = pd.read_csv(postulaciones)  # adjust sep= if the file is not comma-separated
reader = Reader(rating_scale=(1, 12))
data = Dataset.load_from_df(df[["mrun", "rbd", "preferencia_postulante"]], reader)
trainset = data.build_full_trainset()
algo = SVD()
algo.fit(trainset)
# Then predict ratings for all pairs (u, i) that are NOT in the training set.
testset = trainset.build_anti_testset()
predictions = algo.test(testset)
top_n = get_top_n(predictions, n=1)
## DataFrame with the recommendations ##
columnas = ["mrun", "rbd"]
recomendaciones = pd.DataFrame(columns=columnas)

# Print the recommended items for each user
for uid, user_ratings in top_n.items():
    recomendacion = [iid for (iid, _) in user_ratings]
    print(uid, recomendacion)
    agregar = {'mrun': uid, 'rbd': recomendacion}
    recomendaciones = recomendaciones.append(agregar, ignore_index=True)
    print("Working")

recomendaciones['rbd'] = recomendaciones['rbd'].str[0]
recomendaciones.to_csv('recomendaciones.csv', sep=";")
Expected Results
Get final matrix with one recommendation per user
Actual Results
Script crashes, using at least 128 GB of RAM
Versions
macOS-10.15.5-x86_64-i386-64bit Python 3.8.3 (default, Jul 2 2020, 11:26:31) [Clang 10.0.0 ]
Hi, late answer, but I'd bet that building the anti-testset (the testset = trainset.build_anti_testset() line) will zap all your memory very easily. From the numbers you gave, you'd have up to 134,632 x 4,629 = 623,211,528 (user, item) combinations, which, let's just say, is a lot.
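A rough back-of-the-envelope sketch of why that doesn't fit in memory; the per-entry byte cost below is an assumption, not a measurement:

n_users, n_items = 134_632, 4_629
n_pairs = n_users * n_items         # up to 623,211,528 (user, item, fill) tuples
bytes_per_entry = 150               # assumed average size of one small Python tuple plus its fields
print(f"~{n_pairs * bytes_per_entry / 1e9:.0f} GB just to hold the anti-testset in memory")
# algo.test() then creates one Prediction object per pair on top of that.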
So what would be the best way to get around this situation? I'm in the exact same position, but I'm not sure what to do.
Would it be to cut the initial data set down to a significantly smaller size?
Or, given that I'm only interested in computing the predictions for one specific user, is there some way to specify which UID to predict for?
Or would I just have to do a for loop over every IID the particular user hasn't rated yet?
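If you only need recommendations for one user, you can skip build_anti_testset() entirely and loop over just the items that user hasn't rated. Here is a minimal sketch, assuming the algo and trainset variables from the code above; top_n_for_user is a hypothetical helper name, not part of the surprise API:

def top_n_for_user(algo, trainset, raw_uid, n=10):
    """Return the n highest-estimated items this user has not rated in the trainset."""
    inner_uid = trainset.to_inner_uid(raw_uid)
    rated = {inner_iid for (inner_iid, _) in trainset.ur[inner_uid]}
    predictions = [
        algo.predict(raw_uid, trainset.to_raw_iid(inner_iid))
        for inner_iid in trainset.all_items()
        if inner_iid not in rated
    ]
    predictions.sort(key=lambda p: p.est, reverse=True)
    return [(p.iid, p.est) for p in predictions[:n]]

# e.g. one recommendation for a single user:
# print(top_n_for_user(algo, trainset, some_raw_uid, n=1))

The same idea works for all users: process them one at a time and write each user's top-N to disk before moving on, so memory use stays roughly constant instead of growing with users x items.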