Create a unified demo notebook
Currently there are two demo notebooks (one in the examples folder, the other in Colab). We want to unify them into a single demo notebook, declutter the experience, and create examples for more datasets. It should be easy to add new datasets in the future.
- [ ] separate "demo utils" from the demo notebook
- [ ] create a unified demo notebook with Amazon Books dataset
- [ ] add an "Open in Colab" badge (generate at https://openincolab.com/ and convert to markdown) to the README
- [ ] create example notebook for one more dataset
I'd suggest the Cornac library to simplify dataset loading. I found it confusing that roughly 60% of the Colab notebook deals with data loading, unrelated to the actual model. Cornac's `Dataset` class has a `csr_matrix` getter (docs) and other useful helpers, such as automatically mapping string ids to integer indices.
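To illustrate the id-mapping helper: roughly speaking, cornac's `Dataset` maps string ids to contiguous integer indices and exposes the interactions as a sparse user-item matrix. A minimal sketch of that idea using plain scipy (the toy data and variable names are mine, not cornac internals):

```python
from scipy.sparse import csr_matrix

# Toy (user_id, item_id, rating) tuples with string ids
data = [("alice", "book1", 5.0), ("alice", "book2", 3.0), ("bob", "book1", 4.0)]

# Map string ids to contiguous integer indices (what cornac does for you)
uid_map = {u: i for i, u in enumerate(dict.fromkeys(u for u, _, _ in data))}
iid_map = {it: i for i, it in enumerate(dict.fromkeys(it for _, it, _ in data))}

rows = [uid_map[u] for u, _, _ in data]
cols = [iid_map[it] for _, it, _ in data]
vals = [r for _, _, r in data]

# Sparse user-item matrix, the format a model like SANSA is fit on
X = csr_matrix((vals, (rows, cols)), shape=(len(uid_map), len(iid_map)))
```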
Caveat: I simply use a `RatioSplit` instead of the hard-coded test targets used in your example Colab notebook. I don't know whether Cornac supports fixed splits, but I expect it does; cornac/examples/given_data.py looks promising.
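On the caveat above: as far as I can tell, `cornac.eval_methods.BaseMethod.from_splits` (the helper used in `given_data.py`) accepts pre-made train/test tuple lists, so the hard-coded test targets could be kept. A hedged sketch with made-up toy data (ids and parameter values are illustrative only):

```python
# Toy (user_id, item_id, rating) tuples; in the notebook these would come
# from the hard-coded train/test targets instead
train_data = [
    ("u1", "i1", 5.0),
    ("u1", "i2", 3.0),
    ("u2", "i1", 4.0),
    ("u2", "i3", 2.0),
]
test_data = [("u1", "i3", 4.0)]

try:
    from cornac.eval_methods import BaseMethod

    # Build an evaluation method around the predefined splits,
    # avoiding a random RatioSplit entirely
    eval_method = BaseMethod.from_splits(
        train_data=train_data,
        test_data=test_data,
        rating_threshold=2.0,
        exclude_unknowns=False,
    )
    X_train = eval_method.train_set.csr_matrix  # ready for model.fit(...)
except ImportError:
    X_train = None  # cornac not installed; the tuple format above is still valid
```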
The code I use is very simple:
```python
import pandas as pd
from cornac.eval_methods import RatioSplit

from sansa import SANSA

# Load data
df = pd.read_parquet("train-set.parquet")

# Convert dataframe to a list of (user_id, item_id, rating) tuples
data = df[["user_id", "item_id", "rating"]].values.tolist()

# Split into train and test sets
rs = RatioSplit(data, test_size=0.1, rating_threshold=2.0, seed=1551)

# Train model (`config` is a SANSA configuration object, built elsewhere)
model = SANSA(config)
model.fit(rs.train_set.csr_matrix)
```
On a side note, I first tried training EASE with ~500k items and kept getting crashes because it requires too much memory; I created this issue: https://github.com/PreferredAI/cornac/issues/654. Then I found SANSA :)