[FEA] Negative sampling for positive-only datasets
Motivation
Public datasets are generally provided with negative samples, which makes it easier to train and compare results across algorithms. However, the most common industry use case is a dataset containing only the users' interactions (positive-only), as items the user might have seen but did not interact with are usually not logged. Most modern neural architectures need negative candidates for efficient training, because the item catalog of a large-scale recsys is in the order of millions.
Requirements
RQ01 - Be Available in both NVT Pre-processing and Data Loading
The candidate sampling should be primarily performed by the NVT Data Loader, so that each positive sample can get different negative samples across epochs. But it should also be available during pre-processing, for cases where you would like to persist fixed negative samples to compare different training algorithms that might not use the NVT Data Loader.
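To illustrate the per-epoch behavior described above, here is a minimal numpy sketch (the function name and signature are hypothetical, not the NVTabular API): the sampler is re-invoked each epoch, so the same positives receive fresh negatives.

```python
import numpy as np

def sample_negatives(positive_items, catalog_size, n_negatives, rng):
    """Hypothetical helper: draw uniform negatives for each positive
    interaction. Calling it again (e.g. at the next epoch) yields
    different negatives for the same positives."""
    negatives = rng.integers(0, catalog_size, size=(len(positive_items), n_negatives))
    # Resample any accidental collisions with the positive item itself.
    for row, pos in enumerate(positive_items):
        mask = negatives[row] == pos
        while mask.any():
            negatives[row, mask] = rng.integers(0, catalog_size, size=mask.sum())
            mask = negatives[row] == pos
    return negatives

rng = np.random.default_rng(42)
positives = np.array([3, 7, 1])
epoch1 = sample_negatives(positives, catalog_size=100, n_negatives=4, rng=rng)
epoch2 = sample_negatives(positives, catalog_size=100, n_negatives=4, rng=rng)
```

Persisting `epoch1` to disk would correspond to the fixed pre-processing variant; redrawing per epoch corresponds to the data-loader variant.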
RQ02 - Feature Sets config
Provide a configuration of feature sets to bring RecSys taxonomy to some important features. That configuration will be used during NVT pre-processing, and should be persisted so that it is also available to the NVT Data Loaders and to custom training/eval scripts. The minimum feature sets to allow candidate sampling managed by NVT and temporal dataset split are:
- Item id feature - Used by candidate sampling, as it is the key that represents a candidate item.
- Item metadata features - Used by candidate sampling when item metadata attributes are fed as input features for recommendation (for hybrid recommendation architectures like W&D and DLRM), because those item metadata features must be provided for both positive and negative samples.
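A feature-set configuration along these lines could be a small serializable mapping from taxonomy roles to dataset columns. The keys and column names below are illustrative assumptions, not the actual NVTabular schema:

```python
import json

# Hypothetical feature-set config (RQ02): maps RecSys roles to columns.
feature_sets = {
    "item_id": "item_id",                             # key used by candidate sampling
    "item_metadata": ["category", "brand", "price"],  # joined onto negatives as well
    "timestamp": "event_time",                        # enables temporal split / recency
}

# Persisted during pre-processing so data loaders and custom scripts can reuse it.
serialized = json.dumps(feature_sets)
```

Persisting the config as JSON keeps it readable by both the NVT Data Loader and framework-agnostic training/eval scripts.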
RQ03 - Recommendable items set
Provide the following methods to form the recommendable items set, composed of items that were available to users at a given point in time, to be considered as valid negative samples:
- Global - All items in the dataset are considered recommendable
- Temporal - For a given training or eval batch:
  - Past - All previously observed items are recommendable
  - Recent - All items with events observed within the last N minutes/hours/days are recommendable
  - Recent batches - Only items within the current batch or previous batches (a buffer) are recommendable, assuming that batches are mildly sorted by time
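The "Recent batches" option above can be sketched with a small buffer of past batches. The class name and interface are hypothetical, not part of NVTabular:

```python
from collections import deque
import numpy as np

class RecentBatchesSampler:
    """Hypothetical sketch of the 'Recent batches' recommendable set:
    only items seen in the current batch or the last `buffer_size`
    batches are eligible as negatives, assuming batches are mildly
    sorted by time."""

    def __init__(self, buffer_size, rng=None):
        self.buffer = deque(maxlen=buffer_size)
        self.rng = rng or np.random.default_rng()

    def sample(self, batch_item_ids, n_negatives):
        # Recommendable set = current batch ∪ buffered past batches.
        pool = np.unique(
            np.concatenate([batch_item_ids, *self.buffer])
            if self.buffer else batch_item_ids
        )
        self.buffer.append(np.asarray(batch_item_ids))
        return self.rng.choice(pool, size=n_negatives, replace=True)

sampler = RecentBatchesSampler(buffer_size=2, rng=np.random.default_rng(0))
negs = sampler.sample(np.array([1, 2, 3]), n_negatives=5)
# negs can only contain 1, 2, or 3: only the current batch has been seen so far
```

Because the buffer is bounded, older items age out automatically, which approximates a sliding time window when batches are roughly chronological.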
RQ04 - Sampling methods
Provide the following methods for negative sampling from the recommendable items set:
- Uniform sampling - All recommendable items have the same probability of being sampled
- Recency sampling - Fresh items have a higher probability of being sampled. This one requires keeping a table with the first timestamp at which each item was seen (i.e. its “release” date), to compute the “age” of the item at a given point in time
- Popularity sampling - Probability is the item’s past popularity (normalized by the popularity of all other items).
- Recent Popularity sampling - Probability is the item’s relative popularity within a recent time frame (e.g. 1 hour / day / week)
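As a sketch of the "Recent Popularity" option, the sampling distribution can be computed from interaction counts restricted to a recent time window. The helper below is illustrative (names and signature are assumptions, not library API):

```python
import numpy as np

def popularity_probs(item_ids, timestamps, now, window):
    """Hypothetical helper for 'Recent Popularity sampling' (RQ04):
    probability proportional to an item's interaction count within the
    last `window` time units, normalized over all recent items."""
    recent = timestamps >= (now - window)
    items, counts = np.unique(item_ids[recent], return_counts=True)
    return items, counts / counts.sum()

item_ids = np.array([1, 1, 2, 3, 3, 3])
ts = np.array([10, 50, 55, 60, 61, 5])
items, probs = popularity_probs(item_ids, ts, now=62, window=15)
# within the last 15 time units: item 1 (x1), item 2 (x1), item 3 (x2)
# → probs [0.25, 0.25, 0.5]
```

The same helper with `window` set to the full dataset span reduces to plain Popularity sampling.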
References: Doc - NVTabular - Requirements on pre-processing for session-based recommendation and candidate sampling
As a side note, I recently read the paper "Are We Evaluating Rigorously? Benchmarking Recommendation for Reproducible Evaluation and Fair Comparison" from RecSys 2020, where they perform a rigorous evaluation of many algorithms, datasets, preprocessing strategies, loss functions and negative sampling strategies. In section 3.3, they show that Uniform Sampling, although simple, usually produced models with better accuracy than Popularity negative sampling.
It is important to note that their models use only user and item ids (CF). But for models leveraging additional features (e.g. item popularity, target encoding of item id), such features could leak which samples are the positives (usually popular items) and which are the negatives (usually unpopular items, if uniformly sampled).
Thus, it is also important to provide popularity-based negative sampling in this feature, and maybe a setting to control the percentage of negative items sampled from the uniform and from the popularity distributions, like in this paper (Section 4.1).
I have implemented an example of sampling with cuDF, where you can set a continuous parameter which ranges between 0.0 (uniform sampling) and 1.0 (popularity sampling). This provides more flexibility to the user, and might be a hyperparameter in the training pipeline.
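The blend described above can be written as a convex combination of the two distributions. This is a numpy rendering of the idea, not the original cuDF code:

```python
import numpy as np

def blended_sampling_probs(item_counts, alpha):
    """Blend uniform and popularity distributions with one knob:
    alpha=0.0 → uniform sampling, alpha=1.0 → popularity sampling.
    (Illustrative sketch; the issue author's version uses cuDF.)"""
    popularity = item_counts / item_counts.sum()
    uniform = np.full_like(popularity, 1.0 / len(item_counts))
    return (1 - alpha) * uniform + alpha * popularity

counts = np.array([8.0, 1.0, 1.0])
uni = blended_sampling_probs(counts, 0.0)   # [1/3, 1/3, 1/3]
pop = blended_sampling_probs(counts, 1.0)   # [0.8, 0.1, 0.1]
half = blended_sampling_probs(counts, 0.5)  # halfway between the two
```

Since both inputs are valid distributions, any `alpha` in [0, 1] yields probabilities that still sum to 1, so the knob can be tuned freely as a hyperparameter.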
Side note: I have a lot of question marks about negative sampling strategies, loss functions, and offline evaluation after reading "How Sensitive is Recommendation Systems’ Offline Evaluation to Popularity?" Although the paper is framed as being about evaluation, I think it's also revealing about the impact of different sampling strategies (e.g. BPR vs. WARP) on popularity-related biases. This is an area I'd love to explore and understand better.
Hey! Not sure if I have the "right" to comment on this FEA as a simple library user, but I was curious whether this was ever implemented. We are especially interested in the implementation of the so-called time-based recommendable items set, to generate realistic negative samples. Looking at the current docs of the library for NVT and Merlin Models, I didn't see any clear evidence that this was implemented; can you confirm?
@guillaume-chech for sampling please see the models lib; for example, we have in-batch negative sampling implemented.
Hey @rnyak, yes, I did have a look at this; it's a fair approximation, yet not equivalent to what is described in RQ03. So I guess this was not implemented.