
Implement data processing for sequential RMSE


We can implement sequential RMSE by preparing data sets for the train-test evaluator. For day d, the training data will consist of all ratings prior to d, and the test data will consist of the ratings for day d.
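As a rough illustration of the per-day rule (using a hypothetical `Rating` record rather than LensKit's actual event types), the partition for a single day might look like this:

```java
import java.util.List;
import java.util.stream.Collectors;

public class DailySplitSketch {
    /** Hypothetical rating record; the real code would use LensKit's rating events. */
    public record Rating(long user, long item, double value, long timestamp) {}

    /** Training data for day d: all ratings strictly before the day's start. */
    public static List<Rating> trainingFor(List<Rating> ratings, long dayStart) {
        return ratings.stream()
                .filter(r -> r.timestamp() < dayStart)
                .collect(Collectors.toList());
    }

    /** Test data for day d: ratings with timestamps inside [dayStart, dayEnd). */
    public static List<Rating> testFor(List<Rating> ratings, long dayStart, long dayEnd) {
        return ratings.stream()
                .filter(r -> r.timestamp() >= dayStart && r.timestamp() < dayEnd)
                .collect(Collectors.toList());
    }
}
```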

There are several pieces required to do this:

  • [ ] Add timestamp limiting to the new DAO infrastructure (#948)
  • [ ] Write a new class, SequentialSplitter, that works somewhat like Crossfolder: it takes input data (required to be a PackedDataSource) and an output directory, and writes out split data. It should take a configurable time period for the splitting and a start time (e.g. to skip splitting for the first 30 days). It will then go through the ratings data in the input pack, and for each period write out a file such as output-dir/test-ratings-100.csv (for day 100); it will also write output-dir/dataset-100.json containing a DataSetSpec whose training data is the input pack with a limit timestamp restricting it to ratings before the time period, and whose test data is the CSV file of test ratings. The data set spec should have a period attribute containing the period number (e.g. 100), and its name should also be configurable. Use SpecUtils to create the JSON file. A sketch of the core loop appears after this list.
  • [ ] Write a SequentialSplit CLI command, like Crossfold or Simulate, that uses the sequential splitter to split data.
  • [ ] Write a SequentialSplit Gradle task, like Crossfold, that runs the sequential split command.
  • [ ] Write a test for the sequential split Gradle task in lenskit-integration-tests.
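
For the SequentialSplitter item above, here is a hedged sketch of what the core loop could look like. The `Rating` record, the CSV columns, the JSON field names, and the period numbering are all assumptions made for illustration; the real implementation would read a PackedDataSource, build DataSetSpec objects, and serialize them with SpecUtils rather than writing JSON by hand.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;

/**
 * Sketch of the core SequentialSplitter loop; types and file layout are illustrative.
 */
public class SequentialSplitterSketch {
    public record Rating(long user, long item, double value, long timestamp) {}

    private final Path outputDir;
    private final long periodLength;   // length of one period (e.g. one day), in timestamp units
    private final long startTime;      // do not emit splits before this time (e.g. first 30 days)

    public SequentialSplitterSketch(Path outputDir, long periodLength, long startTime) {
        this.outputDir = outputDir;
        this.periodLength = periodLength;
        this.startTime = startTime;
    }

    public void split(List<Rating> ratings, String inputPackPath) throws IOException {
        Files.createDirectories(outputDir);
        long maxTs = ratings.stream().mapToLong(Rating::timestamp).max().orElse(startTime);

        for (long periodStart = startTime; periodStart <= maxTs; periodStart += periodLength) {
            long periodEnd = periodStart + periodLength;
            int period = (int) ((periodStart - startTime) / periodLength);

            // Test ratings: everything falling inside this period.
            List<Rating> test = ratings.stream()
                    .filter(r -> r.timestamp() >= periodStart && r.timestamp() < periodEnd)
                    .collect(Collectors.toList());
            if (test.isEmpty()) {
                continue; // nothing to evaluate in this period
            }

            Path testFile = outputDir.resolve(String.format("test-ratings-%d.csv", period));
            Files.write(testFile, test.stream()
                    .map(r -> r.user() + "," + r.item() + "," + r.value() + "," + r.timestamp())
                    .collect(Collectors.toList()));

            // Data set spec: training = input pack limited to ratings before periodStart,
            // test = the CSV just written.  A real implementation would build a DataSetSpec
            // and serialize it with SpecUtils instead of formatting JSON by hand.
            String spec = String.format(
                    "{\"name\": \"period-%1$d\", \"period\": %1$d,%n" +
                    "  \"train\": {\"pack\": \"%2$s\", \"limitTimestamp\": %3$d},%n" +
                    "  \"test\": {\"file\": \"%4$s\"}}",
                    period, inputPackPath, periodStart, testFile.getFileName());
            Files.writeString(outputDir.resolve(String.format("dataset-%d.json", period)), spec);
        }
    }
}
```

The SequentialSplit CLI command and Gradle task would then be thin wrappers that configure and run this class, in the same way Crossfold wraps Crossfolder.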

mdekstrand · Nov 20, 2015