lenskit
Implement data processing for sequential RMSE
We can implement sequential RMSE by preparing data sets for the train-test evaluator. For day d, the training data will consist of all ratings prior to d, and the test data will consist of the ratings for day d.
There are several pieces required to do this:
- [ ] Add timestamp limiting to the new DAO infrastructure (#948)
- [ ] Write a new class, `SequentialSplitter`, that works somewhat like `Crossfolder` in that it takes input data (which will be required to be a `PackedDataSource`) and an output directory and writes out split data. It should take a configurable time period for the splitting and a start time (e.g. to skip splitting for the first 30 days). It will then go through the ratings data in the input pack, and for each period it will write out a file such as `output-dir/test-ratings-100.csv` (for day 100); it will also write `output-dir/dataset-100.json` containing a `DataSetSpec` whose training data is the input pack with a limit timestamp that excludes ratings from the start of the time period onward, and whose test data is the CSV file of test ratings. The data set spec should have an attribute `period` that contains the period number (e.g. 100), and its name should also be configurable. Use `SpecUtils` to create the JSON file.
- [ ] Write a `SequentialSplit` CLI command, like `Crossfold` or `Simulate`, that uses the sequential splitter to split data.
- [ ] Write a `SequentialSplit` Gradle task, like `Crossfold`, that runs the sequential split command.
- [ ] Write a test for the sequential split Gradle task in `lenskit-integration-tests`.
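
To make the splitting rule concrete, here is a rough sketch of the core logic in plain Java. It is not LensKit code: `SequentialSplitSketch`, `Rating`, and every method name below are hypothetical, timestamps are assumed to be seconds, and ratings are held in memory instead of being read from a `PackedDataSource`. Writing the `DataSetSpec` JSON is left out, since that depends on `SpecUtils` and the timestamp-limiting work in #948.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class SequentialSplitSketch {
    /** Hypothetical stand-in for a rating; not a LensKit type. */
    static final class Rating {
        final long user;
        final long item;
        final double value;
        final long timestamp; // assumed to be in seconds

        Rating(long user, long item, double value, long timestamp) {
            this.user = user;
            this.item = item;
            this.value = value;
            this.timestamp = timestamp;
        }
    }

    /** Period number of a timestamp, counted from an origin time. */
    static long periodOf(long timestamp, long origin, long periodSeconds) {
        return Math.floorDiv(timestamp - origin, periodSeconds);
    }

    /** Training data for a period is everything strictly before this timestamp. */
    static long trainingLimit(long period, long origin, long periodSeconds) {
        return origin + period * periodSeconds;
    }

    /** Group ratings into test sets by period, skipping periods before startPeriod. */
    static Map<Long, List<Rating>> groupTestRatings(
            List<Rating> ratings, long origin, long periodSeconds, long startPeriod) {
        Map<Long, List<Rating>> byPeriod = new TreeMap<>();
        for (Rating r : ratings) {
            long p = periodOf(r.timestamp, origin, periodSeconds);
            if (p >= startPeriod) {
                byPeriod.computeIfAbsent(p, k -> new ArrayList<>()).add(r);
            }
        }
        return byPeriod;
    }

    /** Write one test-ratings-N.csv file per period into outputDir. */
    static void writeTestFiles(Map<Long, List<Rating>> byPeriod, Path outputDir)
            throws IOException {
        Files.createDirectories(outputDir);
        for (Map.Entry<Long, List<Rating>> e : byPeriod.entrySet()) {
            List<String> lines = new ArrayList<>();
            for (Rating r : e.getValue()) {
                lines.add(r.user + "," + r.item + "," + r.value + "," + r.timestamp);
            }
            Files.write(outputDir.resolve("test-ratings-" + e.getKey() + ".csv"), lines);
        }
    }
}
```

With a one-day period (86400 seconds), the data set for period 100 would pair the file `test-ratings-100.csv` with training data limited to timestamps below `trainingLimit(100, origin, 86400)`, which matches the rule that day d trains on all ratings prior to d.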