
Cross validation layer

o1lo01ol1o opened this issue 6 years ago · 1 comment

Looking over the Dataloader code, I immediately thought about integrating a private dataset to play with some Haskell code. This made me wonder if anyone has thought about adding a cross-validation layer on top of it. For some canonical datasets, there are predefined splits (test, train, validation), but for others, one would need to define these.

It would be nice if there were some code that could partition a given dataset according to k-fold and leave-p-out schemes. In the case of timeseries datasets, you'd also have to make sure that the partitions respect the temporal ordering.
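As a rough sketch of the k-fold case operating on plain index vectors (all names here are hypothetical, not part of dh-core): if the folds are taken as contiguous, in-order chunks, the same function also respects temporal ordering for timeseries data.

```haskell
import qualified Data.Vector as V
import Data.Vector (Vector)

-- Hypothetical sketch: split a dataset's index vector into k folds.
-- For fold i, the i-th contiguous chunk is the test set and the
-- remaining chunks (concatenated in order) are the train set.
kFolds :: Int -> Vector Int -> [(Vector Int, Vector Int)]
kFolds k idxs = [ (train i, chunk i) | i <- [0 .. k - 1] ]
  where
    n        = V.length idxs
    foldSize = n `div` k
    chunk i
      | i == k - 1 = V.drop (i * foldSize) idxs          -- last fold takes the remainder
      | otherwise  = V.slice (i * foldSize) foldSize idxs
    train i = V.concat [ chunk j | j <- [0 .. k - 1], j /= i ]
```

Because the chunks are contiguous and in order, a timeseries-aware variant could simply drop the folds that precede the test fold from the train set.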

o1lo01ol1o avatar Feb 13 '19 15:02 o1lo01ol1o

Yeah! Cross-validation is an excellent next step. When working on #22, I was trying to get a rough lay-of-the-land and didn't want to overcomplicate the PR. Toy CV benchmarks like MNIST and the CIFARs come pre-split into test and train, so I opted to avoid the scope creep.

I was hoping that all of the partitionings would operate on Vector Int and be passed into Dataloaders. The idea was that, given a Dataset, someone could write a function:

splits
  :: Vector Int     -- ^ dataset's index
  -> testspec       -- ^ TBD
  -> trainspec      -- ^ TBD
  -> (Vector Int, Vector Int)  -- ^ a test and train split of the indexes
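A minimal instantiation of that shape might be a plain holdout split. Since the spec types are TBD, this sketch assumes a single Double test fraction in place of the two spec arguments:

```haskell
import qualified Data.Vector as V
import Data.Vector (Vector)

-- Hypothetical holdout split: the first (testFrac * n) indexes become
-- the test set, the rest the train set.
holdout :: Vector Int -> Double -> (Vector Int, Vector Int)
holdout idxs testFrac = V.splitAt nTest idxs
  where nTest = round (testFrac * fromIntegral (V.length idxs))
```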

And then these Vector Int splits could be passed into a Dataloader's shuffle field, which just uses Data.Vector.backpermute under the hood.
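To illustrate the backpermute step on its own (a standalone sketch, not the Dataloader API):

```haskell
import qualified Data.Vector as V

-- Data.Vector.backpermute picks elements by index, which is how a
-- Vector Int split would subset (and reorder) the underlying data.
reordered :: V.Vector Char
reordered = V.backpermute (V.fromList "abcdef") (V.fromList [4, 0, 2])
-- elements at positions 4, 0, 2
```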

I didn't have time to follow up on this, but I was also thinking that it might be nice to refactor Datasets to have a unified streaming API and only have the Dataloader handle transforms and shuffling (which might change the API a smidge).

stites avatar Feb 13 '19 19:02 stites