MLUtils.jl

Status of MLDataPattern porting

Open CarloLucibello opened this issue 3 years ago • 6 comments

A list of everything currently exported from MLDataPattern.jl, with its porting status.

TO PORT

  • [x] getobs, getobs! and nobs. #1
    • nobs is now numobs
    • the obsdim argument is dropped from the interface
  • [x] randobs #1
  • [x] datasubset, DataSubset #4
  • [x] shuffleobs #5
  • [x] splitobs #5
  • [x] DataView #5
    • [ ] Consider removal
  • [x] obsview, ObsView #5
    • [ ] Consider removal #8
  • [x] batchview, BatchView #6
  • [x] batchsize #6
  • [ ] slidingwindow, SlidingWindow
  • [ ] stratifiedobs
  • [x] oversample, undersample #10
  • [x] kfolds #9
  • [x] leaveout #9
  • [x] eachobs #9
  • [x] eachbatch #9
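
As an illustration of the renamed interface in the checklist above, here is a minimal sketch. It assumes MLUtils.jl is installed and that observations lie along the last array dimension, which is how the interface behaves once obsdim is dropped. The data and variable names are made up for the example.

```julia
using MLUtils  # assumes the MLUtils.jl package is installed

X = rand(Float32, 4, 10)   # 4 features, 10 observations along the last dimension
y = collect(1:10)          # one label per observation

n = numobs(X)                  # `numobs` replaces MLDataPattern's `nobs`
first_three = getobs(X, 1:3)   # no `obsdim` argument: the last dimension indexes observations
pair = getobs((X, y), 2)       # a tuple of containers is indexed jointly

# Split into the first 70% and the remaining 30% of the observations.
train, test = splitobs((X, y); at = 0.7)
```

Here `numobs(train)` and `numobs(test)` report how many observations ended up in each part.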

NOT TO BE PORTED

  • BufferGetObs
  • RandomObs, RandomBatches
  • BalancedObs
  • FoldView
  • targets
  • eachtarget

CarloLucibello, Dec 27 '21 11:12

We can consider this essentially done.

CarloLucibello, Jan 30 '22 09:01

Hi, what about stratifiedobs and slidingwindow? Were they excluded on purpose? Thanks.

rmkn85, Aug 04 '22 21:08

Not really, we just didn't port code that we weren't sure was going to be useful. I think stratifiedobs should go in. I'm less sure about slidingwindow, but I haven't looked much into it or into the alternatives in the ecosystem.

CarloLucibello, Aug 05 '22 07:08

Just to clarify, I came here specifically for the missing stratifiedobs. It is needed to replicate the behaviour of Python's sklearn.model_selection.train_test_split(..., stratify=y), where y holds the class labels.
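
To make the requested behaviour concrete, here is a minimal sketch of a stratified split that uses only the Julia standard library. The helper name `stratified_split` and its `at` keyword are hypothetical; this is neither the MLDataPattern.jl implementation of stratifiedobs nor part of MLUtils.jl. It only shows the idea of splitting each class separately so that class proportions are roughly preserved in both parts.

```julia
using Random

# Hypothetical helper (not part of MLUtils.jl): returns train and test index
# vectors in which every class keeps roughly the fraction `at` in the train part.
function stratified_split(rng::AbstractRNG, labels::AbstractVector; at::Real = 0.7)
    0 < at < 1 || throw(ArgumentError("`at` must lie strictly between 0 and 1"))
    train = Int[]
    test = Int[]
    for class in unique(labels)
        # Shuffle the indices of this class, then cut them at the requested fraction.
        idx = shuffle(rng, findall(isequal(class), labels))
        ntrain = round(Int, at * length(idx))
        append!(train, idx[1:ntrain])
        append!(test, idx[ntrain+1:end])
    end
    # Shuffle again so the classes are interleaved within each part.
    return shuffle!(rng, train), shuffle!(rng, test)
end

labels = vcat(fill(:a, 80), fill(:b, 20))   # imbalanced toy labels
train_idx, test_idx = stratified_split(MersenneTwister(1), labels; at = 0.75)
```

With these toy labels, both parts keep the 80/20 class ratio. The returned indices can be used to index the features and the labels together.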

I asked about slidingwindow in passing, since it was the only other unchecked item that is not in the explicit "not to be ported" list, but I don't have a use case for it.

rmkn85, Aug 05 '22 07:08

I often use slidingwindow for time series data. I haven't looked much for a replacement, but the closest I've found is partition from IterTools.jl. It has a similar interface but returns an iterator of tuples.

kpa28-git, Aug 11 '22 09:08

I also found DSP.Periodograms.arraysplit, which is similar to slidingwindow except that you set the overlap instead of the stride. So far slidingwindow is the fastest of the three because it returns views.
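
For illustration, here is a minimal sketch of stride-based windows returned as views, using only Base Julia. The helper `sliding_views` and its arguments are hypothetical and are not the MLDataPattern.jl slidingwindow implementation. For a fixed window length, the overlap between consecutive windows is the length minus the stride, which is how the two parameterizations relate.

```julia
# Hypothetical helper (not the MLDataPattern.jl implementation): windows of
# length `len` whose start positions are `stride` elements apart.
function sliding_views(x::AbstractVector, len::Integer; stride::Integer = 1)
    (len >= 1 && stride >= 1) || throw(ArgumentError("`len` and `stride` must be positive"))
    starts = firstindex(x):stride:(lastindex(x) - len + 1)
    # `view` references the parent array instead of copying the window's data.
    return [view(x, s:(s + len - 1)) for s in starts]
end

x = collect(1:10)
len, stride = 4, 2
windows = sliding_views(x, len; stride = stride)
overlap = len - stride   # the equivalent overlap-based parameterization
```

Because each window is a view, no window data is copied, and a later change to `x` is visible through every window that covers the changed element.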

kpa28-git, Aug 11 '22 09:08