
Make the `at` parameter of `splitobs` accept vector

Open innerlee opened this issue 5 years ago • 10 comments

Currently `splitobs(data, at[, obsdim]) → NTuple` is slow when splitting data into many parts. For example, `@time splitobs(rand(100000), at=ntuple(i->1/10001, 10000))` will take forever to run.

Proposal: make `at` accept a vector, and return a vector of data subsets.

innerlee avatar Apr 20 '20 07:04 innerlee
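A minimal sketch of what the proposed vector-accepting interface could look like (the name `splitobs_vec`, the `SubArray` return type, and the remainder-becomes-last-split behavior are all assumptions for illustration, not part of MLDataPattern's current API):

```julia
# Hypothetical sketch of the proposed interface; not MLDataPattern's API.
# `at` is a vector of fractions; any remaining observations form a final split.
function splitobs_vec(data::AbstractVector, at::AbstractVector{<:Real})
    0 < sum(at) <= 1 || throw(ArgumentError("sum of fractions in `at` must lie in (0, 1]"))
    n = length(data)
    splits = SubArray[]
    lo = 1
    for frac in at
        hi = lo + round(Int, frac * n) - 1
        push!(splits, view(data, lo:hi))  # non-copying view, like splitobs
        lo = hi + 1
    end
    lo <= n && push!(splits, view(data, lo:n))  # remainder split
    return splits
end

parts = splitobs_vec(collect(1:10), [0.3, 0.3])  # three splits: 3 + 3 + 4 elements
```

Because the fractions live in a homogeneous `Vector`, nothing here depends on the length of `at` at compile time, which sidesteps the long-tuple problem entirely.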

Anyone?

innerlee avatar Apr 27 '20 06:04 innerlee

I don't have a use case beyond splitting data into the usual three parts: train/valid/test. If there's a real need for more splits in practice, then this could be a good idea.

P.S. If you want to make 10000 batches, you could use `RandomBatches(data, batch_sz, 10000)` instead. (Ref: https://mldatapatternjl.readthedocs.io/en/latest/documentation/dataiterator.html#RandomBatches)

johnnychen94 avatar Apr 28 '20 00:04 johnnychen94

I do not need random batches. I need a list of subsets whose disjoint union is the whole dataset.

`splitobs` has these semantics, but `RandomBatches` does not.

innerlee avatar Apr 28 '20 01:04 innerlee

In that case, I think `eachbatch`/`batchview` already works nicely.

This is a nice-to-have feature. I personally don't know of a real use case for general vector input other than fixed-size batches like the one you describe.

I'm personally not motivated to implement it, but if you could provide an implementation, I'd be glad to get it in.

johnnychen94 avatar Apr 28 '20 02:04 johnnychen94

Actually, not exactly: `batchview` drops elements.

```julia
julia> a = collect(1:10)
10-element Array{Int64,1}:
  1
  2
  3
  4
  5
  6
  7
  8
  9
 10

julia> b = batchview(a, 3)
3-element batchview(::Array{Int64,1}, 3, 3, ObsDim.Last()) with eltype SubArray:
 [1, 2, 3]
 [4, 5, 6]
 [7, 8, 9]
```

innerlee avatar Apr 28 '20 02:04 innerlee

> It drops elements.

Well, yes. However, as long as the data gets shuffled and the model is trained for many epochs, every sample will eventually be seen, so this won't visibly change the final results.

cref: https://github.com/JuliaML/MLDataPattern.jl/issues/14

johnnychen94 avatar Apr 28 '20 02:04 johnnychen94

BTW, I'm skeptical that it's the tuple/vector difference that affects the splitting performance.

> For example, `@time splitobs(rand(100000), at=ntuple(i->1/10001, 10000))` will take forever to run.

Looks like it's a Julia bug: `ntuple(i->1/10001, 10000)` itself takes forever.

johnnychen94 avatar Apr 28 '20 02:04 johnnychen94
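A quick way to see where the time goes (a sketch; the explanation below is an editorial gloss, not a measurement from this thread): building the same 10,000 split fractions as a plain `Vector` is essentially free, whereas the `ntuple` form produces a value of type `NTuple{10000,Float64}`, and compiling code specialized on such a long tuple type is what makes it so slow.

```julia
# Building the fractions as a homogeneous Vector: one cheap allocation.
ratios = fill(1/10001, 10_000)

# By contrast, ntuple(i -> 1/10001, 10_000) yields an NTuple{10000,Float64};
# the compiler must specialize on that enormous tuple type, which matches
# the "takes forever" behavior reported above.
```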

What about inference time? One cannot omit samples there.

innerlee avatar Apr 28 '20 02:04 innerlee

Yeah, one cannot easily generate long tuples. So if we change the interface to `Vector`, things will be easier.

innerlee avatar Apr 28 '20 02:04 innerlee

On second thought, this proposal makes sense, since the tuple/vector difference is usually not a performance bottleneck for ML tasks.

The long tuple issue is possibly related to https://github.com/JuliaLang/julia/issues/35547. ~~The current workaround for your example code is: `tuple(fill(1/10001, 10000))`~~

Let me see if I can find some time to rework `splitobs` during the Labor Day holiday.

johnnychen94 avatar Apr 28 '20 02:04 johnnychen94