
Make the `at` parameter of `splitobs` accept vector

Open innerlee opened this issue 5 years ago • 10 comments

Currently `splitobs(data, at[, obsdim]) → NTuple` is slow when splitting data into many parts. For example, `@time splitobs(rand(100000), at=ntuple(i->1/10001, 10000))` will take forever to run.

Proposal: make `at` accept a vector, and return a vector of data subsets.

innerlee avatar Apr 20 '20 07:04 innerlee
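A minimal sketch of what the proposed vector-accepting interface could look like (the name `splitobs_vec`, the `SubArray` return type, and the remainder-becomes-last-split behavior are all assumptions for illustration, not part of MLDataPattern's current API):

```julia
# Hypothetical sketch of the proposed interface; not MLDataPattern's API.
# `at` is a vector of fractions; any remaining observations form a final split.
function splitobs_vec(data::AbstractVector, at::AbstractVector{<:Real})
    0 < sum(at) <= 1 || throw(ArgumentError("sum of fractions in `at` must lie in (0, 1]"))
    n = length(data)
    splits = SubArray[]
    lo = 1
    for frac in at
        hi = lo + round(Int, frac * n) - 1
        push!(splits, view(data, lo:hi))  # non-copying view, like splitobs
        lo = hi + 1
    end
    lo <= n && push!(splits, view(data, lo:n))  # remainder split
    return splits
end

parts = splitobs_vec(collect(1:10), [0.3, 0.3])  # three splits: 3 + 3 + 4 elements
```

Because the fractions live in a homogeneous `Vector`, nothing here depends on the length of `at` at compile time, which sidesteps the long-tuple problem entirely.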

Anyone?

innerlee avatar Apr 27 '20 06:04 innerlee

I don't have a use case beyond splitting data into the usual three parts: train/valid/test. If there's a real need for more splits in practice, then this could be a good idea.

P.S. If you want to make 10000 batches, you could use `RandomBatches(data, batch_sz, 10000)` instead. (Ref: https://mldatapatternjl.readthedocs.io/en/latest/documentation/dataiterator.html#RandomBatches)

johnnychen94 avatar Apr 28 '20 00:04 johnnychen94

I do not need random batches. I need a list of subsets whose disjoint union is the whole dataset.

`splitobs` has these semantics, but `RandomBatches` does not.

innerlee avatar Apr 28 '20 01:04 innerlee

In that case, I think `eachbatch`/`batchview` already works nicely.

This is a nice-to-have feature. I personally don't know of a real use case for general vector input other than fixed-size batches like the one you describe.

I'm personally not motivated to implement it, but if you could provide an implementation, I'd be glad to get it in.

johnnychen94 avatar Apr 28 '20 02:04 johnnychen94

Actually, not exactly: `batchview` drops elements.

```julia
julia> a = collect(1:10)
10-element Array{Int64,1}:
  1
  2
  3
  4
  5
  6
  7
  8
  9
 10

julia> b = batchview(a, 3)
3-element batchview(::Array{Int64,1}, 3, 3, ObsDim.Last()) with eltype SubArray:
 [1, 2, 3]
 [4, 5, 6]
 [7, 8, 9]
```

innerlee avatar Apr 28 '20 02:04 innerlee

> It drops elements.

Well, yes. However, as long as the data gets shuffled and the model is trained for many epochs, every sample will eventually be seen, so this won't visibly change the final results.

cref: https://github.com/JuliaML/MLDataPattern.jl/issues/14

johnnychen94 avatar Apr 28 '20 02:04 johnnychen94

BTW, I'm skeptical that it's the tuple/vector difference that affects the splitting performance.

> For example, `@time splitobs(rand(100000), at=ntuple(i->1/10001, 10000))` will take forever to run.

Looks like it's a Julia bug: `ntuple(i->1/10001, 10000)` itself takes forever.

johnnychen94 avatar Apr 28 '20 02:04 johnnychen94
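A quick way to see where the time goes (a sketch; the explanation below is an editorial gloss, not a measurement from this thread): building the same 10,000 split fractions as a plain `Vector` is essentially free, whereas the `ntuple` form produces a value of type `NTuple{10000,Float64}`, and compiling code specialized on such a long tuple type is what makes it so slow.

```julia
# Building the fractions as a homogeneous Vector: one cheap allocation.
ratios = fill(1/10001, 10_000)

# By contrast, ntuple(i -> 1/10001, 10_000) yields an NTuple{10000,Float64};
# the compiler must specialize on that enormous tuple type, which matches
# the "takes forever" behavior reported above.
```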

What about inference time? One cannot omit samples there.

innerlee avatar Apr 28 '20 02:04 innerlee

Yeah, one cannot easily generate long tuples. So if we change the interface to `Vector`, things will be easier.

innerlee avatar Apr 28 '20 02:04 innerlee

On second thought, this proposal makes sense, since the tuple/vector difference is usually not a performance bottleneck for ML tasks.

The long tuple issue is possibly related to https://github.com/JuliaLang/julia/issues/35547. ~~The current workaround for your example code is: `tuple(fill(1/10001, 10000))`~~

Let me see if I can find some time to rework `splitobs` during the Labor Day holiday.

johnnychen94 avatar Apr 28 '20 02:04 johnnychen94