MLDataPattern.jl
Make the `at` parameter of `splitobs` accept a vector
Currently `splitobs(data, at[, obsdim]) → NTuple` is slow when splitting data into many parts.
For example, `@time splitobs(rand(100000), at=ntuple(i->1/10001, 10000))` will take forever to run.
Proposal: make `at` accept a vector, and return a vector of data subsets.
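A minimal sketch of what the vector-accepting variant could look like. The name `splitobs_vec` and the exact boundary handling are assumptions for illustration, not the package API; here `at` holds fractions for the leading splits and the remainder becomes the last split, matching `splitobs` semantics:

```julia
# Hypothetical sketch: `at` is a vector of fractions for the leading splits;
# whatever data remains becomes the final split (as with `splitobs`).
function splitobs_vec(data::AbstractVector, at::AbstractVector{<:Real})
    n = length(data)
    # cumulative split boundaries, with 0 and n added for the first/last pieces
    cuts = [0; round.(Int, cumsum(at) .* n); n]
    [view(data, cuts[i]+1:cuts[i+1]) for i in 1:length(cuts)-1]
end

data = collect(1:10)
parts = splitobs_vec(data, [0.3, 0.3])
# three parts: 1:3, 4:6, and the remainder 7:10
```

Returning views keeps the splits allocation-free, and their disjoint union is the whole dataset.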
Anyone?
I don't have a use case beyond splitting data into three parts: train/valid/test. If there's a real need for this in practice, then it could be a good idea.
P.S. If you want to make 10000 batches, you could use `RandomBatches(data, batch_sz, 10000)` instead. (Ref: https://mldatapatternjl.readthedocs.io/en/latest/documentation/dataiterator.html#RandomBatches)
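For context, `RandomBatches` draws fixed-size batches by sampling observations with replacement. A rough plain-Julia sketch of that idea (the helper name `random_batches` is made up for illustration, not the package implementation):

```julia
using Random

# Illustrative sketch only: draw `n` batches of `batch_sz` observations each,
# sampled with replacement — conceptually what RandomBatches(data, batch_sz, n)
# iterates over.
function random_batches(data::AbstractVector, batch_sz::Int, n::Int;
                        rng = Random.default_rng())
    (view(data, rand(rng, 1:length(data), batch_sz)) for _ in 1:n)
end

batches = collect(random_batches(collect(1:100), 5, 3))
# 3 batches of 5 observations each, drawn at random
```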
I do not need random batches. I need a list of subsets whose disjoint union is the whole dataset.
`splitobs` has these semantics, but `RandomBatches` does not.
In that case, I think `eachbatch`/`batchview` already works nicely.
This is a nice-to-have feature. I personally don't know of a real use case for general vector input other than fixed-size batches like the one you describe.
I'm not personally motivated to implement it, but if you put together an implementation, I'd be glad to get it in.
Actually not exactly. It drops elements.
```julia
julia> a = collect(1:10)
10-element Array{Int64,1}:
  1
  2
  3
  4
  5
  6
  7
  8
  9
 10

julia> b = batchview(a, 3)
3-element batchview(::Array{Int64,1}, 3, 3, ObsDim.Last()) with eltype SubArray:
 [1, 2, 3]
 [4, 5, 6]
 [7, 8, 9]
```
It drops elements.
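If dropping the remainder is the concern, note that Base's `Iterators.partition` keeps the trailing partial chunk, so it covers the whole dataset (shown here as a generic illustration, not a drop-in for `batchview`):

```julia
a = collect(1:10)

# Iterators.partition keeps the final partial chunk instead of dropping it
b = collect(Iterators.partition(a, 3))
# four chunks; the last one, [10], is the remainder that batchview drops
```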
Well, yes. However, as long as the data get shuffled and the model is trained for many epochs, every sample will eventually be seen, so that won't visibly change the final results.
cref: https://github.com/JuliaML/MLDataPattern.jl/issues/14
BTW, I'm skeptical it's the tuple/vector difference that affects the splitting performance.
For example,
`@time splitobs(rand(100000), at=ntuple(i->1/10001, 10000))` will take forever to run.
Looks like it's a Julia issue: `ntuple(i->1/10001, 10000)` itself is the slow part.
What about type inference? And one cannot omit samples.
Yeah, one cannot easily construct long tuples. So if we change the interface to accept a `Vector`, things will be easier.
On second thought, this proposal makes sense, since the tuple/vector difference is usually not a performance bottleneck for ML tasks.
The long-tuple issue is possibly related to https://github.com/JuliaLang/julia/issues/35547. ~~The current workaround for your example code is: `tuple(fill(1/10001, 10000))`~~
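Worth noting about the struck-out workaround: lowercase `tuple(fill(1/10001, 10000))` wraps the vector in a 1-tuple rather than building a 10000-element tuple (that would be `Tuple(...)` or splatting, which is exactly the long-tuple case that stresses the compiler). A plain `Vector` sidesteps the problem entirely:

```julia
v = fill(1/10001, 10000)   # ordinary Vector{Float64}; constructed instantly
t = tuple(v)               # a 1-tuple containing the vector, NOT a 10000-tuple
# Tuple(v) or (v...,) would build the long NTuple, which is the slow case.
```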
Let me see if I can find some time to rework `splitobs` during the Labor Day holiday.