
Splitting the data

Evizero opened this issue · 1 comment

We should probably exchange ideas on data sampling / splitting.

The approach I am currently using is very memory-friendly for huge datasets: shuffle the data matrix in place and then take contiguous array views.

Let X be a 10x10000000 Array{Float64,2}

julia> @time shuffle!(X)
elapsed time: 1.202112921 seconds (80 bytes allocated)

julia> @time train = view(X, :, 1:7000000)
elapsed time: 1.7596e-5 seconds (192 bytes allocated)

julia> @time test = view(X, :, 7000001:10000000)
elapsed time: 1.3097e-5 seconds (192 bytes allocated)

I couldn't find a better way so far. It does have its limitations when the sampling should be sensitive to the class distribution, though.
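One caveat worth noting: `shuffle!(X)` on a matrix permutes individual *elements*, not whole columns, so it mixes features across observations. A minimal sketch of a column-wise variant (the names `shufflecols!` and `splitcols` are made up here, not part of any package) that shuffles observations in place and then takes contiguous views:

```julia
using Random

# Fisher–Yates shuffle over *columns* (observations), keeping each
# column's features together. By contrast, shuffle!(X) on a matrix
# permutes all elements individually.
function shufflecols!(X::AbstractMatrix; rng::AbstractRNG=Random.default_rng())
    n = size(X, 2)
    for j in n:-1:2
        k = rand(rng, 1:j)
        if k != j
            for i in axes(X, 1)
                X[i, j], X[i, k] = X[i, k], X[i, j]
            end
        end
    end
    return X
end

# After shuffling in place, a split is just two contiguous views;
# no observation data is copied.
function splitcols(X::AbstractMatrix, at::Real)
    n = size(X, 2)
    k = floor(Int, at * n)
    return view(X, :, 1:k), view(X, :, k+1:n)
end
```

With `X` as above, `shufflecols!(X); train, test = splitcols(X, 0.7)` would reproduce the 70/30 split without copying. For the class-distribution issue, one option along the same lines is to shuffle a per-class vector of column indices and draw the split proportionally from each class, at the cost of non-contiguous views.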

Evizero, Sep 04 '15 19:09

See the nnet/data.jl source file... I'm using the idea of fixed arrays of DataPoint objects, plus wrappers that access those arrays in different ways. I haven't benchmarked it thoroughly, though.
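For readers without the repo handy, a rough sketch of that pattern (hypothetical types, not the actual nnet/data.jl code): points are stored once in a plain `Vector`, and lightweight wrappers hold index vectors into it, so a train/test split or a reordered view is just a different index set over the same storage.

```julia
# Hypothetical sketch of the pattern described above, not the
# actual nnet/data.jl implementation.
struct DataPoint
    x::Vector{Float64}   # input features
    y::Float64           # target
end

# A wrapper that accesses the fixed array through its own index set;
# different wrappers over the same Vector share the same storage.
struct DataSubset
    data::Vector{DataPoint}
    indices::Vector{Int}
end

Base.length(s::DataSubset) = length(s.indices)
Base.getindex(s::DataSubset, i::Integer) = s.data[s.indices[i]]

# Splitting allocates only small index vectors, never copies of points.
function splitpoints(data::Vector{DataPoint}, at::Real)
    k = floor(Int, at * length(data))
    return DataSubset(data, collect(1:k)),
           DataSubset(data, collect(k+1:length(data)))
end
```

Stratified or class-sensitive sampling then becomes a matter of constructing the index vectors differently, without touching the underlying array.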


tbreloff, Sep 04 '15 21:09