MLUtils.jl
MLUtils.jl copied to clipboard
`splitobs` and `DataLoader` make views, but say they require only `getobs`
It's surprising that splitobs and DataLoader make views, when they mention only getobs in their docstrings, which does not:
help?> splitobs
splitobs(data; at, shuffle=false) -> Tuple
Split the data into multiple subsets proportional to the value(s) of at.
If shuffle=true, randomly permute the observations before splitting.
Supports any datatype implementing the numobs and getobs interfaces.
[...]
julia> splitobs(ones(1,5); at=0.2)[1]
1×1 view(::Matrix{Float64}, :, 1:1) with eltype Float64:
1.0
julia> getobs(ones(1,5), 1:2) # what it says it does
1×2 Matrix{Float64}:
1.0 1.0
julia> obsview(ones(1,5), 1:2) # what it actually uses
1×2 view(::Matrix{Float64}, :, 1:2) with eltype Float64:
1.0 1.0
help?> DataLoader
search: DataLoader DataType
DataLoader(data; [batchsize, buffer, collate, parallel, partial, rng, shuffle])
An object that iterates over mini-batches of data, each mini-batch containing batchsize
observations (except possibly the last one).
Takes as input a single data array, a tuple (or a named tuple) of arrays, or in general any data
object that implements the numobs and getobs methods.
[...]
This means that they do not preserve OneHotArrays, which is https://github.com/FluxML/OneHotArrays.jl/issues/40 .
But more generally, perhaps copies are just safer?
Looking back through the blame, I'm guessing the idea was to not materialize large datasets. The easiest path would be to change the docs, but that doesn't solve the OneHotArrays issue. Maybe https://github.com/JuliaML/MLUtils.jl/blob/af7ebeacdf5a5e6e94e0db55a5dc24835ef260e0/src/obsview.jl#L224-L227 is too general and obsview should be returning ObsView for more types?
Yes I'm sure not materialising was the goal. However, some views are more useful than others. I wonder if the default should be something like splitobs(ones(1,10); at=0.5) makes two contiguous views, which are almost as good as Arrays, but splitobs(ones(1,10); at=0.5, shuffle=true) makes copies?
The OneHotArrays issue could be solved on that side, by changing what view does. Or more narrowly by changing what obsview does.
Regardless of solution, the docstring should 100% be updated to mention some type of view is returned with ObsView being the detault.
The OneHotArrays issue could be solved on that side, by changing what view does
How straightforward do you think this would be? Since https://github.com/FluxML/OneHotArrays.jl/issues/40 is mostly about performance, another idea would be adding a kwarg which controls whether a view or copy is returned.