MLUtils.jl
`mapobs` doesn't work with named tuples
If `getobs` returns a named tuple, then `mapobs` will fail.
Closing, since I guess the solution here is to pass `f` as a named tuple to `mapobs`?
The current `mapobs` story is very confusing. `mapobs(fs::NamedTuple, data)` will call a separate `f` in `fs` on each observation of `data`. Why does `NamedTupleData` exist? Can't it just be `(; (k => mapobs(f, data) for (k, f) in pairs(fs))...)`?
This is separate from the original issue, which is that `mapobs(f, data)` will fail when `getobs` is called with an `AbstractVector` of indices and `getobs(data, ...)` returns a named tuple. This is because `mapobs` will attempt to broadcast `f` over the named tuple. Instead, there needs to be some indirection based on the return type of `getobs(data, ...)` to handle the tuple and named tuple cases.
Maybe something like this would work:
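```julia
# Single index: apply f to one observation.
Base.getindex(data::MLUtils.MappedData, idx::Integer) = data.f(getobs(data.data, idx))
# Vector of indices: map observation by observation, then collate with `batch`.
Base.getindex(data::MLUtils.MappedData, idxs::AbstractVector) =
    batch(map(Base.Fix1(getindex, data), idxs))
```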
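The broadcast failure described above can be reproduced without MLUtils. The following is a minimal sketch in plain Julia, assuming only that the batched path ends up broadcasting `f` over whatever `getobs` returned; `f` and `obs` are hypothetical stand-ins, not names from the library.

```julia
# Hypothetical per-observation function and a named-tuple "batch".
f(x) = x .+ 1
obs = (x = [1, 2, 3], y = [4, 5, 6])

# Broadcasting over a NamedTuple is reserved in Base and throws.
err = try
    f.(obs)
    nothing
catch e
    e
end
@assert err isa ArgumentError

# Broadcasting over a plain Tuple does not throw; f is applied once per field.
@assert f.((obs.x, obs.y)) == ([2, 3, 4], [5, 6, 7])

# `map` accepts both containers and keeps the field names.
@assert map(f, obs) == (x = [2, 3, 4], y = [5, 6, 7])
```

This is why some dispatch on the container type returned by `getobs` seems necessary: the tuple case runs but treats fields as if they were observations, while the named tuple case errors outright.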
`NamedTupleData` exists mostly for the case where you have one part of the data that is inexpensive to load, and one that is not. Example:

```julia
images = mapobs(FileIO.load, imagefiles)  # expensive to load
labels = mapobs(filename, imagefiles)     # not expensive to load
data = (images, labels)
```
Now we want to look inside `data` at the labels to find the unique label values that exist. The naive approach is:

```julia
labels_ = mapobs(obs -> obs[2], data)
```

However, `getobs(labels_)` will now load all images just to throw them away.
`NamedTupleData` exists for this case: it lets you use `mapobs` on a data container while still being able to retrieve columns separately if the functions you use are applied column-wise anyway.
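The cost difference between the two access patterns can be sketched in plain Julia with a load counter. This is an illustration only: `load_image` and `label_of` are hypothetical stand-ins for `FileIO.load` and `filename`, and the file names are made up.

```julia
# Counter for how often the "expensive" loader runs.
nloads = Ref(0)
load_image(path) = (nloads[] += 1; "pixels of $path")  # hypothetical stand-in for FileIO.load
label_of(path) = first(split(path, '_'))                # hypothetical stand-in for `filename`

imagefiles = ["cat_1.png", "dog_1.png", "cat_2.png"]

# Naive: build each full (image, label) observation, then keep only the label.
naive = [(load_image(p), label_of(p))[2] for p in imagefiles]
@assert unique(naive) == ["cat", "dog"]
@assert nloads[] == 3  # every image was loaded and then discarded

# Column-wise: touch only the label column.
nloads[] = 0
columnwise = [label_of(p) for p in imagefiles]
@assert unique(columnwise) == ["cat", "dog"]
@assert nloads[] == 0  # no image was loaded
```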
But if your container is a (named) tuple, why not just do `mapobs(f, data[2])`?