MLUtils.jl

`mapobs` doesn't work with named tuples

Open darsnack opened this issue 2 years ago • 5 comments

If getobs returns a named tuple, then mapobs will fail.

darsnack • Apr 05 '22 18:04

Closing, since I guess the solution here is to pass f as a named tuple to mapobs?

darsnack • Apr 05 '22 18:04

The current mapobs story is very confusing. mapobs(fs::NamedTuple, data) applies each function f in fs to every observation of data. Why does NamedTupleData exist? Can't it just be (; (k => mapobs(f, data) for (k, f) in pairs(fs))...)?

This is separate from the original issue, which is that mapobs(f, data) fails when getobs is called with an AbstractVector of indices and getobs(data, ...) returns a named tuple. This happens because mapobs attempts to broadcast f over the named tuple. Instead, there needs to be some indirection based on the return type of getobs(data, ...) to handle the tuple and named tuple cases.
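The broadcasting failure can be reproduced without MLUtils. The following is a minimal sketch that assumes only Base Julia; `batch_obs` and `f` are illustrative names and are not taken from the library.

```julia
# A batch shaped like what getobs could return for a named-tuple container
# (hypothetical field names, chosen for illustration).
batch_obs = (x = [1, 2, 3], y = [10, 20, 30])

# f expects a single observation, i.e. a named tuple with scalar fields.
f = obs -> obs.x + obs.y

# Broadcasting over a NamedTuple is reserved in Base and throws an ArgumentError.
broadcast_threw = try
    f.(batch_obs)
    false
catch err
    err isa ArgumentError
end
```

The error comes from Base itself rather than from `f`, so any function broadcast over a named tuple fails in the same way.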

darsnack • Apr 05 '22 19:04

Maybe something like this would work:

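using MLUtils: MLUtils, getobs, batch
# Single index: apply f to exactly one observation.
Base.getindex(data::MLUtils.MappedData, idx::Integer) = data.f(getobs(data.data, idx))
# Vector of indices: map one observation at a time, then collate with batch.
Base.getindex(data::MLUtils.MappedData, idxs::AbstractVector) =
    batch(map(Base.Fix1(getindex, data), idxs))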
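To illustrate the idea of mapping one index at a time and then collating, here is a self-contained sketch. `toy_getobs`, `toy_batch`, and `mapped_getobs` are hypothetical stand-ins written for named tuples of vectors; they are not the MLUtils implementations.

```julia
# Hypothetical stand-in for getobs: take observation i from every field.
toy_getobs(data::NamedTuple, i::Integer) = map(v -> v[i], data)

# Hypothetical stand-in for batch: collate a vector of named tuples field by field.
toy_batch(obs::AbstractVector{<:NamedTuple}) =
    (; (k => [o[k] for o in obs] for k in keys(first(obs)))...)

data = (x = [1, 2, 3], y = [10, 20, 30])
f = obs -> (x = obs.x * 2, y = obs.y)   # operates on a single observation

# Apply f one index at a time, then collate the mapped observations.
mapped_getobs(f, data, idxs) = toy_batch(map(i -> f(toy_getobs(data, i)), idxs))

result = mapped_getobs(f, data, [1, 3])
```

Because `f` only ever sees a single observation, this approach does not depend on whether the container returns arrays, tuples, or named tuples for a batch of indices.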

darsnack • Apr 05 '22 20:04

NamedTupleData exists mostly for the case where one part of the data is inexpensive to load and another is not. Example:

using FileIO, MLUtils

images = mapobs(FileIO.load, imagefiles)  # expensive to load
labels = mapobs(filename, imagefiles)  # not expensive to load

data = (images, labels)

Now we want to look inside data at the labels to find the unique label values that exist. The naive approach is:

labels_ = mapobs(obs -> obs[2], data)

However, getobs(labels_) will now load all images just to throw them away.

NamedTupleData exists for this case: it lets you use mapobs on a data container while still being able to retrieve columns separately, provided the functions you use are applied column-wise anyway.
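The cost difference can be made concrete with a toy model. This is a sketch under stated assumptions: `LazyMap`, `expensive_load`, `label_of`, and the file names are all hypothetical, and a counter stands in for real image loading.

```julia
# Hypothetical lazy container: calls f only when an observation is requested.
struct LazyMap{F,D}
    f::F
    data::D
end
Base.getindex(m::LazyMap, i::Integer) = m.f(m.data[i])
Base.length(m::LazyMap) = length(m.data)

files = ["cat_1.png", "dog_2.png", "cat_3.png"]   # made-up file names
loads = Ref(0)                                     # counts simulated image loads

expensive_load(path) = (loads[] += 1; "pixels of $path")
label_of(path) = first(split(path, '_'))

images = LazyMap(expensive_load, files)
labels = LazyMap(label_of, files)
data = (images, labels)

# Naive: build each full observation, then keep only the label.
naive = [(data[1][i], data[2][i])[2] for i in 1:length(files)]
naive_loads = loads[]

# Column-wise: index the label container directly, so no image is loaded.
loads[] = 0
direct = [data[2][i] for i in 1:length(files)]
direct_loads = loads[]
```

Both approaches yield the same labels, but the naive one triggers one simulated load per observation while the column-wise one triggers none.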

lorenzoh • Apr 06 '22 12:04

But if your container is a (named) tuple, why not just do mapobs(f, data[2])?

darsnack • Apr 06 '22 17:04