DimensionalData.jl
`AbstractDimStack <: AbstractArray`
The idea came up trying to write rand samplers for AbstractDimStack.
Why isn't an AbstractDimStack an AbstractArray{<:NamedTuple}?
With rand, we would always want the sampler to return a NamedTuple of values, one from each layer. But what does that mean with mixed-size layers? Currently, indexing a DimStack with a linear Int doesn't do anything sensible - it just indexes every layer at the same index in a map. We should instead treat a DimStack exactly like the DimTable it can be unrolled into. Linear indexing a DimStack should always give the same value as indexing a row of its DimTable - that is, it really indexes with:
getindex(st::AbstractDimStack, i::Int) = st[DimIndices(st)[i]...]
I had actually forgotten that it doesn't do that already and consider it almost a bug. When all layers are the same size there will be no difference, and we can still just use regular linear indexing on all layers.
With that change DimStack can basically be an AbstractArray. This means rand will work already without adding a sampler, and broadcast and iteration will also "just work", closing #412.
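To make the proposal concrete, here is a minimal sketch of the intended semantics, using only indexing that already works today (the proposed_getindex helper is hypothetical and just restates the definition above):

```julia
using DimensionalData

x, y = X(1:2), Y(1:3)
st = DimStack((a = DimArray(rand(2, 3), (x, y)),
               b = DimArray(rand(2, 3), (x, y))))

# The definition above: linear indexing follows the DimTable row order
# and returns a NamedTuple with one value per layer.
proposed_getindex(st, i::Int) = st[DimIndices(st)[i]...]

proposed_getindex(st, 4)  # (a = ..., b = ...), matching the corresponding row of DimTable(st)
```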
The main breaking change is to map, which currently works over the stack's Array layers rather than NamedTuple values. We will need to replace it with e.g. maplayers. This will need a lot of changes here and also in Rasters.jl, and probably in other packages too.
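A hedged sketch of that breaking change (maplayers is only the name floated above, not an existing function):

```julia
using DimensionalData

x = X(1:3)
st = DimStack((a = DimArray(rand(3), (x,)), b = DimArray(rand(3), (x,))))

# Current behaviour: map applies the function to each layer array.
doubled = map(a -> 2a, st)

# Under the proposal, map would follow the AbstractArray interface and
# receive NamedTuples instead, so the layer-wise version would move to
# a new function, e.g. the hypothetical maplayers:
# doubled = maplayers(a -> 2a, st)
# sums    = map(nt -> nt.a + nt.b, st)   # element-wise over NamedTuples
```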
Any thoughts @sethaxen @felixcremer @JoshuaBillson ?
I largely support the idea. However, I think this raises a few questions:
- Would indexing into a DimStack return the dimensional values, as it does with a DimTable?
- Are DimStacks going to be stored as an array of NamedTuples, or is this simply a type-level abstraction as in StructArrays.jl? I'm concerned that the former would be highly inefficient for many operations.
- Could we use this as an opportunity to support wide stacks? The current issue is that NamedTuple has extremely long compile times for DimStacks with many layers. If we choose to represent a standard stack as an AbstractArray{<:NamedTuple}, perhaps we could represent a wide stack as an AbstractArray{<:AbstractDict}. Indexing into a stack would then be analogous to a Tables.rowtable or Tables.dictrowtable, respectively.
- A feature I've often found myself wishing for is union and intersection operators for combining stacks and arrays. For example, I could take a RasterStack representing a satellite image and a Raster containing a DEM layer and take their intersection to produce a new RasterStack with an extent equal to the area covered by both data sources. Similarly, a union operation could be used to combine two satellite images covering different geographic regions. One problem is that NamedTuples are immutable, so adding new layers on the fly is difficult. Is there any way we could support dynamically adding layers to an existing stack?
- Would broadcast operate over NamedTuples or their individual fields? I feel like the former would be tedious for many operations, like scaling all values by some constant. Also, what would happen if the broadcast function returns a NamedTuple with different fields than the input? Would we get a new DimStack with the same layers as the names in the output NamedTuple? So broadcasting the function f(x) = (sum = x.a + x.b, product = x.a * x.b) would take a stack with the layers a and b and return a new stack with layers sum and product?
- No, just the values, not the dim values.
- No, storage stays the same. It's very similar to StructArrays.jl. (It's already how it works with dim/selector indexing too, so it's a very small change.)
- This is mostly already how it works! A NamedTuple returned from indexing is not a new idea I'm adding. But in Julia 1.11 NamedTuples should compile a lot faster and reduce your wide-stack problem. If we still want Dicts we need a new stack mode that we can switch to. We could just change the new AbstractArray T in AbstractDimStack{T} from NamedTuple to Dict (with matching internals) and dispatch on that in getindex. We would never switch completely to Dict (or Dictionary) as it's much slower for small-stack use cases.
- You can do intersect/union with crop or extend. If we had a mode switch to Dict we could do in-place adding of layers. But what are the issues with using merge to make a new stack (besides wide stacks...)?
- Broadcast would return an AbstractDimArray of your return values. You could do things like:
A = (x -> x.a * x.b).(st) .* C
Which is different to
A = st.a .* st.b .* C
Because it will handle mixed-size layers. But yes, it's a little clunky accessing nested values in broadcasts.
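For illustration, the mixed-size case spelled out with point indexing that already works today; the comprehension below just stands in for the proposed broadcast over NamedTuples (layer sizes here are arbitrary):

```julia
using DimensionalData

x, y = X(1:2), Y(1:3)
st = DimStack((a = DimArray(rand(2, 3), (x, y)),  # layer over X and Y
               b = DimArray(rand(3), (y,))))      # layer over Y only

# st[X(i), Y(j)] already returns a NamedTuple, indexing each layer with
# only the dims it actually has, so an element-wise function over the
# NamedTuples handles the mixed sizes:
A = [let nt = st[I...]; nt.a * nt.b end for I in DimIndices(st)]
```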
Probably a Dictionaries.jl Dictionary would have nicer properties as a mode switch from NamedTuple.
See: https://github.com/andyferris/Dictionaries.jl/pull/43
We really need getproperty on dictionary keys for this to work seamlessly.
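A small illustration of that wish, using Dictionaries.jl as it is today (the commented line shows the getproperty behaviour being asked for):

```julia
using Dictionaries

# One "row" of a hypothetical Dict-mode stack
row = Dictionary([:a, :b], [1.0, 2.0])

row[:a]    # getindex works today
# row.a    # getproperty on keys would let Dict-mode rows read like NamedTuple rows
```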
For InferenceObjects.jl, we mostly ignore that getindex returns a NamedTuple and primarily use DimStack as a NamedTuple of DimArrays, with the extra check that dims are consistent. The only exception is the Tables interface implementation. So I don't have a strong feeling either way, as I don't think it will affect my users much. My only concern is whether making an AbstractDimStack an AbstractArray might hit annoying defaults or method ambiguities in other packages.
> This is mostly already how it works! A NamedTuple returned from indexing is not a new idea I'm adding. But in Julia 1.11 NamedTuples should compile a lot faster and reduce your wide-stack problem. If we still want Dicts we need a new stack mode that we can switch to. We could just change the new AbstractArray T in AbstractDimStack{T} from NamedTuple to Dict (with matching internals) and dispatch on that in getindex. We would never switch completely to Dict (or Dictionary) as it's much slower for small-stack use cases.
It would be a huge relief if 1.11 can address the compilation issue with NamedTuples, as it's been one of my major performance gripes. I wasn't suggesting that we replace NamedTuple with a Dictionary entirely, only that we make it possible to substitute one for the other in order to accommodate wide stacks. But perhaps Julia 1.11 will make that unnecessary?
> Broadcast would return an AbstractDimArray of your return values. You could do things like:
> A = (x -> x.a * x.b).(st) .* C
> Which is different to
> A = st.a .* st.b .* C
> Because it will handle mixed-size layers. But yes, it's a little clunky accessing nested values in broadcasts.
That makes sense and would work well enough in general. However, would it be possible to return an AbstractDimStack if the broadcasted function returns a NamedTuple and an AbstractDimArray if it returns a scalar?
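For what it's worth, rebuilding a stack from an array of NamedTuples is mechanically straightforward; a hypothetical sketch (restack is not existing API, and a real implementation would presumably hook into the broadcast machinery instead):

```julia
using DimensionalData

x, y = X(1:2), Y(1:3)
a = DimArray(rand(2, 3), (x, y))
b = DimArray(rand(2, 3), (x, y))

# A broadcast returning NamedTuples already gives a DimArray of NamedTuples:
nts = ((av, bv) -> (sum = av + bv, product = av * bv)).(a, b)

# Hypothetical helper: split the fields back out into stack layers.
function restack(A::AbstractDimArray{<:NamedTuple})
    K = keys(first(A))
    DimStack(NamedTuple{K}(map(k -> getindex.(A, k), K)))
end

restack(nts)  # a DimStack with layers :sum and :product
```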