DimensionalData.jl
`AbstractDimStack <: AbstractArray`
The idea came up trying to write rand samplers for AbstractDimStack.
Why isn't an AbstractDimStack an AbstractArray{<:NamedTuple}?
With rand, we would always want the sampler to return a NamedTuple of values, one from each layer. But what does that mean with mixed-size layers? Currently, indexing a DimStack with a linear Int doesn't do anything sensible - it just indexes every layer at the same index in a map. We should instead treat a DimStack exactly like the DimTable it can be unrolled into. Linear indexing a DimStack should always give the same value as indexing a row of its DimTable - that is, it really indexes with:
getindex(st::AbstractDimStack, i::Int) = st[DimIndices(st)[i]...]
I had actually forgotten that it doesn't do that already and consider it almost a bug. When all layers are the same size there will be no difference, and we can still just use regular linear indexing on all layers.
With that change DimStack can basically be an AbstractArray. This means rand will work already without adding a sampler, and broadcast and iteration will also "just work", closing #412.
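To make the proposal concrete, here is a minimal sketch of the intended semantics, using only indexing that already works today (the proposed_getindex helper is hypothetical and just restates the definition above):

```julia
using DimensionalData

x, y = X(1:2), Y(1:3)
st = DimStack((a = DimArray(rand(2, 3), (x, y)),
               b = DimArray(rand(2, 3), (x, y))))

# The definition above: linear indexing follows the DimTable row order
# and returns a NamedTuple with one value per layer.
proposed_getindex(st, i::Int) = st[DimIndices(st)[i]...]

proposed_getindex(st, 4)  # (a = ..., b = ...), matching the corresponding row of DimTable(st)
```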
The main breaking change is to map, which currently works over the stack's Array layers rather than NamedTuple values. We will need to replace it with e.g. maplayers. This will need a lot of changes here and also in Rasters.jl, and probably in other packages too.
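A hedged sketch of that breaking change (maplayers is only the name floated above, not an existing function):

```julia
using DimensionalData

x = X(1:3)
st = DimStack((a = DimArray(rand(3), (x,)), b = DimArray(rand(3), (x,))))

# Current behaviour: map applies the function to each layer array.
doubled = map(a -> 2a, st)

# Under the proposal, map would follow the AbstractArray interface and
# receive NamedTuples instead, so the layer-wise version would move to
# a new function, e.g. the hypothetical maplayers:
# doubled = maplayers(a -> 2a, st)
# sums    = map(nt -> nt.a + nt.b, st)   # element-wise over NamedTuples
```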
Any thoughts @sethaxen @felixcremer @JoshuaBillson ?
I largely support the idea. However, I think this raises a few questions:
- Would indexing into a DimStack return the dimensional values, as it does with a DimTable?
- Are DimStacks going to be stored as an array of NamedTuples, or is this simply a type-level abstraction as in StructArrays.jl? I'm concerned that the former would be highly inefficient for many operations.
- Could we use this as an opportunity to support wide stacks? The current issue is that NamedTuple has extremely long compile times for DimStacks with many layers. If we choose to represent a standard stack as an AbstractArray{<:NamedTuple}, perhaps we could represent a wide stack as an AbstractArray{<:AbstractDict}. Indexing into a stack would then be analogous to a Tables.rowtable or Tables.dictrowtable, respectively.
- A feature I've often found myself wishing for is union and intersection operators for combining stacks and arrays. For example, I could take a RasterStack representing a satellite image and a Raster containing a DEM layer and take their intersection to produce a new RasterStack with an extent equal to the area covered by both data sources. Similarly, a union operation could be used to combine two satellite images covering different geographic regions. One problem is that NamedTuples are immutable, so adding new layers on the fly is difficult. Is there any way we could support dynamically adding layers to an existing stack?
- Would broadcast operate over NamedTuples or their individual fields? I feel like the former would be tedious for many operations, like scaling all values by some constant. Also, what would happen if the broadcast function returns a NamedTuple with different fields than the input? Would we get a new DimStack with the same layers as the names in the output NamedTuple? So broadcasting the function f(x) = (sum = x.a + x.b, product = x.a * x.b) would take a stack with the layers a and b and return a new stack with layers sum and product?
- No, just the values, not the dim values.
- No, storage stays the same. It's very similar to StructArrays.jl. (It's already how it works with dim/selector indexing too, so it's a very small change.)
- This is mostly already how it works! A NamedTuple returned from indexing is not a new idea I'm adding. But in Julia 1.11 NamedTuples should compile a lot faster and reduce your wide-stack problem. If we still want Dicts we need a new stack mode that we can switch to. We could just change the new AbstractArray T in AbstractDimStack{T} from NamedTuple to Dict (with matching internals) and dispatch on that in getindex. We would never switch completely to Dict (or Dictionary) as it's much slower for small-stack use cases.
- You can do intersect/union with crop or extend. If we had a mode switch to Dict we could do in-place adding of layers. But what are the issues with using merge to make a new stack (besides wide stacks...)?
- Broadcast would return an AbstractDimArray of your return values. You could do things like:
A = (x -> x.a * x.b).(st) .* C
Which is different to
A = st.a .* st.b .* C
Because it will handle mixed-size layers. But yes, it's a little clunky accessing nested values in broadcasts.
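For illustration, the mixed-size case spelled out with point indexing that already works today; the comprehension below just stands in for the proposed broadcast over NamedTuples (layer sizes here are arbitrary):

```julia
using DimensionalData

x, y = X(1:2), Y(1:3)
st = DimStack((a = DimArray(rand(2, 3), (x, y)),  # layer over X and Y
               b = DimArray(rand(3), (y,))))      # layer over Y only

# st[X(i), Y(j)] already returns a NamedTuple, indexing each layer with
# only the dims it actually has, so an element-wise function over the
# NamedTuples handles the mixed sizes:
A = [let nt = st[I...]; nt.a * nt.b end for I in DimIndices(st)]
```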
Probably a Dictionaries.jl Dictionary would have nicer properties as a mode switch from NamedTuple.
See: https://github.com/andyferris/Dictionaries.jl/pull/43
We really need getproperty on dictionary keys for this to work seamlessly.
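A small illustration of that wish, using Dictionaries.jl as it is today (the commented line shows the getproperty behaviour being asked for):

```julia
using Dictionaries

# One "row" of a hypothetical Dict-mode stack
row = Dictionary([:a, :b], [1.0, 2.0])

row[:a]    # getindex works today
# row.a    # getproperty on keys would let Dict-mode rows read like NamedTuple rows
```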
For InferenceObjects.jl, we mostly ignore that getindex returns a NamedTuple and primarily use DimStack as a NamedTuple of DimArrays, with the extra check that dims are consistent. The only exception is the Tables interface implementation. So I don't have a strong feeling either way, as I don't think it will affect my users much. My only concern is whether making an AbstractDimStack an AbstractArray might hit annoying defaults or method ambiguities in other packages.
> This is mostly already how it works! A NamedTuple returned from indexing is not a new idea I'm adding. But in Julia 1.11 NamedTuples should compile a lot faster and reduce your wide-stack problem. If we still want Dicts we need a new stack mode that we can switch to. We could just change the new AbstractArray T in AbstractDimStack{T} from NamedTuple to Dict (with matching internals) and dispatch on that in getindex. We would never switch completely to Dict (or Dictionary) as it's much slower for small-stack use cases.
It would be a huge relief if 1.11 can address the compilation issue with NamedTuples, as it's been one of my major performance gripes. I wasn't suggesting that we replace NamedTuple with a Dictionary entirely, only that we make it possible to substitute one for the other in order to accommodate wide stacks. But perhaps Julia 1.11 will make that unnecessary?
> Broadcast would return an AbstractDimArray of your return values. You could do things like:
> A = (x -> x.a * x.b).(st) .* C
> Which is different to
> A = st.a .* st.b .* C
> Because it will handle mixed-size layers. But yes, it's a little clunky accessing nested values in broadcasts.
That makes sense and would work well enough in general. However, would it be possible to return an AbstractDimStack if the broadcasted function returns a NamedTuple and an AbstractDimArray if it returns a scalar?
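For what it's worth, rebuilding a stack from an array of NamedTuples is mechanically straightforward; a hypothetical sketch (restack is not existing API, and a real implementation would presumably hook into the broadcast machinery instead):

```julia
using DimensionalData

x, y = X(1:2), Y(1:3)
a = DimArray(rand(2, 3), (x, y))
b = DimArray(rand(2, 3), (x, y))

# A broadcast returning NamedTuples already gives a DimArray of NamedTuples:
nts = ((av, bv) -> (sum = av + bv, product = av * bv)).(a, b)

# Hypothetical helper: split the fields back out into stack layers.
function restack(A::AbstractDimArray{<:NamedTuple})
    K = keys(first(A))
    DimStack(NamedTuple{K}(map(k -> getindex.(A, k), K)))
end

restack(nts)  # a DimStack with layers :sum and :product
```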