DimensionalData.jl
Supporting alternative storage with AbstractDimStack
In InferenceObjects, I'd like to define an AbstractDimStack subtype that is not strongly typed but stores strongly typed AbstractDimArrays in a dictionary, similar to how DataFrame is not strongly typed but its columns are.
Surprisingly, this mostly works. But there are a number of places where methods defined on AbstractDimStack assume its fields, and assume its layers are stored in a NamedTuple. It would be handy to eliminate these assumptions so that other storage types can be used for AbstractDimStacks.
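A minimal sketch of the kind of subtype described above (all names here are hypothetical stand-ins, not InferenceObjects' actual implementation):

```julia
# Hypothetical sketch: a stack that is not parameterised on its layers but
# stores strongly typed arrays in a Dict, the way a DataFrame stores strongly
# typed columns. Names are illustrative, not a real API.
struct DictDimStack
    layers::Dict{Symbol,AbstractArray}  # concrete array types live in the values
end

# Adding a layer mutates the Dict; typeof(stack) never changes, so methods
# dispatching on DictDimStack are not recompiled when a layer is added.
addlayer!(s::DictDimStack, name::Symbol, A::AbstractArray) = (s.layers[name] = A; s)
```

Retrieving a layer then dispatches on the concrete array type stored in the Dict, just as with DataFrame columns.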
I've meant to add something like that for a while. Maybe we can add layers to the abstract type or use a trait.
What were you thinking of changing?
> Maybe we can add layers to the abstract type or use a trait.
Can you elaborate? I don't follow.
> What were you thinking of changing?
The 2 main reasons to do this are:
- to reduce compile-time lag when a user works with an AbstractDimStack. Right now, if they add a new layer to make a new DimStack, all methods that dispatch on that object will need to be compiled again, since the type has changed. This causes a noticeable lag on virtually all function calls.
- to allow the user to interactively build the stack.
The 2nd reason motivates an object that stores only the metadata and a dictionary mapping layer names to layers. dims, refdims, layerdims, and layermetadata would not be assumed to be fields of the stack, because it makes more sense to compute them when requested. Similarly, they would not be keyword arguments to rebuild methods that take an AbstractDimStack.
While it might make sense to merge a NamedTuple and a DimStack, this doesn't necessarily make sense for an AbstractDimStack.
And as I mentioned, any places that assume one can map over the layers or their keys need to be changed, since map is not defined for dicts or sets.
There are probably more things to change that we would discover with an example implementation of such an object.
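As a toy example of computing stack-level properties on request rather than storing them as fields (ToyArray and stackdims are invented stand-ins, not DimensionalData types):

```julia
# Invented stand-ins for illustration only.
struct ToyArray
    dims::Tuple{Vararg{Symbol}}  # stand-in for DimensionalData dimensions
    data::Array
end

struct ToyDictStack
    layers::Dict{Symbol,ToyArray}
end

# Computed when requested instead of stored as a field: the union of all
# layer dimensions. A real implementation would also check compatibility.
function stackdims(s::ToyDictStack)
    out = Symbol[]
    for A in values(s.layers), d in A.dims
        d in out || push!(out, d)
    end
    return Tuple(out)
end
```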
That all sounds good to me, I have the same reasons for wanting that for RasterStack, I just never had time to implement it.
> Can you elaborate? I don't follow.
Both DimStack and AbstractRasterStack in Rasters.jl need the current behavior, and I would prefer not to duplicate the code. But we could drop it down to an intermediate abstract type, e.g. AbstractImmutableDimStack, and define AbstractMutableDimStack for your needs.
Another (maybe better) option is using a trait like stackmode(x) = IsMutable() for this distinction; then we are not tied to the type hierarchy, so e.g. a RasterStack could have either behavior depending on whether it wraps a NamedTuple or a Dict (sometimes it wraps something else, so those types alone are not enough of a distinction).
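A minimal sketch of how such a trait could work (every name here is hypothetical):

```julia
# Hypothetical trait distinguishing storage behaviour, independent of the
# type hierarchy: the trait follows what the stack wraps.
abstract type StackMode end
struct IsMutable <: StackMode end
struct IsImmutable <: StackMode end

struct ToyStack{T}
    layers::T
end

stackmode(::ToyStack{<:Dict}) = IsMutable()
stackmode(::ToyStack{<:NamedTuple}) = IsImmutable()

# Methods branch on the trait rather than on the stack type:
setlayer(s::ToyStack, name, A) = setlayer(stackmode(s), s, name, A)
setlayer(::IsMutable, s, name, A) = (s.layers[name] = A; s)
setlayer(::IsImmutable, s, name, A) =
    ToyStack(merge(s.layers, NamedTuple{(name,)}((A,))))
```

The mutable path mutates in place; the immutable path returns a new stack, as current NamedTuple-based stacks do.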
The benefit of the NamedTuple approach is that indexing the whole stack is very fast, with very clean code, for applications that use that (like point data extraction in Rasters.jl). The benefit of the Dict approach is that loading the first time is fast, and you can build iteratively. It's pretty similar to the split between e.g. DataFrames.jl and TypedTables.jl.
We should be able to have both options here, and switch between them easily.
That could be something like mutable(stack) and immutable(stack) methods to switch modes. In some Rasters.jl algorithms I would do that before running them and switch back at the end, because they rely on fast map and similar things in the hot paths.
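Switching modes could be as simple as converting the storage; a sketch under that assumption (function names are hypothetical, with plain collections standing in for stack storage):

```julia
# Hypothetical mode-switching: convert the storage rather than the stack type.
mutable_layers(nt::NamedTuple) = Dict{Symbol,Any}(pairs(nt))
immutable_layers(d::AbstractDict{Symbol}) = NamedTuple(d)

d = mutable_layers((a = [1, 2], b = [3, 4]))  # switch to Dict for cheap edits
d[:c] = [5, 6]                                # insert with no recompilation
nt = immutable_layers(d)                      # freeze again for fast map/indexing
```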
As I suggested a while ago, the conflation of Array/NamedTuple behaviour in DimStack is all up for debate, and could be separated into separate objects we switch between with methods. So feel free to play with ideas.
Just with the caveat that I will still need to access the behaviors we have now in some contexts.
One other point to make here is that compile time of DimStack was never really optimized. I think a lot could be done to reduce it.
The fields could mostly be Tuple rather than NamedTuple: we currently store the same keys in multiple places, when we could store them once and wrap values as NamedTuple in the getter methods.
That would nearly halve the size of the type, which should help a lot.
Doing some snoop profiling, there are some huge compilation bottlenecks in functions like e.g. uniquekeys recompiling for every stack size. We can fix a lot of these with `@nospecialize` and using vectors, and get much better performance with NamedTuple stacks.
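The pattern being described might look something like this (a guess at the idea only, not DimensionalData's actual `uniquekeys`):

```julia
# Sketch of the pattern: `@nospecialize` stops Julia compiling a fresh method
# for every argument type, and returning a Vector{Symbol} keeps the result
# type independent of the number of layers (unlike an NTuple of Symbols).
function uniquekeys_sketch(@nospecialize(names))
    out = Symbol[]
    for n in names
        s = Symbol(n)
        s in out || push!(out, s)
    end
    return out
end
```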
I'll PR this soon.
There will still always be the "gradually building a stack" reason to use a Dict, and it's worth having just for that. But it would be best if performance wasn't the reason for the choice.
> There will still always be the "gradually building a stack" reason to use a Dict, and it's worth having just for that. But it would be best if performance wasn't the reason for the choice.
Agreed, yes! In fact, merging stacks works so well that if compilation wasn't such a big issue, getting a Dict-based stack would be pretty low priority.
Some questions for optimisation:
- How many layers are in your stacks?
- What things are the slowest currently?
- What's an acceptable first construction time?
~~Bad news is just creating the NamedTuple the first time can be half a second, and I can't optimise that. The good news is it's more than everything else.~~
Actually, we can get construction down to 0.2 seconds even for 300 layers by avoiding making any complicated NamedTuples.
> How many layers are in your stacks?
I have no control over this, as the number of layers is equivalent to the number of parameters in the user's statistical model, which can be arbitrarily many; I would be surprised if it was as many as O(100). The examples we use for demo purposes have at most 16 layers.
> What things are the slowest currently?
I haven't profiled this. I notice considerable lag the first time I perform basically every operation (we do a lot of mapping over the layers to reduce them). Also, the show method is quite slow.
> What's an acceptable first construction time?
I guess it depends on the size of the stack. If a small stack (a handful of small arrays) takes seconds to construct, that's too slow. But the user probably notices the other lags more than construction lag (formatting the arrays correctly as input to the NamedTuple is probably more expensive than the NamedTuple constructor), since that happens during the interactive session.
> Bad news is just creating the NamedTuple the first time can be half a second, and I can't optimise that.
Is that the current state of things? Half a second is probably not awful. But e.g. if a user merges two DimStacks, which merges the layers, does the construction of that NamedTuple also take half a second?
> The good news is its more than everything else.
:tada:
I've just looked into this more, and the main problem is when we make a NamedTuple of DimArrays.
Currently everything goes through the DimStack constructor for a NamedTuple of DimArray layers.
But this is a very complicated type, with dimensions for every single array. We can skip around it and construct the DimStack directly with the final NamedTuple{<:AbstractArray} from a Vector{<:DimArray}, paying most of the compilation cost only once for any number of layers.
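The idea can be illustrated with plain arrays standing in for DimArrays (a sketch of the approach, not the actual constructor code):

```julia
# Each layer's full type would otherwise appear in an intermediate
# NamedTuple type, which grows with every layer; a Vector of layers has
# one type for any layer count. Building the final NamedTuple from the
# Vector in one step skips the complicated intermediate type entirely.
build_layers(names::Vector{Symbol}, arrays::Vector) =
    NamedTuple{Tuple(names)}(Tuple(arrays))
```

Compilation for the Vector-based processing is then paid once, regardless of how many layers the stack has.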
Reducing the complexity of the DimStack type was one reason the design is how it is (with dims separate from the arrays); I just wasn't thinking about the fact that we were generating that type complexity on the way to constructing the final object anyway, when we didn't have to.
For only 16 layers things should be pretty fast after these changes. I'm getting 0.06 seconds for the first run of DimStack, 0.0001 for the second run.
map also goes through NamedTuple, but we can skip around that too: map over tuples, wrapping the result as a NamedTuple.
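That could be as simple as the following sketch (`maplayers` is a hypothetical name):

```julia
# Map over the values Tuple and rewrap, so `map` itself only ever
# specialises on Tuples, not on the full NamedTuple type with its names.
maplayers(f, nt::NamedTuple{K}) where {K} = NamedTuple{K}(map(f, values(nt)))
```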
But I do find it odd that NamedTuple is so much slower to compile than Tuple.