DimensionalData.jl
DimensionalData.jl copied to clipboard
Check that lookup properties match in broadcasts
@sethaxen and @felixcremer you may have opinions on this, as like cat this may break some downstream code.
I'm hitting annoying issues broadcasting rasters, some with the Y dimension reversed, some without (reversed Y is very common in spatial data). I shouldn't be able to broadcast them together when the order is reversed, but I can, because only the dimension type is checked with comparedims during broadcasts. When there is a lot of data handling code it can be hard to track down where this happens.
Checking order before broadcasting would reduce the potential for these bugs. But as broadcast can't have options passed to it easily, we need to make sure any changes to it are not debilitating for anyone.
So I am proposing:
- Prevent any broadcast where one dimension is
ForwardOrderedand the other isReverseOrdered. This doesn't seem controversial to me, and is a compile time comparison. It would be achieved with anorder=truekeyword tocomparedims.
Other possibilities that also make sense:
- Prevent broadcasts between any
OrderedandUnordered- this could be assumed with theorderkeyword, so we need to decide clearly either way. - Prevent broadcasts between
CategoricalandSampled. Probably this never makes sense to do. - Prevent broadcasts between
RegularandIrregular. Probably this also never makes sense to do, and catches different valued lookups without needing a runtime check.
A stronger check that I'm hesitant to do:
- Prevent broadcasts where the lookup values do not match and have the same
Locus. This one has a runtime cost, and potentially more serious problem where floating point error in ranges converted to vectors causes false==results in comparisons. That would be pretty annoying, so I'm hesitant to do this. Possiblyisapproxwould be more appropriate, but again there is no way for users to set the allowed tolerance. WithRegularlookups we could also just check that the start/stop values match to reduce the run-time cost, with Irregular we would need to check all values to get the same guarantee.
Things that should work but need better return type rules
Currently broadcast return types are determined by whichever is found first. Instead we need clear rules.
- Broadcasts between
PointsandIntervals- should always returnPoints? We can assume the interval value is at least a good approximation for the point, but we do not know if the point value is a good approximation for the whole interval. - Broadcasts between
NoLookupandSampledorCategoricalshould always returnSampledorCategorical?
To be clear everything should always be able to broadcast with NoLookup if the dimension itself matches.
Also if users want to break any of these rules, they can broadcast over parent(dimarray) for the second/third etc array and get the current behavior.
Thinking about this we should probably start by adding a warning for these things.
Thinking about this we should probably start by adding a warning for these things.
Agreed with this! RE options 1-4, I'm still thinking through whether these would cause any problems for my use cases. I agree option 5 seems too extreme.
I am currently working through the xarray 45 minute tutorial and realized, that in DimensionalData we would happily broadcast two arrays together which happen to have the same size in the dimensions with totally different dimension values. Could we make it so, that we could specify the behaviour globally in some dictionary so that people could change the default behaviour without having to specify it on every broadcast.
I think we could have a global Dict from which we would get the keyword arguments for the comparedims function so that we could change that from the user side. Similar to the setup of the YAXDefaults in YAXArrays https://github.com/JuliaDataCubes/YAXArrays.jl/blob/b4bf44866fbd9c5fe1a78f42d3ff0b454a4102a2/src/YAXArrays.jl#L12C1-L20C2
Yeah this has been a thought for a while. Currently we check the name and order but not the values. And sometimes you don't want to check the values (e.g. with NoLookup or with slight floating point differences).
So the global setting makes sense, it would be hard to incorporate into broadcast syntax.
It may be faster to have a Ref{NamedTuple{names,Tuple{Bool,Bool,etc}}} instead of a Dict so that access is faster, as we will always have the same set of options.
(We could also allow "error levels" like allow, warn, and error for when problems are found)