AxisArrays.jl
AxisArrays.jl copied to clipboard
Use value indexing by default
In the single use case I have for AxisArrays
right now, it would be very convenient if the default indexing of an AxisArray
used the values of the indices instead of the indices of the indices. E.g. if I do
x = AxisArray(Axis{:x}(-3:3))
x[[-2,1]]
it would be handy if it returned the second and fifth element of the AxisVector
. You can always refer to the indices of the indices by using the Axis
type. Could this work?
I've started hacking on this on a fork here.
I'm coming to the conclusion that supporting both by-axis-value and by-axis-index within the A[]
is fraught with trouble. I think it's crucial to look at prior art here and decisions others have made.
Pandas used to support four different ways of indexing into their dataframes: df[]
, df.ix[]
, df.iloc
, and df.loc
. iloc
is by index and loc
is by value. ix
used to do both, changing based upon the type of the index. They found this too confusing and decided to deprecate it completely. The more similar project for Python, xarray
, has been developed more recently, and they, too, decided to completely detangle things here: http://xarray.pydata.org/en/stable/indexing.html.
If we decide to completely detangle these things we have two decisions to make:
- What should the default
A[]
syntax do? Indices or labels? - How should we toggle between the two? We have several options here: it could be per-dimension with a macro or wrapper (like
@v
), or now that we havegetproperty
we could do the pythonic thing and use a "index accessor" property likeA.iloc[]
. The downside there is that it applies to all dimensions. When I dealt with the Pandas deprecation ofix
, I found that it was most commonly used in cases likedf.ix['label', 0]
— that is, one dimension by label and one dimension by offset.
I'm an user of AxisArrays and I like its approach a lot. On the other hand I'm just an average user with little knowledge of the inner functionality, so regarding the architectural decisions i fully trust the people like M. Bauman and Tim Holy. This post is only to show how I'm using them and why I find this package great.
I'm almost always using the AxisArrays package in combination with Unitful or with Images. In both cases I use AxisArray as a way to annotate physical meaning of the Integer indexes and dimensions. The huge advantage for me is to use AxisArray for steering the dispatch into dedicated functions, and the dispatch is really pleasure to write thanks to rdeits suggestion in discourse with constructs like
const QAxis{Name, Dim} = Axis{Name, AX} where AX <: AbstractArray{Q} where Q <: Dim
const TimeAxis = QAxis{S, Unitful.Time} where S;
const TimeArray{T, N, V} = AxisArray{T, N, V, <:Tuple{TimeAxis, Vararg{Axis}}};
const QAxisW{Name, Dim} = Axis{Name, AX} where AX <: Range{Q} where Q <: Dim
const TimeAxisW = QAxisW{S, Unitful.Time} where S;
const TimeWaveformArray{T, N, V} = AxisArray{T, N, V, <:Tuple{TimeAxisW, Vararg{Axis}}};
const TimeWaveform{T, V} = TimeWaveformArray{T, 1, V}
const EnergyAxis = QAxis{S, Unitful.Energy} where S;
const EnergySpectrumArray{T, N, V} = AxisArray{T, N, V, <:Tuple{EnergyAxis, Vararg{Axis}}};
and this definitions I use to define specialized functions - some of them expecting Axis of StepRange kind (FIR, image processing, ...), other can work with arbitrary axis values (after e.g removal of outliers) such as plotting. I know I can define my own type and do the dispatch on its type - but this way it is for free, all playing just well with the rest of Julia.
A second usecase is when I use it instead of DataFrames - just that accessing any column of my measurement dataset is so simple, my main axis (time, or energy, or length) is still present and propagated - but I can pass it to any function expecting an AbstractArray and it just works without explicitly writing any conversions, any macro, just pure math-like syntax.
So my opinion is:
- AxisArray is a AbstractArray thus default indexing without axis specification
arr[1, 1], arr[:, 1], arr[1:end-5, 1]
should be of the underlying array (I've even an use case I use a sparse array for spectrum representation in a huge datacube - works great!) - AxisArray is accessed by "dimension name" be it
:Red, :time
-> kind of DataFrames-light use case - AxisArray is accessed by "dimension name" and an Integer index -> again index into the underlying array
- Any access to index or range through the Axis mapping is an opt-in for methods that expect and rely on the added information provided by the Axes, can dispatch only on a specific type of Axis and can opt-in a different way of indexing, e.g. what you propose with index accessor
arr.ax[:time, 5u"ns"], arr.ax[5u"ns"], arr.ax[5u"ns"..15u"ns"]
for explicit axis at given value, implicit axis on value and for implicit axis by range.
My feeling is, the three characters .ax
to make sure indexing is by label is wort typing and not causing big pain for readability.
But again, don;t put too much weight to my examples and suggestions.
Since the upgrade-to-0.7 train is at full speed now and the exported name axes
conflicts with Base
, this issue has gone critical. Time to make a decision.
My proposal is to orthogonalize the concepts as much as possible, using the following new packages:
- NamedAxisIndexing: all this does is map a name to a dimension slot. It's effectively like Base.PermutedDimsArray except the axis names can be anything. Any other behavior is a property of the object you wrap. This package could have one wrapper
NamedAxisArray <: AbstractArray
(for integer-indexing) and anotherNamedAxisDict <: AbstractDict
(for value indexing). (pity we don't have traits, then we could have just one wrapper...) - RangedAxisArrays: this adds range metadata to each axis. These are still AbstractArrays and you index them with integers. The axis metadata is indexed by slot number rather than symbol or anything else.
- People who want to index by value can create a new package or standardize on one of the others (AssociativeArrays, NamedArrays, DataCubes, etc.)
Either of the latter two can be combined with NamedAxisIndexing at the user's or package author's option. That way everyone gets to leverage common functionality without there being so much angst over what things should actually mean.
We might need a small glue package (AnnotatedAxisArrays?) that defines trivial implementations of core trait functions that all of these packages can extend for their specific types.
I'm not sure what the difference between NamedAxisArray
and NamedAxisDict
is. Are these for the data arrays or for axes?
The key point is the subtype relationship, NamedAxisArray <: AbstractArray
. If you create a container that indexes by axis-range values (e.g., a[3.2s, 4.4mm]
), it's not so obvious that you should be calling it a subtype of AbstractArray
since we make the assumption in lots of places that you can index them by integers. But you could call that kind of object an AbstractDict
.
It sounds like NamedAxis*
are for mapping dimension names to dimension numbers only. In that case, shouldn't the wrapper not care how indexing works on the wrapped structure, and just generated reordered calls to getindex and setindex? Or is there extra functionality in this hypothetical package that I'm missing?
You have to make a decision about subtyping when you define the structure:
struct NamedAxisArray{T,N,A<:AbstractArray,S<:NTuple{N,Symbol}} <: AbstractArray{T,N}
data::A
end
# Anything that isn't an Array but supports getindex/setindex!
struct NamedAxisDict{K,V,D,S<:Tuple{Vararg{Symbol}} <: AbstractDict{K,V}
data::D
end
but the implementations of getindex
and setindex!
are identical for both types. It's really just so that the wrapped objects have the appropriate isa
relationships.
Like I said, if we were trait-based rather than inheritance-based then this wouldn't be an issue, since the isarray
trait could be defined isarray(W::NamedAxisWrapper) = isarray(W.data)
.
Hmm, though I realized there's still nothing to prevent someone for asking for value-based lookup with a RangedAxisArray
. Really, is there anything wrong with things the way they are now? We have atvalue
and atindex
, these do allow one to index by value unambiguously and mix integer-indexing and value-indexing freely. Now I'm a little less certain we need to do the split in https://github.com/JuliaArrays/AxisArrays.jl/issues/84#issuecomment-402784324.
Overall I have the perception seems that existential angst about this package not quite being right is holding it back. Personally, I don't share this angst because I really do think this package should just be an axis metadata wrapper.
In my mind AxisArrays has these basic levels of processing (let me know if I'm missing anything):
A. Index into an AxisArray with integers/ranges: - like a normal AbstractArray
B. Index into an AxisArray with positional indexes: - translate value indexes into integer/range indexes, leave non-value indexes untouched - index into the AxisArray with the results using A
C. Index into an AxisArray with named indexes:
- translate from names to positions, fill the remaining indices with :
- index into the AxisArray with the results using B
I see B as handled with axisindices
right now, which uses Dimensional
and Categorical
traits to try and guess at whether to return arrays or ranges of indices, and then perform the translation using code that either assumes the axis is sorted or doesn't. I think the main complication in figuring out how to structure AxisArrays is capturing this behaviour, and abstracting this procedure out is the key to simplifying AxisArrays.
Does that sound right?
I agree with most of that, but I'd say that C (named axes) is actually orthogonal to B (value-indexing), in the sense that you can use integer/range indexing with named axes.
I agree, that makes sense
Matt has pointed out that sometimes AxisArrays.to_index will produce AxisArray{T, N} indexes sometimes (where N>=2) as in the README example. Since the output dimensionality and axis names depend on the index dimensionality and type, I think this means that named indexing and at least the indexing-by-value interface need to live in the same package.
Following the lead of a proposal I just made for Interpolations, perhaps we should make A[i...]
be "normal" indexing and A(position...)
be value-indexing. For handling a mix, we could export a Index
wrapper and use A(3.2mm, Index(5), 8.4s)
as an interface.
The advantage here is that AxisArrays could continue to be "normal" AbstractArray
s, using integer indexing and where axes
(the one in Base, not the one here) returns an AbstactUnitRange{<:Int}
.
However, other than obviating the use of end
and similar constructs (though we could use https://github.com/JuliaArrays/EndpointRanges.jl), it would offer nice syntax for value-indexing, too. Perhaps the new trait name of AxisArrays.axes
should be valueaxes
?
I find the function call interface to make much more sense for interpolations (where it feels like I'm actually executing a calculation to determine the value) vs. here (where it feels more like lookup). I know the distinction there is rather fuzzy, but I find it meaningful. I'd argue that the tension we feel here is because we're trying to be two data structures at once, both a [Sorted|Ordered]CartesianDict with named dimensions and an array with named and valued axes. While we could move the by-value indexing behind a function call interface, I'm more tempted to rip it out altogether and move it into a CartesianDict
package. The dictionary package could be a wrapper around this one, just exposing a different API. It'd effectively be the dense/rectangular version of JuliaDB's NDSparse.
I don't think @andyferris follows this repository, so I'm pinging you because I know you've put in lots of effort on blurring the lines between dictionaries and arrays. I'd be interested in hearing your thoughts here.
Thanks @mbauman for the link.
Yes, this is very interesting. I like the way this mixes up arrays, offset arrays, generalizations of dictionaries and Cartesian indices all in the one place :). A few (some of them obvious) things:
- If you want to not quite follow the
AbstractArray
interface, that's great, just don't subtypeAbstractArray
. We can make a new supertype with its own interface; things likebroadcast
can be extended these days pretty easily. In this case you could use something likeparent(A)[i...]
for array-style indexing, no? - It seems challenging to have more than one kind of keys for a container. Especially when you throw in multidimensional indexing into the mix. What exactly should
keys
return? Already, I'm pretty confused by having both linear and Cartesian indices inAbstractArray
but having arbitrary column labels (which themselves may be containers) seems like it could lead to complexity, fast. - If you do have more than one "kind" of key, to limit complexity it would seem nice to have a very simple transformation between them. For example
A[1, 2]
being equivalent toA[CartesianIndex(1, 2)]
, or something like that, havingkeys
return a container ofCartesianIndex
- this would seem entirely self-consistent (the first transformation is predictable syntactic sugar). There's also the newA[a = 1, b = 2]
syntax available for use soon, so a similar transformation for named axes seems quite reasonable. - When the transformation is more complex, I'd tend to use a different generic function to
getindex
, or it gets hard to write down and understand all the guarantees thatkeys
andgetindex
should provide. - For
AbstractDict
s (and in general, any containers with keys that themselves could be containers) I feel we will need to splitgetindex
andgetindices
into two generic functions and syntaxes. My understanding is that @mbauman is on the same page here, and theA.[indices]
syntax should be free for the taking in v1.x? - I'd recommend investigating a new interface for fast iteration over the elements of one or more idexable containers (suitable for dictionaries, etc). Obviously you want your axis arrays to be fast.
AbstractArray
s useeachindex
to return the faster of Cartesian indices or linear indices for using in your loops. Dictionaries may need tokens, broadcasting multiple over axis arrays with the same axes should use the integer indices rather than symbols, etc. ForAbstractDict
this is complicated by the fact that your fast "token" type could also be a valid index... thus I would suggest something along the lines ofeachtoken
andgettoken
, or whatever. - The function call interface is interesting. These days I'm basically confused why arrays/dictionaries/etc aren't just maps from index to value (i.e. functions). I can only say that when I switched from MATLAB to Julia I found the Julia code easier to read because of the distinction between
[]
and()
.
I haven't had the time to experiment with any of this lately - but what I'd like to see is basically a dictionary that can be as fast as an array, and supporting multidimensional indices (via positional or named axes - I'm hoping that Symbol
for axis names should be sufficient).
It's been almost 3 years since this first opened - would it be safe to say that indexing by integers will remain a special case for AxisArrays?