AxisArrays.jl icon indicating copy to clipboard operation
AxisArrays.jl copied to clipboard

Use value indexing by default

Open andreasnoack opened this issue 7 years ago • 17 comments

In the single use case I have for AxisArrays right now, it would be very convenient if the default indexing of an AxisArray used the values of the indices instead of the indices of the indices. E.g. if I do

x = AxisArray(Axis{:x}(-3:3))
x[[-2,1]]

it would be handy if it returned the second and fifth element of the AxisVector. You can always refer to the indices of the indices by using the Axis type. Could this work?

andreasnoack avatar May 17 '17 15:05 andreasnoack

I've started hacking on this on a fork here.

mlubin avatar Sep 11 '17 02:09 mlubin

I'm coming to the conclusion that supporting both by-axis-value and by-axis-index within the A[] is fraught with trouble. I think it's crucial to look at prior art here and decisions others have made.

Pandas used to support four different ways of indexing into their dataframes: df[], df.ix[], df.iloc, and df.loc. iloc is by index and loc is by value. ix used to do both, changing based upon the type of the index. They found this too confusing and decided to deprecate it completely. The more similar project for Python, xarray, has been developed more recently, and they, too, decided to completely detangle things here: http://xarray.pydata.org/en/stable/indexing.html.

If we decide to completely detangle these things we have two decisions to make:

  • What should the default A[] syntax do? Indices or labels?
  • How should we toggle between the two? We have several options here: it could be per-dimension with a macro or wrapper (like @v), or now that we have getproperty we could do the pythonic thing and use a "index accessor" property like A.iloc[]. The downside there is that it applies to all dimensions. When I dealt with the Pandas deprecation of ix, I found that it was most commonly used in cases like df.ix['label', 0] — that is, one dimension by label and one dimension by offset.

mbauman avatar Jan 18 '18 22:01 mbauman

I'm an user of AxisArrays and I like its approach a lot. On the other hand I'm just an average user with little knowledge of the inner functionality, so regarding the architectural decisions i fully trust the people like M. Bauman and Tim Holy. This post is only to show how I'm using them and why I find this package great.

I'm almost always using the AxisArrays package in combination with Unitful or with Images. In both cases I use AxisArray as a way to annotate physical meaning of the Integer indexes and dimensions. The huge advantage for me is to use AxisArray for steering the dispatch into dedicated functions, and the dispatch is really pleasure to write thanks to rdeits suggestion in discourse with constructs like

const QAxis{Name, Dim} = Axis{Name, AX} where AX <: AbstractArray{Q} where Q <: Dim
const TimeAxis = QAxis{S, Unitful.Time} where S;
const TimeArray{T, N, V} = AxisArray{T, N, V, <:Tuple{TimeAxis, Vararg{Axis}}};

const QAxisW{Name, Dim} = Axis{Name, AX} where AX <: Range{Q} where Q <: Dim
const TimeAxisW = QAxisW{S, Unitful.Time} where S;
const TimeWaveformArray{T, N, V} = AxisArray{T, N, V, <:Tuple{TimeAxisW, Vararg{Axis}}};
const TimeWaveform{T, V} = TimeWaveformArray{T, 1, V}

const EnergyAxis = QAxis{S, Unitful.Energy} where S;
const EnergySpectrumArray{T, N, V} = AxisArray{T, N, V, <:Tuple{EnergyAxis, Vararg{Axis}}};

and this definitions I use to define specialized functions - some of them expecting Axis of StepRange kind (FIR, image processing, ...), other can work with arbitrary axis values (after e.g removal of outliers) such as plotting. I know I can define my own type and do the dispatch on its type - but this way it is for free, all playing just well with the rest of Julia.

A second usecase is when I use it instead of DataFrames - just that accessing any column of my measurement dataset is so simple, my main axis (time, or energy, or length) is still present and propagated - but I can pass it to any function expecting an AbstractArray and it just works without explicitly writing any conversions, any macro, just pure math-like syntax.

So my opinion is:

  • AxisArray is a AbstractArray thus default indexing without axis specification arr[1, 1], arr[:, 1], arr[1:end-5, 1] should be of the underlying array (I've even an use case I use a sparse array for spectrum representation in a huge datacube - works great!)
  • AxisArray is accessed by "dimension name" be it :Red, :time -> kind of DataFrames-light use case
  • AxisArray is accessed by "dimension name" and an Integer index -> again index into the underlying array
  • Any access to index or range through the Axis mapping is an opt-in for methods that expect and rely on the added information provided by the Axes, can dispatch only on a specific type of Axis and can opt-in a different way of indexing, e.g. what you propose with index accessor arr.ax[:time, 5u"ns"], arr.ax[5u"ns"], arr.ax[5u"ns"..15u"ns"] for explicit axis at given value, implicit axis on value and for implicit axis by range.

My feeling is, the three characters .ax to make sure indexing is by label is wort typing and not causing big pain for readability.

But again, don;t put too much weight to my examples and suggestions.

Petr-Hlavenka avatar Jan 19 '18 08:01 Petr-Hlavenka

Since the upgrade-to-0.7 train is at full speed now and the exported name axes conflicts with Base, this issue has gone critical. Time to make a decision.

My proposal is to orthogonalize the concepts as much as possible, using the following new packages:

  • NamedAxisIndexing: all this does is map a name to a dimension slot. It's effectively like Base.PermutedDimsArray except the axis names can be anything. Any other behavior is a property of the object you wrap. This package could have one wrapper NamedAxisArray <: AbstractArray (for integer-indexing) and another NamedAxisDict <: AbstractDict (for value indexing). (pity we don't have traits, then we could have just one wrapper...)
  • RangedAxisArrays: this adds range metadata to each axis. These are still AbstractArrays and you index them with integers. The axis metadata is indexed by slot number rather than symbol or anything else.
  • People who want to index by value can create a new package or standardize on one of the others (AssociativeArrays, NamedArrays, DataCubes, etc.)

Either of the latter two can be combined with NamedAxisIndexing at the user's or package author's option. That way everyone gets to leverage common functionality without there being so much angst over what things should actually mean.

We might need a small glue package (AnnotatedAxisArrays?) that defines trivial implementations of core trait functions that all of these packages can extend for their specific types.

timholy avatar Jul 05 '18 16:07 timholy

I'm not sure what the difference between NamedAxisArray and NamedAxisDict is. Are these for the data arrays or for axes?

iamed2 avatar Jul 05 '18 18:07 iamed2

The key point is the subtype relationship, NamedAxisArray <: AbstractArray. If you create a container that indexes by axis-range values (e.g., a[3.2s, 4.4mm]), it's not so obvious that you should be calling it a subtype of AbstractArray since we make the assumption in lots of places that you can index them by integers. But you could call that kind of object an AbstractDict.

timholy avatar Jul 05 '18 18:07 timholy

It sounds like NamedAxis* are for mapping dimension names to dimension numbers only. In that case, shouldn't the wrapper not care how indexing works on the wrapped structure, and just generated reordered calls to getindex and setindex? Or is there extra functionality in this hypothetical package that I'm missing?

iamed2 avatar Jul 05 '18 19:07 iamed2

You have to make a decision about subtyping when you define the structure:

struct NamedAxisArray{T,N,A<:AbstractArray,S<:NTuple{N,Symbol}} <: AbstractArray{T,N}
    data::A
end

# Anything that isn't an Array but supports getindex/setindex!
struct NamedAxisDict{K,V,D,S<:Tuple{Vararg{Symbol}} <: AbstractDict{K,V}
    data::D
end

but the implementations of getindex and setindex! are identical for both types. It's really just so that the wrapped objects have the appropriate isa relationships.

Like I said, if we were trait-based rather than inheritance-based then this wouldn't be an issue, since the isarray trait could be defined isarray(W::NamedAxisWrapper) = isarray(W.data).

timholy avatar Jul 05 '18 20:07 timholy

Hmm, though I realized there's still nothing to prevent someone for asking for value-based lookup with a RangedAxisArray. Really, is there anything wrong with things the way they are now? We have atvalue and atindex, these do allow one to index by value unambiguously and mix integer-indexing and value-indexing freely. Now I'm a little less certain we need to do the split in https://github.com/JuliaArrays/AxisArrays.jl/issues/84#issuecomment-402784324.

Overall I have the perception seems that existential angst about this package not quite being right is holding it back. Personally, I don't share this angst because I really do think this package should just be an axis metadata wrapper.

timholy avatar Jul 05 '18 21:07 timholy

In my mind AxisArrays has these basic levels of processing (let me know if I'm missing anything):


A. Index into an AxisArray with integers/ranges: - like a normal AbstractArray

B. Index into an AxisArray with positional indexes: - translate value indexes into integer/range indexes, leave non-value indexes untouched - index into the AxisArray with the results using A

C. Index into an AxisArray with named indexes: - translate from names to positions, fill the remaining indices with : - index into the AxisArray with the results using B


I see B as handled with axisindices right now, which uses Dimensional and Categorical traits to try and guess at whether to return arrays or ranges of indices, and then perform the translation using code that either assumes the axis is sorted or doesn't. I think the main complication in figuring out how to structure AxisArrays is capturing this behaviour, and abstracting this procedure out is the key to simplifying AxisArrays.

Does that sound right?

iamed2 avatar Jul 05 '18 21:07 iamed2

I agree with most of that, but I'd say that C (named axes) is actually orthogonal to B (value-indexing), in the sense that you can use integer/range indexing with named axes.

timholy avatar Jul 06 '18 01:07 timholy

I agree, that makes sense

iamed2 avatar Jul 06 '18 02:07 iamed2

Matt has pointed out that sometimes AxisArrays.to_index will produce AxisArray{T, N} indexes sometimes (where N>=2) as in the README example. Since the output dimensionality and axis names depend on the index dimensionality and type, I think this means that named indexing and at least the indexing-by-value interface need to live in the same package.

iamed2 avatar Aug 07 '18 12:08 iamed2

Following the lead of a proposal I just made for Interpolations, perhaps we should make A[i...] be "normal" indexing and A(position...) be value-indexing. For handling a mix, we could export a Index wrapper and use A(3.2mm, Index(5), 8.4s) as an interface.

The advantage here is that AxisArrays could continue to be "normal" AbstractArrays, using integer indexing and where axes (the one in Base, not the one here) returns an AbstactUnitRange{<:Int}. However, other than obviating the use of end and similar constructs (though we could use https://github.com/JuliaArrays/EndpointRanges.jl), it would offer nice syntax for value-indexing, too. Perhaps the new trait name of AxisArrays.axes should be valueaxes?

timholy avatar Aug 23 '18 12:08 timholy

I find the function call interface to make much more sense for interpolations (where it feels like I'm actually executing a calculation to determine the value) vs. here (where it feels more like lookup). I know the distinction there is rather fuzzy, but I find it meaningful. I'd argue that the tension we feel here is because we're trying to be two data structures at once, both a [Sorted|Ordered]CartesianDict with named dimensions and an array with named and valued axes. While we could move the by-value indexing behind a function call interface, I'm more tempted to rip it out altogether and move it into a CartesianDict package. The dictionary package could be a wrapper around this one, just exposing a different API. It'd effectively be the dense/rectangular version of JuliaDB's NDSparse.

I don't think @andyferris follows this repository, so I'm pinging you because I know you've put in lots of effort on blurring the lines between dictionaries and arrays. I'd be interested in hearing your thoughts here.

mbauman avatar Aug 23 '18 22:08 mbauman

Thanks @mbauman for the link.

Yes, this is very interesting. I like the way this mixes up arrays, offset arrays, generalizations of dictionaries and Cartesian indices all in the one place :). A few (some of them obvious) things:

  • If you want to not quite follow the AbstractArray interface, that's great, just don't subtype AbstractArray. We can make a new supertype with its own interface; things like broadcast can be extended these days pretty easily. In this case you could use something like parent(A)[i...] for array-style indexing, no?
  • It seems challenging to have more than one kind of keys for a container. Especially when you throw in multidimensional indexing into the mix. What exactly should keys return? Already, I'm pretty confused by having both linear and Cartesian indices in AbstractArray but having arbitrary column labels (which themselves may be containers) seems like it could lead to complexity, fast.
  • If you do have more than one "kind" of key, to limit complexity it would seem nice to have a very simple transformation between them. For example A[1, 2] being equivalent to A[CartesianIndex(1, 2)], or something like that, having keys return a container of CartesianIndex - this would seem entirely self-consistent (the first transformation is predictable syntactic sugar). There's also the new A[a = 1, b = 2] syntax available for use soon, so a similar transformation for named axes seems quite reasonable.
  • When the transformation is more complex, I'd tend to use a different generic function to getindex, or it gets hard to write down and understand all the guarantees that keys and getindex should provide.
  • For AbstractDicts (and in general, any containers with keys that themselves could be containers) I feel we will need to split getindex and getindices into two generic functions and syntaxes. My understanding is that @mbauman is on the same page here, and the A.[indices] syntax should be free for the taking in v1.x?
  • I'd recommend investigating a new interface for fast iteration over the elements of one or more idexable containers (suitable for dictionaries, etc). Obviously you want your axis arrays to be fast. AbstractArrays use eachindex to return the faster of Cartesian indices or linear indices for using in your loops. Dictionaries may need tokens, broadcasting multiple over axis arrays with the same axes should use the integer indices rather than symbols, etc. For AbstractDict this is complicated by the fact that your fast "token" type could also be a valid index... thus I would suggest something along the lines of eachtoken and gettoken, or whatever.
  • The function call interface is interesting. These days I'm basically confused why arrays/dictionaries/etc aren't just maps from index to value (i.e. functions). I can only say that when I switched from MATLAB to Julia I found the Julia code easier to read because of the distinction between [] and ().

I haven't had the time to experiment with any of this lately - but what I'd like to see is basically a dictionary that can be as fast as an array, and supporting multidimensional indices (via positional or named axes - I'm hoping that Symbol for axis names should be sufficient).

andyferris avatar Aug 25 '18 05:08 andyferris

It's been almost 3 years since this first opened - would it be safe to say that indexing by integers will remain a special case for AxisArrays?

junglegobs avatar Mar 24 '20 12:03 junglegobs