z5 icon indicating copy to clipboard operation
z5 copied to clipboard

Axis ordering

Open funkey opened this issue 6 years ago • 11 comments

As @clbarnes noted here, user-set attributes that refer to coordinates (like resolution or offset) will end up with different axis orderings depending on which implementation was used. A z5 user will set resolution as (z, y, x), an n5 user will set it as (x, y, z).

I see two possible ways to deal with that:

  1. Delegate the responsibility of being consistent in attribute axis orderings to the users. This is what we do right now. This requires users of one of the backends to accept that the dataset shape will be reported as (z, y, x), but their attributes have to be written as (x, y, z) (or the other way around).
  2. Let users mark some of the attributes as axis-related, e.g., by introducing a class AxisTuple and have ds.attrs['resolution'] = z5.AxisTuple(z, y, x). The concrete implementation will then make sure the ordering is consistent with the storage.

I don't really like any of the two, but if I had to chose I'd say explicit is better, so 2.

Other suggestions?

funkey avatar Mar 20 '18 13:03 funkey

I like the explicit class going in, but getting it out would be more awkward - you'd need to json-serialise the whole axistuple, then check every attribute on the way out to see whether it was one of those. That serialisation may also make little sense to people interoperating between different N5 implementations.

clbarnes avatar Mar 20 '18 13:03 clbarnes

As @clbarnes has already pointed out, for option 2, de-serializing from json is tricky. If we serialize a new class, it will not be possible to read the attribute from a different n5 implementation. We could just add some indicator to the string (e.g. appending "_n5Axes") to the attribute and then resolve this in the attribute manager with some pre / postprocessing of keys. This way, the attribute would still be accessible in other n5 implementations, but with a different name.

Another option would be to automatically invert attributes that are lists or tuples of the same length as the dimension in the AttributeManager when using n5 format. This, however, would only work for datasets, because for groups or files we don't have any dimension information.

constantinpape avatar Mar 20 '18 15:03 constantinpape

Regarding option 2, one could also non-invasively store the names of axis-tuple attributes in a reserved attribute, like __axis_tuples.

Option 3 (automatically inverting) I would consider the most harmful. Users might want to store tuples of int that have nothing to do with axes.

Here is another suggestion:

  1. Introduce axis specifications. A z5/n5 dataset could allow (or even require) axes to be described by at least a name (like x, y, z, t, channel) and additional attributes (like size, offset, resolution, etc.). Users should then be made aware that axis related information should be stored in the axis specification.

funkey avatar Mar 20 '18 15:03 funkey

I agree that automatic inverting based on the length of the tuple is risky. For the case where z5 is doing both the input and output, I like the idea of having an extra attribute to store the names of what has and hasn't been inverted. It would still be up to the user to manage that, but it at least means they can do it once and then forget about it, rather than have to worry every time they IO. I'd name it something which makes it clear that it's a z5 thing.

@funkey - would you envision named axes becoming part of the N5 spec? I like it conceptually (I've transitioned whole projects over to use vigra instead of numpy for similar reasons). That would then mean that resolution etc. should/would be stored as a dict in order to be implementation-agnostic (again, it's what I've done with other orientation-heavy projects).

xarray uses this sort of approach, I think. It primarily uses netCDF as its underlying file format but it looks like it has zarr bindings too.

clbarnes avatar Mar 20 '18 15:03 clbarnes

Vigra has a similar concept with axistags. I personally don't like it, because they add a lot of implementation overhead and still require user intervention (the issue here is that numpy functions still work on vigra arrays, but don't share the axistags information).

Anyway, for the issue at hand, I think the simplest and clean solution would be to specify for certain attributes if they should be serialized in n5-axis order, as @funkey has suggested. We can keep a list of these attributes and serialize this list with a seperate key. This way, different n5 implementations can read the same keys in the correct axis order.

constantinpape avatar Mar 20 '18 16:03 constantinpape

I wouldn't go that far to suggest a new array type with axis meta-data. z5 can still return plain ndarrays. But it might be worth thinking about having a dedicated place to name axes and store attributes about them in the datasets. Imagine that with a dataset ds you could just query ds.axes in the same way we currently query ds.attrs. ds.axes['z'].index would then tell us which dimension is z, for example.

If we wanted to return an array type with meta-data, the axis extension to the N5 spec would allow conversion to and from xarray.

funkey avatar Mar 20 '18 16:03 funkey

Oh, I see. Yes, the ds.axes part sounds like a neat idea.

I heard of xarray before, but never worked with it. If they have a constructor that takes a ndarray and some way of specifying the order, the conversion could just be done by the user, by passing the axes object in addition to the data stored as ndarray.

Something similar would be possible with vigra, which supports constructing a vigra array from a ndarray and axistags.

constantinpape avatar Mar 20 '18 17:03 constantinpape

Her are my 5ct: Arrays aren't vectors, and JSON has no vector type. We can therefore not automatically invert axis orders, regardless of whether they have the same length as the containing dataset or whatever fuzzy naming convention may suggest this. Keep in mind that JSON has limited type flexibility and that is the common ground on which we operate with N5. @funkey 's first option to have the user API declare what the expected axis order is when arrays are used, sounds perfectly reasonable to me. It is my fault that I inverted the offset and resolution attributes in a bunch of N5 copies of CREMI datasets for purely aesthetical reasons without spending more thought. I.e., what I generated is data in the CREMI-N5 format, not in the CREMI-HDF5 format.

Using dictionaries with named axes is another option but comes with some overhead in parsing and mem-copying, not that this would matter much.

On the other end of things where we have control, I am fine with adding a per dataset attribute to switch between row or column major for datasets, so that an existing API can decide about their preference. However, this is only a hint for the consumer anyway and has no impact on what's in the data. Z5 and N5-HDF5 unfortunately invert the axis order in the dataset access API relative to what's going on in blocks, this would be the right time to change this.

Please correct me if I am confusing things.

axtimwalde avatar Mar 22 '18 14:03 axtimwalde

We just had a long discussion about this (and things are, as usual, more complicated as expected). The gist of it: we add new functionality to the different APIs, to explicitly read and write VectorAttributes, i.e. attributes that are 1 dimensional arrays and should be interpreted in the same axis order as the dataset.

For reading n5 via z5py, this would mean the following: the AttributeManager (attrs) gets the functions getAttribute, setAttribute, getVector, setVector, where get/setAttribute behaves as [] does right now. get/setVector will read / write lists in accordance with the n5 specification. This means that they are stored in column major, but (in python) returned as row major. In practice, that means that these lists are reversed internally when reading / writing.

The recommended usage will be to use the new getters / setters to be explicit, but [] will still be supported and just default to get/setAttribute.

constantinpape avatar Mar 22 '18 15:03 constantinpape

We had a longer lab-meeting about this and we came up with this proposal:

  • N5 is column major.
  • N5-HDF5 translates the row major HDF5-API to N5 column major.
  • Z5 translates the N5 column major API to row major.

This is valid and consistent for dataset access and dataset attributes.

For vector attributes that are not part of the core API, we will add vector access methods that:

  • In N5, reading from N5 FS, N5 AWS, N5 Google, will read arrays as column major and return arrays as column major.
  • In N5, reading from HDF5, will read arrays as row major and return arrays as column major.
  • In Z5, reading from N5 FS, will read arrays as as column major and return arrays as row major.

When copying from column major formats N5 FS, N5 AWS, N5 Google, to row major formats N5 HDF5 or the other way round, the user has to provide a list of arrays that will be treated as vectors and therefore inverted.

axtimwalde avatar Mar 22 '18 15:03 axtimwalde

FYI, I did this in N5 FS and it feels so wrong that we decided to not do it. @constantinpape will introduce these methods into Z5 but will also have a look at how not-row-majoring the column-major API would work. In N5, we will not add these methods because they are a hot-fix for a problem that also manifests elsewhere (e.g. annotation vector lists stored as 2D datasets in CREMI) and wouldn;t be addressed. So, we are back at @funkey's solution 1: N5 ist just a container. A format (such as CREMI) is developed on top of this and specifies how vectors stored as datasets or arrays are interpreted.

axtimwalde avatar Mar 24 '18 00:03 axtimwalde