mne-python icon indicating copy to clipboard operation
mne-python copied to clipboard

Facilitate export to XArray

Open christian-oreilly opened this issue 4 years ago • 9 comments

Describe the problem

XArray is very convenient when doing data exploration as it often allows for a more transparent code by using dimension names rather than indexes. It would be very useful if MNE was providing XArray export functions where it already provides exports to NumPy array. For example, whereas someone would currently use

ica.get_sources(epochs).get_data().mean(2).mean(0)

by adding a to_xarray() method, we could use the following alternative code

ica.get_sources(epochs).to_xarray().mean(["time", "epochs"])
#  or ica.get_sources(epochs).to_xarray().mean("time").mean("epochs")

Personally, I found the second version much easier to understand. With the first version, I would typically have to print the shape of the data to infer what axis corresponds to what to understand what operation is being performed. The first form (with dimension indexes) is also much more error-prone because all sorts of things can affect the axes ordering whereas dimension names and coordinates generally remain consistent and generate much more transparent errors when they don't.

christian-oreilly avatar May 17 '20 17:05 christian-oreilly

no objection. We have to_data_frame so we can add a to_xarray following the same approach (using nested import)

agramfort avatar May 17 '20 18:05 agramfort

Or maybe add a kwarg to get_data, output='ndarray' (default) or 'xarray'. I think I'd prefer that to a new function

larsoner avatar May 17 '20 20:05 larsoner

it's just harder to discover

agramfort avatar May 17 '20 20:05 agramfort

Indeed, but less API / kwarg duplication

larsoner avatar May 17 '20 20:05 larsoner

I’d prefer dedicated functions for each export. It’s how all other packages behave that I’ve ever encountered in my scientific Python stack.

hoechenberger avatar May 17 '20 20:05 hoechenberger

Any objections to adding xarray in the requirements.txt? I am working on a different PR related to this https://mne.discourse.group/t/exporting-surface-sources-as-nifti-volume/5518/3 I'd used xarray in that PR, but that would need adding this dependency. I am thinking if we were to implement this feature request (I can do a PR for this), it would require adding this dependency anyway. I think XArray is a great component of Python scientific stack and worth integrating. I want to check first as I understand that there are also valid arguments for keeping dependencies as light as possible.

christian-oreilly avatar Sep 18 '22 15:09 christian-oreilly

@christian-oreilly adding a dependency on the project is not cheap in terms of maintenance and technical debt. Can you motivate a bit this decision?

agramfort avatar Sep 18 '22 15:09 agramfort

I don't think you should need xarray to convert a surface to NIfTI at least

larsoner avatar Sep 18 '22 15:09 larsoner

@agramfort Yes, I agree that adding dependencies must not be a rash decision. Some thoughts on that: 1 - The XArray export functionality would require importing XArray. You were thinking of it more as an extra/optional import required and checked for only when calling such an export function? 2 - I do use it for the intended PR for things like that:

    source_data = xr.DataArray(stc.data, dims=["sources", "time"]) \
                    .rolling(time=time_downsample_factor, center=True) \
                    .mean()[:, time_downsample_factor//2:-time_downsample_factor//2:time_downsample_factor]

Of course, there is nothing there that cannot be implemented in plain Numpy or with Pandas (since in this specific context, I am only in 2D), but I think XArray significantly adds to readability as it allows operating on axes not using integers but with named dimensions. Compares

    source_data.transpose("time", "sources")
    source_data.sel(time=source_data.time > 0).mean("time")

with

    source_data.transpose(1, 0):
    source_data[:, time > 0].mean(1)
    # Note also that here, time needs to be maintained
    # as a separate variable. XArray can maintain it internally
    # as the coordinates for the given dimension.

for example. The former is explicit, whereas the latter requires introspecting the objects to understand what is what. As much as Pandas is a game changer (compared to using Numpy only) when doing data analysis in general, XArray is a game changer when working with multidimensional data that can be represented as dense blocks (e.g., time X channel X epochs X ...); less so if not dense like if stacking data across subjects (e.g., time X channel X epoch X subject) where there is a large variability in the numbers of epochs per subject. 3 - XArray is an open door for using Dask (distributed computing) to scale up analyses toward "bigger" data applications. 4 - In the long term, I think mne could benefit from using XArray for some of its internal operations (to make them simpler and more readable).

Note that I have no particular involvement with XArray (i.e., I have no conflicting interests). I am just a happy customer :)

@larsoner Agreed. It is not a blocker for converting a surface to NIfTI. If available, I'll use it. If not, I'll get around. So this does not need to be settled for this specific PR.

christian-oreilly avatar Sep 18 '22 15:09 christian-oreilly