
Cumulative examples

Open patrick-naylor opened this issue 2 years ago • 8 comments

Here I am trying to add docstring examples to the cumulative functions. I did this by basically copying the method used for the reduction functions. I'm not at all sure I did this correctly, so I'm marking it as a draft.

  • [ ] Closes #xxxx
  • [ ] Tests added
  • [ ] User visible changes (including notable bug fixes) are documented in whats-new.rst
  • [ ] New functions/methods are listed in api.rst
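For context, the kind of doctest-style example being added to the cumsum/cumprod docstrings looks roughly like this (illustrative only, not the generated text):

```python
import numpy as np
import xarray as xr

# NaNs are skipped by default for float data (skipna=True)
da = xr.DataArray([1.0, 2.0, np.nan, 3.0], dims="x")
da.cumsum()  # values: [1., 3., 3., 6.]
```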

patrick-naylor avatar Oct 10 '22 19:10 patrick-naylor

Very nice, this is something that's been on the TODO list! :)

I believe we wanted to rename generate_reductions.py to generate_aggregations.py so cumsum et al. could be included and generated there as well. Would it be a lot of work for you to merge these into that one?

Illviljan avatar Oct 10 '22 21:10 Illviljan

@Illviljan That is definitely something I could do. Are there any other methods I should be including in this?

patrick-naylor avatar Oct 10 '22 21:10 patrick-naylor

Right now, I think cumsum and cumprod are enough. numpy-groupies has a few more examples that I suppose we could support in the future.

Illviljan avatar Oct 10 '22 21:10 Illviljan

Great, I'll start working on that. It shouldn't take too long.

patrick-naylor avatar Oct 10 '22 21:10 patrick-naylor

Thanks @patrick-naylor !

Instead of using Dataset.reduce I think we want something like

def cumsum(..., dim):
    return xr.apply_ufunc(
        np.nancumsum if skipna else np.cumsum,
        obj,
        input_core_dims=[[dim]],
        output_core_dims=[[dim]],
        kwargs={"axis": -1},
    )
    # now transpose dimensions back to input order

to fix #6528.
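A self-contained sketch of that approach (the wrapper name below is hypothetical, not xarray's actual implementation):

```python
import numpy as np
import xarray as xr

def cumsum_via_ufunc(obj, dim, skipna=True):
    # apply_ufunc moves the core dim to the last axis, so axis=-1 is correct
    result = xr.apply_ufunc(
        np.nancumsum if skipna else np.cumsum,
        obj,
        input_core_dims=[[dim]],
        output_core_dims=[[dim]],
        kwargs={"axis": -1},
    )
    # transpose dimensions back to input order
    return result.transpose(*obj.dims)

da = xr.DataArray([[1.0, np.nan], [3.0, 4.0]], dims=("x", "y"))
cumsum_via_ufunc(da, "y")  # values: [[1., 1.], [3., 7.]]
```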

At the moment, this should also work on GroupBy objects quite nicely.

dcherian avatar Oct 11 '22 16:10 dcherian

Thanks @dcherian, I'll try to work that in. Is there a particular reason why there is no cumprod for GroupBy objects?

patrick-naylor avatar Oct 11 '22 18:10 patrick-naylor

Is there a particular reason why there is no cumprod for GroupBy objects?

Nope. Just wasn't added in :)

dcherian avatar Oct 11 '22 18:10 dcherian

def cumsum(..., dim):
    return xr.apply_ufunc(
        np.nancumsum if skipna else np.cumsum,
        obj,
        input_core_dims=[[dim]],
        output_core_dims=[[dim]],
        kwargs={"axis": -1},
    )
    # now transpose dimensions back to input order

I'm running into an issue with variables without the core dimensions. Would it be better to do a workaround in cumsum or in apply_ufunc, like you mentioned in #6391?

patrick-naylor avatar Oct 11 '22 22:10 patrick-naylor

I'm running into an issue with variables without the core dimensions. Would it be better to do a workaround in cumsum or in apply_ufunc, like you mentioned in #6391?

While this is definitely worth improving, it is out of scope for this PR. Can you define the Datasets such that this problem does not occur?

headtr1ck avatar Oct 14 '22 17:10 headtr1ck

I've merged the cumulative and reduction files into generate_aggregations.py and _aggregations.py. This uses the original version of reductions with an additional statement on the dataset methods that adds the original coordinates back in.

Using apply_ufunc and np.cumsum/cumprod has some issues: it only computes the cumulative result along one axis, which makes iterating through each dimension necessary. This makes it slower than the original functions and also causes some problems with the groupby method.

Happy for any input on how the method using apply_ufunc might be usable or on any ways to change the current method.
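To illustrate the iteration issue: since np.cumsum/np.nancumsum take a single integer axis, a multi-dimension cumulative sum has to be applied one dimension at a time. A sketch (the helper name is hypothetical):

```python
import numpy as np
import xarray as xr

def cumsum_over_dims(da, dims, skipna=True):
    func = np.nancumsum if skipna else np.cumsum
    result = da
    for dim in dims:
        # one pass per dimension, since the numpy functions accept only one axis
        result = result.copy(data=func(result.data, axis=result.get_axis_num(dim)))
    return result

da = xr.DataArray([[1.0, 2.0], [3.0, 4.0]], dims=("x", "y"))
cumsum_over_dims(da, ["x", "y"])  # values: [[1., 3.], [4., 10.]]
```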

I'm getting a few issues I don't quite understand:

  • When running pytest on my local repository I get no errors but it's failing the checks here with a NotImplementedError
  • Black is having an issue with some of the strings in generate_aggregations. It's saying it cannot parse what should be valid code.

Thanks!

patrick-naylor avatar Oct 19 '22 23:10 patrick-naylor

I don't think you have flox installed; if it's not installed the code will take the old path. Run `conda install flox` and I think you'll get the NotImplementedError. Then you may have to change the default settings in cumsum so flox is not used.

Illviljan avatar Oct 21 '22 08:10 Illviljan

Thanks for taking this on @patrick-naylor ! This is a decent-sized project!

Using apply_ufunc and np.cumsum/cumprod has some issues: it only computes the cumulative result along one axis, which makes iterating through each dimension necessary.

np.cumsum only supports an integer axis, so this is OK?

flox doesn't support cumsum at the moment (https://github.com/xarray-contrib/flox/issues/91) so we can delete that bit and just have one code path.

dcherian avatar Oct 21 '22 15:10 dcherian

@dcherian Do you think it would be better to finish this PR as the creation of _aggregations.py to give the cum methods better documentation? Then start a new one to fix #6528?

patrick-naylor avatar Oct 21 '22 17:10 patrick-naylor

Do you think it would be better to finish this PR as the creation of _aggregations.py to give the cum methods better documentation? Then start a new one to fix https://github.com/pydata/xarray/issues/6528?

Sure that would be a good intermediate step. Let us know if you need help.

dcherian avatar Oct 21 '22 17:10 dcherian

@keewis do you understand this blackdoc error?

trim trailing whitespace.................................................Passed
fix end of files.........................................................Passed
check yaml...............................................................Passed
debug statements (python)................................................Passed
mixed line ending........................................................Passed
autoflake................................................................Passed
isort....................................................................Passed
pyupgrade................................................................Passed
black....................................................................Passed
black-jupyter............................................................Passed
blackdoc.................................................................Failed
- hook id: blackdoc
- exit code: 123

error: cannot format /code/xarray/util/generate_aggregations.py: Cannot parse: 201:27: EOF in multi-line string
Oh no! 💥 💔 💥
215 files left unchanged, 1 file fails to reformat.

Illviljan avatar Oct 22 '22 11:10 Illviljan

@dcherian cumsum for resample fails for some reason, do you have any ideas?

import numpy as np
import pandas as pd
import xarray as xr

da = xr.DataArray(
    np.array([1, 2, 3, 1, 2, np.nan]),
    dims="time",
    coords=dict(
        time=("time", pd.date_range("01-01-2001", freq="M", periods=6)),
        labels=("time", np.array(["a", "b", "c", "c", "b", "a"])),
    ),
)
ds = xr.Dataset(dict(da=da))
a = ds.resample(time="3M")
a.cumsum()


Traceback (most recent call last):

  File "C:\Users\J.W\anaconda3\envs\xarray-tests\lib\site-packages\spyder_kernels\py3compat.py", line 356, in compat_exec
    exec(code, globals, locals)

  File "c:\users\j.w\documents\github\xarray\xarray\util\untitled2.py", line 20, in <module>
    a.cumsum()

  File "C:\Users\J.W\Documents\GitHub\xarray\xarray\core\_aggregations.py", line 4921, in cumsum
    return self.reduce(

  File "C:\Users\J.W\Documents\GitHub\xarray\xarray\core\resample.py", line 395, in reduce
    return super().reduce(

  File "C:\Users\J.W\Documents\GitHub\xarray\xarray\core\groupby.py", line 1357, in reduce
    return self.map(reduce_dataset)

  File "C:\Users\J.W\Documents\GitHub\xarray\xarray\core\resample.py", line 342, in map
    return combined.rename({self._resample_dim: self._dim})

  File "C:\Users\J.W\Documents\GitHub\xarray\xarray\core\dataset.py", line 3646, in rename
    return self._rename(name_dict=name_dict, **names)

  File "C:\Users\J.W\Documents\GitHub\xarray\xarray\core\dataset.py", line 3587, in _rename
    raise ValueError(

ValueError: cannot rename '__resample_dim__' because it is not a variable or dimension in this dataset

Illviljan avatar Oct 22 '22 11:10 Illviljan

do you understand this blackdoc error?

so the reason is the position of the triple quotes: with

def f():
    """
    >>> 1 + 2"""
    pass

the extracted line will become `1 + 2"""`, which, when handed to black, results in that error.

Technically, that's a bug (the triple quotes mark the end of the docstring), but one that is a bit tricky to fix: blackdoc is implemented using a line parser, which does not work too well if the transition happens somewhere within the line.

My guess is that it would have to start counting quotes, which I've tried to avoid up until now since there are a lot of details to get right (see also keewis/blackdoc#145).

Edit: for now, I guess it would be fine to add something like "The closing quotes are not on their own line." to the error message.
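For reference, putting the closing quotes on their own line avoids the parse error:

```python
def f():
    """
    >>> 1 + 2
    3
    """
    pass
```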

keewis avatar Oct 22 '22 15:10 keewis

@keewis What do the numbers mean in Cannot parse: 201:27? I can't find any functions with docstrings in this file; it's just a bunch of multiline strings assigned to variables, which to my mind shouldn't trigger blackdoc.

Illviljan avatar Oct 22 '22 16:10 Illviljan

well, that's just it: there's no way to discriminate between docstrings and multi-line triple-quoted strings without parsing the python code (which is definitely out of scope), so blackdoc doesn't even attempt to. Instead, any line that begins with >>> is assumed to be a doctest line. With that, the numbers mean: line 201, character 27.

When that file was created we decided to skip the generate_* files, but because the file was renamed that rule no longer works. Could you update the pre-commit configuration?
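Something like the following entry in .pre-commit-config.yaml should restore the exclusion (the exact pattern and pinned rev are assumptions, based on the renamed file):

```yaml
repos:
  - repo: https://github.com/keewis/blackdoc
    rev: v0.3.8  # pin to whichever version the repo already uses
    hooks:
      - id: blackdoc
        exclude: "generate_aggregations.py"
```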

keewis avatar Oct 22 '22 19:10 keewis

@patrick-naylor, feel free to try out a better default example if you want.

Illviljan avatar Oct 26 '22 17:10 Illviljan

Thanks @Illviljan, @dcherian, and @keewis so much for the help.

patrick-naylor avatar Oct 26 '22 17:10 patrick-naylor