xarray Replace dataset scatter with the dataarray version

[ ] Requires #4820
[ ] Tests added
[ ] Passes pre-commit run --all-files
[ ] User visible changes (including notable bug fixes) are documented in whats-new.rst
[ ] New functions/methods are listed in api.rst

Jul 19 '21 21:07 Illviljan

Hello @Illviljan! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

In the file xarray/plot/plot.py:

Line 19:1: F401 '.utils._adjust_legend_subtitles' imported but unused Line 19:1: F401 '.utils._legend_add_subtitle' imported but unused Line 19:1: F401 '.utils.legend_elements' imported but unused

Comment last updated at 2021-11-05 06:38:18 UTC

Jul 19 '21 21:07 pep8speaks

Unit Test Results

        5 files         5 suites 40m 26s :stopwatch: 16 290 tests 14 546 :heavy_check_mark: 1 739 :zzz:   5 :x: 75 780 runs 68 978 :heavy_check_mark: 6 777 :zzz: 25 :x:

For more details on these failures, see this check.

Results for commit 48c0cde5.

:recycle: This comment has been updated with latest results.

Jul 19 '21 21:07 github-actions[bot]

Good idea to keep the dataarray scatter non-public for now, found quite a few bugs there. :) Turned out a little trickier than expected to get the tests passing, I'll continue pondering this next week.

Some thoughts before I forget them:

How important is the argument order for the plotting functions? It's been very difficult getting the line plot and scatter to behave the same because they have completely different order of inputs. Would be easy if we could just force named arguments?
How important are figure legends in facetgrid? Some tests breaks on this now but I'm not sure it's a good idea to change to figlegends instead of legend per plot.
We seem to be lacking tests regarding how the plot should look like. For example an acccidentally inverted plot didn't crash the tests.

Jul 25 '21 22:07 Illviljan

My 2 cents:

How important is the argument order for the plotting functions?

Obviously it would be nice to be able to get the arguments in the same order across functions, but I think we probably care more about not suddenly breaking backwards compatibility - any change to the order should technically require a deprecation cycle... That said standardizing something more consistent would be good.

How important are figure legends in facetgrid? Some tests breaks on this now but I'm not sure it's a good idea to change to figlegends instead of legend per plot.

Not really sure what the best thing to do is.

We seem to be lacking tests regarding how the plot should look like.

My understanding is that testing the displayed output of plotting functions is notoriously tricky and unreliable, hence when we currently test we interrogate properties of matplotlib objects. There are libraries that check images are correct, and @pytest.mark.flaky helps, but I'm not sure if it's worth it. Also we are merely wrapping matplotlib here - if we make efforts to check that the objects' properties are as expected then at some level we obviously have to trust that that corresponds to the same image.

For example an acccidentally inverted plot didn't crash the tests.

Is there no obvious object property test that would have caught this?

Jul 26 '21 16:07 TomNicholas

The scatter tests doesn't seem to make sure x,y values are plotted as expected. I was testing the test cases for other reasons and noticed that the x and values were opposite. You can check the x and y values are returned in the expected order with something like this:

fig, ax = plt.subplots(1,1)
p = ax.plot([1, 2], [3, 4])
p[0].get_data()
Out[12]: (array([1, 2]), array([3, 4]))

Something similar can probably be done with hue and size as well so there shouldn't be any need for the tricky display tests.

Jul 26 '21 22:07 Illviljan

Would be easy if we could just force named arguments?

Yes that would be a good idea. A deprecation would be nice but for the backends refactor (open_dataset) this directly became an error.

For example an acccidentally inverted plot didn't crash the tests.

Good you noticed - I also think the plotting tests are not comprehensive.

Jul 30 '21 17:07 mathause

I'm struggling getting categorical colorbars to work nicely:

import numpy as np
import xarray as xr

das = [
    xr.DataArray(
        np.random.randn(3, 3, 4, 4, 2),
        dims=["x", "row", "col", "hue", "size"],
        coords=[range(k) for k in [3, 3, 4, 4, 2]],
    )
    for _ in [1, 2]
]
ds = xr.Dataset({"A": das[0], "B": das[1]})
ds.hue.name = "huename"
ds.hue.attrs["units"] = "hunits"
ds.x.attrs["units"] = "xunits"
ds.col.attrs["units"] = "colunits"
ds.row.attrs["units"] = "rowunits"
ds.A.attrs["units"] = "Aunits"
ds.B.attrs["units"] = "Bunits"

ds2 = ds.copy()
ds2["hue"] = ["d", "a", "c", "b"]
g = ds2.plot.scatter(
    x="A",
    y="B",
    hue="hue",
    markersize="size",
    col="col",
    add_legend=True,
    add_colorbar=True,
)

My goals are:

4 colors for 4 categories
ticks should be centered on the color.

One of the issues are that https://github.com/pydata/xarray/blob/135a3351bf77a4a55e76a8c60b40852ec10cdd4a/xarray/plot/utils.py#L69 Seems to be focused on contour(f) plots only which leads to the colorbar having N-1 colors all the time.

Any ideas how to solve this is appreciated.

Related links:

https://stackoverflow.com/questions/14777066/matplotlib-discrete-colorbar
https://gist.github.com/jakevdp/8a992f606899ac24b711

Nov 28 '21 11:11 Illviljan

Not sure if this is too naïve from my side but can you just pass N+1 levels to the function?

Dec 01 '21 15:12 mathause

Not sure if this is too naïve from my side but can you just pass N+1 levels to the function?

Pretty much what I did. ["a", "b", "c"] is converted to [1, 3, 5]. The discrete colors/levels are bounded by [(0, 2), (2, 4), (4, 6)]. This way the ticks will be put nicely in the middle of the discrete color.

Dec 01 '21 21:12 Illviljan

Gotten a little sidetracked with line plots for a while now. I'm annoyed that all the primitives are different for each plotting type, e.g Collections, Line2D, list of Line2D, etc. It makes it hard to use similar code paths. So I've been trying out LineCollection a little which behaves very similarly to ax.scatter. Being able to use hue and size the same way as in scatter seems like a big improvement to me.

In the example below you can clearly see that we have two curves where the hue changes somehow over time. But I'm having trouble understanding how to determine what's supposed to be a line. For example if x and y were 4d arrays how should it be split? scatter uses ravel to get around these hard questions. Maybe you can ravel just certain dimensions?

Like if y(a,b,c,d) and x(a, b) maybe you can

ravel along a,b and create a single line
- Will this require sorting as unsorted lines usually looks bad? Does it make sense at that point?
new lines will be generated along remaining dims; c, d.
- How do you label these new lines, since color and linewidth can now be used along lines as well? Using linestyle along c? Those are finite though. And what about the d dim and possibly other remaining dims? Maybe they should just remain a mystery? You probably could play around with hue or size to see those changes.

Not sure how to do that in xarray though. Thoughts, ideas or other examples are appreciated.

import matplotlib as mpl
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import numpy as np
from matplotlib.collections import LineCollection
import numpy as np

np.random.seed(42)

dates = np.arange("2017-01-01", "2017-06-20", dtype="datetime64[D]" )
y = np.cumsum(np.random.normal(size=len(dates)))
y2 = np.cumsum(np.random.normal(size=len(dates)))
c = np.cumsum(np.random.normal(size=len(dates)))
c2 = np.cumsum(np.random.normal(size=len(dates)))
s = 1 + np.minimum(np.cumsum(np.random.normal(size=len(dates))), 0)
s2 = 1 + np.minimum(np.cumsum(np.random.normal(size=len(dates))), 0)

fig, ax = plt.subplots()

#convert dates to numbers first
inxval = mdates.date2num(dates)
points = np.array([inxval, y]).T.reshape(-1,1,2)
segments = np.concatenate([points[:-1],points[1:]], axis=1)

lc = LineCollection(segments, cmap="plasma", linewidth=s)
lc.set_array(c)
p = ax.add_collection(lc)


points2 = np.array([inxval, y2]).T.reshape(-1,1,2)
segments2 = np.concatenate([points2[:-1],points2[1:]], axis=1)

lc2 = LineCollection(segments2, cmap="plasma", linewidth=s2)
lc2.set_array(c2)
p = ax.add_collection(lc2)

fig.colorbar(p, ax=ax)


ax.xaxis_date()
ax.autoscale_view()
plt.show()

Jan 09 '22 01:01 Illviljan

Line plots aren't completely broken now. The array is being stacked to 1 dim now. Using nan to split the lines at appropriate places.

ds = xr.tutorial.scatter_example_dataset(seed=42)
hue_ = "y"
x_ = "y"
size_="y"
z_ = "z"

fig = plt.figure()
ax = fig.add_subplot(1, 2, 1, projection='3d')
ds.A.sel(w="one").plot.line(x=x_, z=z_, hue=hue_, linewidth=size_, ax=ax)
ax = fig.add_subplot(1, 2, 2, projection='3d')
ds.A.sel(w="one").plot.scatter(x=x_, z=z_, hue=hue_, markersize=size_, ax=ax)

Jan 22 '22 00:01 Illviljan

pre-commit.ci autofix

Jun 18 '22 14:06 Illviljan

xarray xarray copied to clipboard

Replace dataset scatter with the dataarray version

Comment last updated at 2021-11-05 06:38:18 UTC

Unit Test Results

xarray
xarray copied to clipboard