awkward icon indicating copy to clipboard operation
awkward copied to clipboard

Convenience methods for plotting and histogramming

Open jpivarski opened this issue 2 years ago • 4 comments

Description of new feature

One thing that's a little annoying right now is getting data from Awkward Arrays into plots. A lot of my tutorials include stanzas like

hist.Hist.new.Reg(100, 0, 5).Double().fill(ak.flatten(

    awkward_array_goes_here,

axis=None)).plot()

but in the heat of interactive data analysis, you have some intermediate array result that you want to quickly plot to see what's happening. I find myself saying that the value of array-oriented interfaces is the quick interleaving of single-operation-multiple-data and feedback about that step, so that you catch mathematical errors early. (For instance, an unexpected spike at zero, a surprisingly long tail, or smeared trigger thresholds, missing mass peaks, etc.) However, the above is not easy to quickly plot.

We generally try to minimize method names in the ak.Array namespace, but suppose we add one more named hist, which flattens data and fills a histogram from the hist library? With the exception of selecting fields from the Awkward record array (if any), most of the arguments would be passed directly to hist for hist to take care of (and ModuleNotFoundError if hist can't be imported).

Some features that would be nice to have:

  • The ak.Array must either have no nested RecordArrays or must take a string/tuple of strings for which fields to put in the histogram. After applying those fields, the resulting columns must be non-RecordArrays. If the string conforms to [A-Za-z_][A-Za-z_0-9]*, we should use __getattr__ instead of __getitem__ so that Vector properties like "pt" can be used on an array of Cartesian vectors. Maybe that field selector can be a dict if one needs to map Awkward field names to hist axis names that aren't exactly the same.
  • After ensuring that the selected fields (if any) broadcast to one another, they should each be flattened to one-dimensional NumPy arrays before passing to hist's fill. Nones and NaNs are removed (unless hist automatically ignores NaNs). Yes, I like to insist that flattening isn't always the right thing to do, and so it shouldn't be done automatically, but it's so often the right thing to do that it ought to be a default. (Auto-flattening in a method named hist is a different proposition from auto-flattening every time __array__ is called; that would be too much/dangerous.)
  • A pre-existing histogram can be specified as an argument: hist=h, in analogy with Matplotlib's ax=ax. In this case, the existing histogram is filled.
  • If no hist is given, the method should create a new histogram, but what syntax should it use to do so? hist.numpy.histogram? The hist constructor? Would that require users to import hist so they can say things like hist.new.Reg?
  • The return value should always be a histogram object, not a plot. The chain array.hist(*args).plot() is not unwieldy. Or maybe, should it be a proxy that must be continued with a hist.Hist.new-like chain, excepting that it does the fill implicitly?
  • What about scatter plots? Should we just say that we always do 2-dimensional histograms instead of scatter plots, so that we don't run into problems with lots of marker objects from large arrays? (Maybe Matplotlib handles that gracefully when the input type is NumPy?) At least, we don't want the hist return value to sometimes be "a thing that can be plotted" and sometimes be "a plot." It should always be one or the other.

Oh, and this would only be for v2. It might even be listed in a "reasons to switch" that we'll be publicizing next year.

@henryiii and @amangoel185 may have opinions about how this hist integration/ergonomics should go.

jpivarski avatar Jun 03 '22 15:06 jpivarski

awkward_array_goes_here.hist.Reg(100, 0, 5).Double().plot()? Or .hist.new. to give options to add other interfaces like .hist.auto(), for example, which would do the numpy-like auto binning? Using QuickConstruct is the right thing, I think, because it doesn't require imports to work. We can add some infrastructure for auto-fill, I think.

I'd not worry about adding a mapping of hist axes names; you can already set a label to show something different than the name, or change the names later.

I'd also like to remove the need to add the .flatten if we can when passing to hist's .fill. Hist's fill could be smart enough to do that, I think? At least if it matches.

Scatter plots would have a different name, right? awkward_array_goes_here.scatter_plot or something, given it makes a 2D array and then plots it.

henryiii avatar Jun 03 '22 15:06 henryiii

I'm slightly in favour of not adding an ak.hist feature to the Awkward namespace, mainly because I think that these features should be added to hist.[^note]

In your example you note that the awkward array ends up appearing half-way into the code block. Interestingly, I tend to make my fill calls comprise solely of the identifiers, e.g.

...
n_saturated_left = ak.count_nonzero(left.is_saturated, axis=-1)
n_saturated_right = ak.count_nonzero(right.is_saturated, axis=-1)

return (
    Hist.new.Integer(0, 128).Int64().fill(n_saturated_left).fill(n_saturated_right)
)

I think I've been subconciously driven to do this because the complexity of the hist construction otherwise adds too much noise to the code. That said, I think this noise is fairly inescapable - histograms need a lot of detail to be constructed, so perhaps they should be their own line of code.

I ended up making a list of features that I think we/hist could benefit from. I'll enumerate them here:

  • Hist.fill could invoke ak.flatten(..., None)
  • Hist.fill could accept mappings
  • ak.flatten(..., None) could support preserving records
  • External addon namespace

I think first-and-foremostly, making hist.fill call ak.ravel on the array would solve many of my 1D-hist cases. Even for ND histograms, I can take care of broadcasting, but ravelling as well is a bit of a pain, e.g. some code I wrote yesterday:

arrays = np.broadcast_arrays(
    zone, n_saturated > 0, n_ringing_chain, n_non_ringing_chain
)
hist.fill(
    *[
        np.ravel(x)
        for x in arrays
    ]
)

vs

arrays = np.broadcast_arrays(
    zone, n_saturated > 0, n_ringing_chain, n_non_ringing_chain
)
hist.fill(*arrays)

I don't know whether I'd go as far as wanting hist to perform broadcasting automatically. As useful as that sounds, I feel like it could be problematic given that the user should expect that hist doesn't do anything weird with their data. I'm 100% open to people changing my mind on that particular issue.

I wonder whether it would be helpful for hist to support mappings, a bit like matplotlib does with data=. Then we'd also have a solution for the current requirement that the user repeat the field names:

array = ak.zip({
    'mag': [1, 0.2, 0.3, 0.9, 0.8, 0.4],
    'x': [10, 20, 30, 4, 5, 6],
})
hist = Hist.new.Reg(1024, 0, 1, name="mag").Reg(1024, 0, 10, name="x").Int64().fill(array)
hist.plot()

Actually, this now makes me think about #984, which would make it easier to work with matplotlib as well. Right now, we have to flatten the fields manually, and then rebuild the array with ak.zip:

hist.fill(*[ak.flatten(x) for x in x in ak.unzip(array)])

But with better ak.flatten, and adding support to hist for Awkward arrays (both w.r.t flattening and records), we could have

hist.fill(array)

Having written all of that, I'm slightly nervous that I'm proposing adding Awkward features to Hist, which might really be a bad design choice in terms of inversion of responsibility.

If we did add hist support to Awkward, are we making a special exception for it because of the convenience? Would there be merit in adding some kind of "extern" namespace to the array that avoids clashing with user fields and allows for future extension, e.g.

array._.hist
array.lib.hist
array.ext.hist

[^note]: that is, if we add better Awkward support to hist vs hist-specific improvements to Awkward.

agoose77 avatar Jun 03 '22 17:06 agoose77

Regardless of whether hist does broadcasting (it seems like that would make it depend too much on Awkward), what I'm thinking about here is the ergonomics of getting a quick plot. Similarly, we have to_list as a method on ak.Array in addition to it being a function because when you're typing an array expression, going to the beginning of the line, adding "ak.to_list(", then going to the end and adding ")" is more effort than typing ".tolist()" at the end, where you already are. (Especially if you're working on someone else's terminal and ctrl-A doesn't go to the beginning of the line, but instead does Select All!)

What I'm thinking about here would intentionally be a thin interface, relying on hist to interpret its arguments. The QuickConstruct method seems like the right thing to do, but only if it doesn't mean that we have to implement hist proxy work-alikes: intermediate objects of the same sort that hist defines to do QuickConstruct, but our own because we need it to do a fill at the end. If we can accomplish that through some coordination between Awkward and hist, so much the better. (For instance, hist adds a "fill" that can be put at the beginning of the QuickConstruct chain, which is an odd thing to do, but it would give Awkward a backdoor do use hist's own implementation and still insert its directive to fill with a given set of arrays.)

We have other connections with external libraries, all in the src/awkward/_connect directory; this would probably go there, too.

jpivarski avatar Jun 03 '22 18:06 jpivarski

it seems like that would make it depend too much on Awkward

Well, I'd suggest it just uses np.broadcast_arrays as that would be implementation agnostic?

I understand the benefit of the method chaining vs function composition. I actually just switched out my own Dask helper library to use function composition in order to make the abstraction thinner, but I can see that if you don't use Black and/or want to adapt your coding style, it's not as frictionless.

Recognising that my opinion is towards "let users type more if it means easier to read code", I'm going to put it aside for now ;)

What about a simple hist-fill method? Assuming we want to rely explicitly on hist, the user could be allowed to construct the hist factory, or just pass in the field names if required, e.g.

array = ak.zip({
    'mag': [1, 0.2, 0.3, 0.9, 0.8, 0.4],
    'x': [10, 20, 30, 4, 5, 6],
})

array.hist("mag") # 1D
array.hist("mag", "x") # 2D
array.hist(Hist.new.Reg(32, 0, 1, name="mag").Reg(32, 0, 10, name="x").Int64()) # Fill 2D hist

Or, do you think a full-blown ak-namespaced proxy is better, e.g.

array.hist.Reg(32, 0, 1, name="mag").Reg(32, 0, 10, name="x").Int64() # Fill 2D hist

agoose77 avatar Jun 03 '22 19:06 agoose77