awkward icon indicating copy to clipboard operation
awkward copied to clipboard

argmin keepdims vs. singletons

Open nsmith- opened this issue 4 years ago • 9 comments

I see what appears to be an inconsistency between using keepdims=True in a reducer and the behavior of ak.singletons:

>>> a = ak.Array([[3, 1, 2], [4, 5], []])
>>> ak.argmin(a, axis=1, keepdims=True)
<Array [[1], [0], [None]] type='3 * var * ?int64'>
>>> ak.singletons(ak.argmin(a, axis=1))
<Array [[1], [0], []] type='3 * var * int64'>

This can get a bit annoying when I want to then use the indexer to slice the array:

>>> a[ak.argmin(a, axis=1, keepdims=True)]
<Array [[1], [4], [None]] type='3 * var * ?int64'>
>>> a[ak.singletons(ak.argmin(a, axis=1))]
<Array [[1], [4], []] type='3 * var * int64'>

It would be nicer if the latter were the default, since although there is some ambiguity between [] and None for the third outer list element, I think [None] gets the bronze medal here.

nsmith- avatar Jul 07 '21 20:07 nsmith-

Related discussion: https://github.com/scikit-hep/awkward-1.0/discussions/710

nsmith- avatar Jul 07 '21 20:07 nsmith-

ak.firsts and ak.singletons aren't fully thought through, at least for their action on more than one dimension. Their behavior can be changed if it's going to make them more consistent with other functions.

You've been their biggest proponent (I wanted to remove them some time ago, but you convinced me otherwise). I'd say you're free to refine their definitions for ndim > 1.

jpivarski avatar Jul 07 '21 22:07 jpivarski

@nsmith- are you advocating for keepdims=True to produce empty lists instead of [None] with a reducer? If so, such a change would be a breaking one, unless we add another parameter to the reducers. Of course, that doesn't answer the question of whether it should be done!

I can clearly see why the existing behaviour for keepdims=True is perhaps non-ideal. Given that, I considered the following possible behaviours:

  • [None] - introduces an option type and increases the length of the empty list.
  • [] - does not introduce an option type, does not increase the length of the empty list.
  • None - introduces an option type, entirely masks the empty list

Deciding between these options is not trivial (in my opinion), and I've reconsidered my position several times in the course of writing this.

Disclaimer I've been thinking about this for probably too long of an evening, so read at your peril.

For irregular arrays, it seems that we need to understand our expectations of a reducer, namely

  • should a reducer always produce fixed-size lists in the reduction axis (for keepdims=True)?

One example of where the answer is no is for Awkward advanced indexing + argmin/max; we need to keep the dimensionality of the lists in order for the lookup to work. For this purpose, returning [None] is not ideal; semantically we want to ignore empty lists in the index, whereas [None] includes them (albeit as None) (as you demonstrate).

Now, to choose between [] and None, I do not have a strong conclusion either way. By choosing None over [] there is some symmetry between keepdims=False (which necessarily introduces None empty sublists) and keepdims=True. But this isn't a compelling case.

I'm sure @jpivarski has some more concrete ideas about the best behaviour here, because my mind is exploding trying to keep up with all of the possible use cases / scenarios.

agoose77 avatar Jul 08 '21 21:07 agoose77

mind is exploding

This happens a lot. I go by intuition but its not always right. One helpful exercise is to consider the regular case:

a = np.array([[3, 1, 2], [4, 5, 3], [1, 2, 3]])

If I want to select the min element index of the inner list, probably the most straightforward is

a[[0, 1, 2], np.argmin(a, axis=1)]

(of course if we really were just interested in a then a.min(axis=1) would be enough, here we're assuming we want to look at some other value b with the same shape) For awkward, that won't work

>>> a[[0, 1, 2], ak.argmin(a, axis=1)]
ValueError: cannot mix missing values in slice with NumPy-style advanced indexing

since the argmin produces an optiontype 3 * ?int64. We could remove that (e.g. ak.fill_none(ak.argmin(a, axis=1), -1)) and it would work. Probably for regular dimensions an explicit simplification of the optiontype should be done since the reducer will never return null. (I guess this is more of a bug than my original issue)

But anyway the same will not work for our original irregular array:

>>> a = ak.Array([[3, 1, 2], [4, 5], []])
>>> a[[0, 1, 2], ak.argmin(a, axis=1)]
ValueError: cannot mix missing values in slice with NumPy-style advanced indexing

so we have to provide some alternative means of indexing. The good news is we have awkward-style advanced indexing, and all three possibilities work:

>>> a[ [[1], [0], [None]] ]
<Array [[1], [4], [None]] type='3 * var * ?int64'>
>>> a[ [[1], [0], []] ]
<Array [[1], [4], []] type='3 * var * int64'>
>>> a[ [[1], [0], None] ]
<Array [[1], [4], None] type='3 * option[var * int64]'>

corresponding to ak.argmin(a, axis=1, keepdims=True), ak.singletons(ak.argmin(a, axis=1)), and some hypothetical alternative. (I guess ak.singletons(ak.argmin(a, axis=1)).mask[ak.num(a)>0]). My take is that option 2 is the best, partly because it has the simplest layout, but I don't have much of an argument beyond that.

nsmith- avatar Jul 09 '21 14:07 nsmith-

One argument against [None] is that it is the only one that produces masked values after flattening on axis 1:

>>> ak.flatten([[1], [4], [None]], axis=1)
<Array [1, 4, None] type='3 * ?int64'>

>>> ak.flatten([[1], [4], []], axis=1)
<Array [1, 4] type='2 * int64'>

>>> ak.flatten([[1], [4], None], axis=1)
<Array [1, 4] type='2 * int64'>

nsmith- avatar Jul 09 '21 15:07 nsmith-

@nsmith- I'd say that's an argument in favour of [None], because the structure remains compatible with the original array. Consider

>>> ak.flatten([[1], [4], [None], [6]], axis=1)
<Array [1, 4, None, 6] type='4 * ?int64'>
>>> ak.flatten([[1], [4], [], [6]], axis=1)
<Array [1, 4, 6] type='3 * int64'>

agoose77 avatar Jul 09 '21 15:07 agoose77

I disagree, flatten almost always alters the broadcast-compatibility of arrays. Meanwhile ak.firsts can maintain the structure (in both cases).

nsmith- avatar Jul 09 '21 15:07 nsmith-

I've been thinking about this some more @jpivarski, and it seems to me that we want argmin et al. to be reducers. This means that we have to support an identity element in the case that the sublist is empty, else we break the interface of reducers (always introducing elements for empty lists).

My understanding is that @nsmith- prefers the ak.singletons result, which doesn't introduce an element for an empty list. i.e., we like the behaviour of singletons.

I'll add the docs label because, whatever happens, we should improve the documentation on this.

agoose77 avatar Nov 18 '22 15:11 agoose77

it seems to me that we want argmin et al. to be reducers. This means that we have to support an identity element in the case that the sublist is empty, else we break the interface of reducers (always introducing elements for empty lists).

I think the default should be mask_identity=True to return None for empty lists. Each of the reducers and reducer-like functions have a different mask_identity default (#1873).

When mask_identity=False, the argmin and argmax functions return -1 for empty lists. This is not a true identity (there is no such thing for argmin and argmax), but it is a value that can never be returned from any non-empty list. In the usual application of slicing an array with the output of argmin/argmax on an array with the same shape, the attempt to select the -1 (last) element from an empty list would fail.

This seems like the best thing to do for argmin/argmax. The only other behavior I can think of is the Awkward 0.x behavior, which was essentially like having keepdims=True always on and [] was generated for empty lists, rather than [None]. But that differs significantly from NumPy's behavior. We shouldn't go back to anything like that.

jpivarski avatar Nov 18 '22 17:11 jpivarski