julia icon indicating copy to clipboard operation
julia copied to clipboard

RFC: require reduce-like functions to use `init` one or more times

Open mbauman opened this issue 1 year ago • 5 comments

OK, here's take two, building on the work that was done in #53871. We preserve the neutral element requirement but guarantee that init is used to start all evaluation chains.

The news summarizes as succinctly as I can:

init is now guaranteed to be used for non-empty collections one or more times. This ensures that all calls to the reducing function op have one argument that is either init or the result from a previous op evaluation. Previously init was explicitly allowed to be omitted from the reduction entirely.

And the docs for mapreduce say a bit more:

If provided, init is the return value for empty collections and is used one or more times as an argument to op for non-empty collections. The init value is not transformed by the function f. Using init ensures that all calls to op have one argument that is either init or the result from a previous op evaluation, and the ordering of these arguments is unspecified. As it may appear in the reduction one or more times, it must be a neutral element for op that does not change the result by being used more than once. It is generally an error to call mapreduce with empty collections without specifying an init value, but in unambiguous cases an identity value for op may be returned; see Base.reduce_empty for more details.

Closes https://github.com/JuliaLang/julia/issues/49042.

mbauman avatar Apr 03 '24 22:04 mbauman

Do we want to specify whether it's op(init, _) or op(_, init)?

jariji avatar Apr 03 '24 22:04 jariji

Do we want to specify whether it's op(init, _) or op(_, init)?

I don't think there's a meaningful reason to do so, because subsequent steps might be op(op(...), _) or op(_, op(...)). That's the interesting thing in my head — that you ~~always have one argument from the collection and one result_or_init. Since the location of the result isn't defined, I don't think the location of the init should be, either.~~

Edit: Oh duh, that's not true. You sometimes have two results. Duh. But my point is that you never have two elements, and that which one may be a value from the collection is unspecified.

But now I need to clean up that language some more. Shucks. 🤦

mbauman avatar Apr 03 '24 22:04 mbauman

Do we want to specify whether it's op(init, _) or op(_, init)?

~~I think this is a fair point, especially since fold[lr] immediately refer to reduce. Either option is fine by me. Using the same order for all three (six including map*) makes switching seamless.~~

~~It most cases, both arguments will likely have the same type, i.e. op(::T, ::T) where T. But surely, there will be some cases where this is not the case, such that it is necessary to specify whether it is op(acc, val) or the other way around.~~

I think I had a non-associative example in mind, for which I should have used foldl/foldr instead. We do want to allow op(acc1, acc2).

jonas-schulze avatar Apr 04 '24 08:04 jonas-schulze

This docstring feels awfully long, especially when viewed in help> or as a function hint in an editor - maybe it's better to have this behavior recorded in the manual, and refer to that section in the docstring?

Seelengrab avatar Apr 04 '24 10:04 Seelengrab

I'm a bit late to the party, but after consideration and discussion with @mbauman, I'm on board with this change. I think the guarantee can be expressed more clearly this way:

When init is not provided, one or more op calls will take collection values as both arguments. When init is provided, collection values will never occur as left arguments to op, only as right arguments; the left argument to op will be init for one or more op calls, or the result of a previous call to op otherwise.

I agree that the wording of the new docs is a bit verbose, but we can always worry about that after the functionality is nailed down.

StefanKarpinski avatar May 01 '24 21:05 StefanKarpinski

The argument ordering is very hard to express without explicitly invoking an op call tree. Perhaps that's the best way to write it anyway. Something like

The collection values are guaranteed to appear as leaves in the tree of op calls in order, but that does not guarantee that those op calls are evaluated in that order. For example, the left and right halves of an array may be reduced concurrently. When the results of those two reductions are combined the left and right reduced values will respectively be the left and right arguments to the final op call.

StefanKarpinski avatar May 02 '24 14:05 StefanKarpinski

I don't like requiring that init needs to be a left-argument to op because that means that foldr would not be a valid implementation for reduce. But otherwise I agree. I'm seeing if I can get the description a little more clear with fresh eyes here.

mbauman avatar Jul 24 '24 15:07 mbauman

Is "init is the right argument of op in foldr" documented?

jariji avatar Jul 24 '24 19:07 jariji

As an example, yes.

mbauman avatar Jul 24 '24 19:07 mbauman

OK, I've pushed a new re-wording. I find this more clear, personally.

We seem have a very solid consensus that "reduce-like functions use init one or more times" so I've removed the RFC from the title. I think I've faithfully represented that idea in the docs with this latest push.

Of course comments on the wordings are still very welcome, but I am much happier with the state of this now.

mbauman avatar Jul 24 '24 20:07 mbauman

It makes me a tad anxious that we don't guarantee whether the init is a first or second argument to op. Could there be a situation where this is an issue, where we might need distinct left and right neutral elements? Does a neutral element always commute for an associative operation? This might be a trivial result from group theory, but I don't know the answer.

mikmoore avatar Jul 25 '24 16:07 mikmoore

Does a neutral element always commute for an associative operation?

No: https://en.wikipedia.org/wiki/Semigroup#Identity_and_zero

nsajko avatar Jul 25 '24 16:07 nsajko

A two-sided identity (or just identity) is an element that is both a left and right identity. Semigroups with a two-sided identity are called monoids. A semigroup may have at most one two-sided identity. If a semigroup has a two-sided identity, then the two-sided identity is the only one-sided identity in the semigroup. If a semigroup has both a left identity and a right identity, then it has a two-sided identity (which is therefore the unique one-sided identity).

So a monoid is required, not just associativity.

nsajko avatar Jul 25 '24 16:07 nsajko

I have two reasons why that's not a major concern in my view. One is that there are three "types" of arguments that get passed to op:

  • init itself
  • values from itr — which necessarily could be either left- or right-args (but not both when init is provided)
  • return values from previous op calls — which are also necessarily left- or right-args (or both)

Because the init value is also the return value for empty collections, it should be of a similar "type" (yes, I mean that two ways: there's the literal Type, but it also applies more generally as a softer "interpretation" or "category of thing") as op's return values. And op necessarily needs to be able to accommodate previous return values in either slot.

The second is more simplistic, but it really seems like foldr should be a valid implementation for reduce, and foldr(=>, [1,2,3], init=0) is (and should be) 1 => (2 => (3 => 0)).

mbauman avatar Jul 25 '24 16:07 mbauman

Perhaps more succinctly, just look at foldl. With init, it will pass all its itr values as right-args. All the left args (except one) are results from previous operations. And init is that final left-arg. Flip the associativity with -r and the argument orderings flip. So init's placement flips. It makes sense that an arbitrary associativity would grant this leeway.

If you care about argument orderings between init and an itr value, you probably care about the argument orderings between an op result and an itr value — and so then you care about the associativity. And you should use foldl or foldr.

mbauman avatar Jul 25 '24 17:07 mbauman

When the operation is +, should we check iszero(init) for backward compatibility and print a deprecation warning if it is false (calling mapfoldl as a fallback)?

stevengj avatar Jul 25 '24 20:07 stevengj

Well the previous documentation explicitly says that it may or may not be used... but it'd probably be a good idea to get an idea of how many folks were relying on the old implementation. I say we first try the "full pairwise" in #52397 so we can run PkgEval there to gather the data. A deprecation path would be kind but would probably have to be fairly finely scoped (the generic iszero can throw method errors).

mbauman avatar Jul 25 '24 21:07 mbauman

One behavior I'd like to preserve is

julia> sum(fill(-0.0, 0))
0.0

julia> sum(fill(-0.0, 3))
-0.0

~~It seems that the way to get this is to make sum on IEEEFloat use init=-0.0 (the IEEEFloat additive identity) but allow (the option of) a distinct return value of +0.0 for the isempty case. I hate that this makes the implementation more complicated but I don't see a great way around it.~~ Thanks to mbauman for clarifying the missing-init case below. It was probably made clear a while ago but I came back to this after seeing the discussion on 55318 today. If I'd re-read more carefully I might have noticed that probably nobody was proposing we provide default values for init.

I would be somewhat unhappy if a sum of -0.0s yielded +0.0 (sum(x) would differ from +(x...) in a not-just-reassociation way), and I expect the change would break things (both existing and yet-unwritten) in at least a few places, but I know most people would be unhappy if an empty sum yielded -0.0 (and this would almost certainly break things in many places).

(+0.0 vs -0.0 might seem pretty inconsequential until you write inv(sum(x)))

mikmoore avatar Aug 30 '24 14:08 mikmoore

I've become completely radicalized against initial value guesswork when it's not specified, and the proposed behavior change here is only about what to do when init is specified.

Guessing at init is simply so bug-prone. I think the only sensible answer here is ensuring we always do what the change in #55318 does for dimensional reductions. That is:

  • 0 values:
    • Have init? Return it. Otherwise:
    • Return mapreduce_empty (usually an error, sometimes an unambiguous value like 0.0)
  • 1 value (a):
    • Have init? Return op(init, f(a)). Otherwise:
    • Return mapreduce_first (typically f(a))
  • 2+ values (a1, a2):
    • Have init? Start every chain with op(op(init, f(a1)), f(a2)) and continue. Otherwise:
    • Start every chain with op(f(a1), f(a2)) and continue

In other words, neither 0.0 nor -0.0 in your above cases come from a divined-out init element that wasn't passed.

mbauman avatar Aug 30 '24 15:08 mbauman