algebird Adding MutableSumAll and Semigroup#bufferedSumOption,statefulSummer

Refactoring some Semigroup#sumOption optimizations to use these.

Closes #291

These should be getting test coverage indirectly, but I'm open to suggestions for additional tests to add.

Apr 04 '14 21:04 jnievelt

I like this. The only thing I wonder about is whether it's worth having both sumOption and bufferedSumOption, or whether it should just be assumed that sumOption might buffer, and it perhaps takes an extra parameter for the buffer size which you can tune if necessary (which it would pass through to statefulSummer in the default implementation).

Apr 04 '14 23:04 avibryant

I like having BufferedSumAll#operate leverage Semigroup summing. But without a separate target for BufferedSumAll#operate, it becomes easy to shoot one's self in the foot via:

override def statefulSummer() = new BufferedSumAll(100)(this)

Having said that, this is half a real candidate and half an effort to articulate the approach as described in #291.

Apr 05 '14 00:04 jnievelt

Sorry, I'm not following that last. Are you just worried that people will set up an infinite recursion from sumOption -> statefulSummer -> BufferedSumAll -> sumOption?

Apr 05 '14 14:04 avibryant

Yes, that's correct.

Apr 07 '14 17:04 jnievelt

@jnievelt I'm super excited that you are working on optimizations, but so is Ian (as a big part of his quarterly goals). Can you guys sit together and map out a plan so that we can minimize duplication of effort?

Apr 16 '14 17:04 johnynek

So increasing the API surface area i'm a little concerned with, given we won't be removing sumOption, maybe we could deprecate it if we want to go with this approach? Do we have a solid notion when this will improve perf? I'm not sure it doesn't make our API more complex, with possibly limited gains from this in our most common use cases (scalding, summingbird) at least.

Apr 18 '14 02:04 ianoc

I don't think there's a huge perf gain -- in summingbird online (and similar applications) it would enable arbitrarily long accumulations without buffering (and without GC?).

The thing I like more about it is the cleaner separation of logic. Rather than writing all of SketchMap's current #sumOption, we can simply designate that it will do a BufferedSumAll and the logic for reducing down a buffer. For HLL/Map, we have clearly labeled methods. I believe this sort of simplification (in client code) was at the heart of #291, and for this much I like the idea of adding #statefulSummer.

I certainly understand the concerns around #bufferedSumOption, and I'm fairly neutral on whether it's added or not. The only reason I used it in this PR instead of putting that logic into an overloaded BufferedSumAll#operate is that it prevents a very simple infinite recursion.

Apr 18 '14 03:04 jnievelt

I think this proposal has great merits. Here is some more thinking around this idea (thanks to @ianoc for helping me grokking this): The notion of StatefulSummer is the dual to the sumOption (I’m making a parallel to Erik Meijer’s point of Observable/Iterable duality) at least with the following definition:

Summer[T] {
   def put(v: T): Unit
   def getSumAndReset: Option[T]
}

Semigroup[T] {
   def plus(x: T, y: T) : T
   def sumOption(IterableOnce[T]): Option[T] // pull model
   def summer: Summer[T]  // push model
}

In one case we have a pull model: we pass an Iterator to sumOption that will consume it all and return the corresponding sum.
In the other case we have a push model: we get a Summer from the Semigroup and we pass (put(v)) it all the values and get the resulting sum (getSumAndReset)

Based on that definition, they are exactly the same thing except one pushes and the other pulls. This also solves the convergence problem that the current StatefulSummer has when put can return and intermediary result. The reason is the current StatefulSummer definition is used for very a different situation. When the monoid is defined for types that may not fit all in memory and we just want temporary partial aggregation. This is used to sum Map[K,T:Semigroup] to optimize map-side partial aggregation and not necessarily materialize the full sum. So we should separate those 2 use cases. For example when we pass a Semigroup to sumByKey in summingbird we expect to be able to sum all the values for the same key. Internally we may use SG[Map[K,T:SG]] to optimize it with SummingCache.

Here are some example on how to deal with the hypothetical new API:

for simple cases where we define only plus:

PlusBasedSG {

   def sumOption(i: IterableOnce[T]) = {
       i.reduceLeftOption(Semigroup.plus(_, _))
   }

   def summer = new Summer {
      private current: Option[T]
      def put(v: T) = current = current.map(_ + v).orElse(v)
      def get = {
          val r = current 
          current = None
          r
      }
   }

}

For cases where we implement a summer: Note that this doesn’t restrict in any way how optimizations might be implemented depending on how many element have been pushed so far.


SummerBasedSG {

   def sumOption(i: IterableOnce[T]) = {
     i.foldLeft(summer) (_.put(_)).get
   }

   def summer = new Summer {
      /** whatever state is needed */
      def put(v: T) = /** logic suitable to optimization depending on the number of elements */
      def get = /** flush the state and return a result */
   }

}

For cases where sumOption is implemented: (notice that “batching” is now implemented once, to go from push to pull model)


SumOptionBasedSG {

   def sumOption(i: IterableOnce[T]) = /** existing efficient sumoption */

   def summer = new Summer {
      val state: Buffer[T]
      def put(v: T) = {
           state += v
           if (state.size > max) state += get // batching implemented once
      }
      def get = {
          val r = sumOption(state)
          state.clear
          r
      }
   }

}

Here is what the eventually monoid could look like in that case:


EventuallySG[E, O](lsg: SG[E], rsg: SG[O]) extends SummerBasedSG[Either[E,O]] {

   def summer = new Summer {
      private var state: Right(rsg.summer)
      def put(v: T) = {
           (state, v) match {
               case (Left(se), Left(ve)) => se.put(ve)
               case (Left(se), Rigth(vo)) => se.put(convert(vo))
               case (Right(so), Left(ve)) => {
                   val soFar = so.get
                   state = Left(rsg.summer)
                   state.put(convert(soFar))
                   state.put(ve)
               }
               case (Right(so), Right(vo)) => so.put(vo)
      }
      def get = {
          state match {
              case Left(se) => {
                  state = Right(rsg.summer)
                  Left(se.get)
              }
              case Right(so) => Right(se.get)
          }
      }
   }

}

Now we can choose the push model vs the pull model depending on what stile is preferred depending on the monoid.

As a side point I think we should decouple the new API discussion from the proposed changes to existing HLL monoid as I think they can progress separately.

Apr 18 '14 18:04 julienledem

Wanted to comment here:

Thanks to Joe and Julien for pointing out this duality between the pull (sumOption) and push (observer/summer/etc...) approach.
We have seen a lot of regressions on perf, so I think we need to make sure that each pref pull request has caliper benchmarks and we indeed see a non-trivial win.
Ian has voiced some fair concerns regarding complexity of a core trait, such as Semigroup. He proposed to me (and perhaps others offline) the idea of making another trait: just making something up: SemigroupWithSummer[T] extends Semigroup[T], which would let people opt into this behavior. That might satisfy more people, I don't know.
Summer[T] in the above could be immutable:

trait Summer[T] {
  def increment(that: T): Summer[T]
  def result: Option[T]
}

Do we have reason to think this formulation would cost performance? If we push onto a list with increment, and if we keep an atomic ref to Either[T, List[T]], we could swap the list out for a computed sum when result is called, and the whole thing is still referentially transparent.

I agree we should break up the pull requests here:

HLL optimizations with benchmarks
Observer/Summer pattern, benchmarks of mutable vs immutable implementations. Immutable is worth a small cost, in my view.
Example Composed Summers (e.g. Eventually) and caliper comparisons to the existing approaches.

Lastly: I think it is great to have design discussions. This is a great PR for that reason: it sparks a lot of thought.

Apr 19 '14 23:04 johnynek

A lot of this sounds really good. Adding a separate trait could be done relatively cleanly, I think. I would also encourage folks to comment on the original issue inasmuch as they are thinking generally about it (and not just this particular approach).

The push/pull comparison is quite valid, but I don't think it tells the entire story. In particular, I think the #summer should be something that, e.g., map-side reduce could use (though I'm not entirely sure why it, of all things, wouldn't use an explicit sum-all), especially if we add a maxBufferedCount parameter.

I'm also fine with a PR breakdown, and I'll proceed with this one as "Oberver/Summer pattern" unless there is an objection. If/when it gets merged we can file issues for the other two.

Regarding perf, I'll work on writing up caliper tests for various uses of the new code in comparison to existing approaches. Is there a place in the tree for such tests (even if they aren't part of 'sbt test' etc.)? Also should we update CONTRIBUTING.md with guidelines on this?

Finally, I would mention that this change isn't being driven by an acute need of mine (it just seemed like an interesting problem), and as a result I work on it only occasionally. If there is a need elsewhere, feel free to let me know and/or pick it up and run with it.

Apr 20 '14 18:04 jnievelt

@jnievelt Quickly just on the benchmark stuff, algebird-caliper has some of the first additions for the benchmarks. Caliper is proving to be a bit of a nightmare to work with however so I'll probably be looking to see if there's a good alternative. Once we have it set we should definitely update those guidelines. But continue/make entries in the caliper one, it has a runner there you can use sbt run against to run some tests with.

Apr 20 '14 18:04 ianoc

@johnynek 3) a new trait sounds good to me however it is easy to implement in terms of plus. 4) I think the whole point is that Summer is mutable. Otherwise we can't really use a mutable data structure to keep the state and we may as well use plus. An immutable Summer is no different from the current plus on semigroup.

Apr 21 '14 21:04 julienledem

Sorry my bad, git foo on cmd line broke stuff and closed all of these

Aug 04 '15 00:08 ianoc

Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.

Joe Nievelt seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you have already a GitHub account, please add the email address used for this commit to your account.
_{You have signed the CLA already but the status is still pending? Let us recheck it.}

Jul 18 '19 15:07 CLAassistant