Adding second moment of values per key for Typed-API reduce operations

Open oeddyo opened this issue 9 years ago • 6 comments

proof of concept for https://github.com/twitter/scalding/issues/1068

oeddyo avatar May 07 '15 22:05 oeddyo

DO NOT MERGE YET.

oeddyo avatar May 07 '15 22:05 oeddyo

@johnynek Hi Oscar. Thanks for walking me through the code today!

There is one problem I forgot to discuss with you, which I've noted in the code.

So if I'm doing

  var numValuesPerKey = 0L

  val resIter = reduceFnSer.get(key, values)
  while (resIter.hasNext) {
    val tup = Tuple.size(1)
    val t2 = resIter.next

    numValuesPerKey += 1L

    tup.set(0, t2)
    oc.add(tup)
  }
  val valueCountSum = numValuesPerKey
  println("value count = " + numValuesPerKey)

For the test it would print

    value count = 1
    value count = 1
    value count = 1

This should actually include a "value count = 2" for key 1 (please see the test ReduceValueCounterTest for details).

I have the test on branch exie/1068. In sbt, run:

    test-only com.twitter.scalding.ReduceValueCounterTest

It should be easy to replicate: just uncomment the block and comment out the block under it (in Operations.scala, lines 509-524).

oeddyo avatar May 16 '15 04:05 oeddyo

The reason should be that this loop:

  val resIter = reduceFnSer.get(key, caches.toIterator)
  while (resIter.hasNext) {
    val tup = Tuple.size(1)
    val t2 = resIter.next

    tup.set(0, t2)
    oc.add(tup)
  }

is iterating over the reduced result, so it counts the outputs of the reduce function (one per key in this test) rather than the input values for each key. So unfortunately we can't use a var here to count how many values are associated with each key.
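
One possible way around this, sketched here only as an illustration, is to count on the input side: wrap the values iterator before handing it to the reduce function, so the counter advances each time an input value is pulled. `CountingIterator`, `CountingDemo`, and `sumReduce` are hypothetical names, not part of Scalding; `sumReduce` stands in for `reduceFnSer.get` and assumes it has the shape `(key, Iterator[V]) => Iterator[U]` that the snippet above suggests.

```scala
// Hypothetical helper (not Scalding API): counts elements as they are pulled.
final class CountingIterator[A](underlying: Iterator[A]) extends Iterator[A] {
  private[this] var seen = 0L
  // Number of input values consumed so far.
  def count: Long = seen
  override def hasNext: Boolean = underlying.hasNext
  override def next(): A = {
    val a = underlying.next()
    seen += 1L
    a
  }
}

object CountingDemo {
  // Stand-in for reduceFnSer.get: folds all values of a key into one output.
  def sumReduce(key: Int, values: Iterator[Int]): Iterator[Int] =
    Iterator.single(values.foldLeft(0)(_ + _))

  // Returns (number of input values, number of reduced outputs) for one key.
  def run(key: Int, values: List[Int]): (Long, Long) = {
    val counted = new CountingIterator(values.iterator)
    val resIter = sumReduce(key, counted)
    var outputs = 0L
    while (resIter.hasNext) {
      resIter.next()
      outputs += 1L
    }
    // Read the input count only after the output iterator is drained,
    // because a lazy reduce function may pull inputs during iteration.
    (counted.count, outputs)
  }
}
```

With two values for a key, `CountingDemo.run(1, List(10, 20))` gives two inputs and one output, which is the distinction the loop above misses. Whether this wrapper fits into Operations.scala as-is is untested here.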

oeddyo avatar May 18 '15 22:05 oeddyo

@johnynek Does this look good?

oeddyo avatar May 20 '15 22:05 oeddyo

After testing it a couple more times, I can confirm it's a bug. Here's how you can reproduce it:

Check out the code above and uncomment line 523 in scalding/scalding-core/src/main/scala/com/twitter/scalding/Operations.scala

Then in sbt do

    test-only com.twitter.scalding.ReduceValueCounterTest

It prints out a line (corresponding to the code in CoreTest.scala line 1837):

    PRINTING KEY AND GROUP! 0

But if you use the same group name as the key name, then it gives:

    PRINTING KEY AND GROUP! 3

oeddyo avatar Jun 04 '15 21:06 oeddyo
