scalding
Adding second moment of values per key for Typed-API reduce operations
Proof of concept for https://github.com/twitter/scalding/issues/1068
DO NOT MERGE YET.
@johnynek Hi Oscar. Thanks for walking me through the code today!
There is one problem I forgot to discuss with you, which I have noted in the code.
If I do the following:
var numValuesPerKey = 0L
val resIter = reduceFnSer.get(key, values)
while (resIter.hasNext) {
  val tup = Tuple.size(1)
  val t2 = resIter.next
  numValuesPerKey += 1L
  tup.set(0, t2)
  oc.add(tup)
}
val valueCountSum = numValuesPerKey
println("value count = " + numValuesPerKey)
then for the test it prints:
value count = 1
value count = 1
value count = 1
But there should actually be a "value count = 2" line for key 1 (please see the test ReduceValueCounterTest for details).
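To make the symptom concrete outside of Cascading, here is a minimal sketch in plain Scala. Everything in it is an assumption for illustration: `sumReduce` is a hypothetical stand-in for `reduceFnSer.get`, and the input data is made up (it is not the data from ReduceValueCounterTest). It only shows that counting elements of the result iterator gives 1 per key whenever the reduce emits a single result.

```scala
object ResultCountSketch {
  // Hypothetical stand-in for reduceFnSer.get: sums the values and
  // yields exactly one result per key.
  def sumReduce(key: Int, values: Iterator[Int]): Iterator[Int] =
    Iterator.single(values.sum)

  // Mirrors the loop in the comment above: the counter is incremented once
  // per element of the *result* iterator, not once per input value.
  def countResults(key: Int, values: Iterator[Int]): Long = {
    var numValuesPerKey = 0L
    val resIter = sumReduce(key, values)
    while (resIter.hasNext) {
      resIter.next()
      numValuesPerKey += 1L
    }
    numValuesPerKey
  }

  def main(args: Array[String]): Unit = {
    // Made-up input: key 1 has two values, keys 2 and 3 have one each.
    val grouped = List(1 -> List(10, 20), 2 -> List(30), 3 -> List(40))
    grouped.foreach { case (k, vs) =>
      println("value count = " + countResults(k, vs.iterator))
    }
  }
}
```

With this made-up input, every key reports a count of 1, including key 1, which has two values.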
The test is on branch exie/1068; run it with
test-only com.twitter.scalding.ReduceValueCounterTest
It should be easy to replicate: just uncomment the block and comment out the block under it (Operations.scala, lines 509-524).
The reason is probably that the loop
val resIter = reduceFnSer.get(key, caches.toIterator)
while (resIter.hasNext) {
  val tup = Tuple.size(1)
  val t2 = resIter.next
  tup.set(0, t2)
  oc.add(tup)
}
iterates over the reduced result, so it runs once per key rather than once per input value. Unfortunately, that means we can't use a var inside this loop to count how many values are associated with each key.
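One possible way around this, sketched in plain Scala rather than against the real Operations.scala code, is to count on the input side: wrap the values iterator so a counter is bumped as the reduce function consumes it. The names `InputCountSketch` and `reduceWithCount` are hypothetical, and `reduceFn` is only a stand-in for `reduceFnSer.get`; I have not tried this inside the actual buffer.

```scala
object InputCountSketch {
  // Hypothetical helper: runs reduceFn over the values for one key and
  // returns the results together with the number of input values consumed.
  def reduceWithCount[K, V, R](key: K, values: Iterator[V])(
      reduceFn: (K, Iterator[V]) => Iterator[R]): (List[R], Long) = {
    var numValuesPerKey = 0L
    // Count each input value as the reduce function pulls it.
    val counted = values.map { v => numValuesPerKey += 1L; v }
    // Drain the result iterator first, so that a lazy reduceFn has
    // consumed its input before the counter is read.
    val results = reduceFn(key, counted).toList
    (results, numValuesPerKey)
  }
}
```

Note that the counter only reflects values the reduce function actually consumed, so it has to be read after the result iterator is exhausted; a reduce function that stops early would undercount.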
@johnynek Does this look good?
After testing it a couple more times, I can confirm it's a bug. Here's how you can reproduce it:
Check out the code above and uncomment line 523 in scalding/scalding-core/src/main/scala/com/twitter/scalding/Operations.scala
Then, in sbt, run
test-only com.twitter.scalding.ReduceValueCounterTest
It prints a line (corresponding to the code in CoreTest.scala, line 1837): PRINTING KEY AND GROUP! 0
But if you use the same group name as the key name, it prints: PRINTING KEY AND GROUP! 3