ruby-spark
Block or Proc?
What is the better way to define a function on an RDD?
# as Proc
rdd.map(lambda{|x| x*2})
# as block
rdd.map {|x| x*2}
Which variant should be supported?
As Proc:
- the same way as in Python (PySpark: rdd.map(lambda x: x * 2))
- currently implemented
As block:
- what about aggregate(zero_value, seq_op, comb_op)?
- the method needs 2 functions, but a call can pass at most one block
Both:
- what about reduce_by_key(f, num_partitions=nil)?
- if you would like to use a block and num_partitions:
rdd.reduce_by_key(nil, 2){|x,y| x+y}
Supporting all variants would be optimal; you can easily decide whether a block was passed with block_given?.
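For illustration, a minimal sketch of that dispatch (DemoRDD is just a stand-in here, not the real Spark::RDD):

# A block takes priority; otherwise fall back to an explicit Proc/lambda.
class DemoRDD
  def initialize(data)
    @data = data
  end
  attr_reader :data

  def map(f = nil, &block)
    func = block_given? ? block : f
    raise ArgumentError, 'pass a Proc or a block' unless func
    DemoRDD.new(@data.map(&func))
  end
end

rdd = DemoRDD.new([1, 2, 3])
rdd.map { |x| x * 2 }.data           # => [2, 4, 6]
rdd.map(lambda { |x| x * 2 }).data   # => [2, 4, 6]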
Moreover, the 1.9 lambda syntax could be supported:
rdd.map(&->(x){x*2})
which is only syntactic sugar, but much easier to write.
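To spell out the sugar: the & prefix converts the lambda into an ordinary block argument, so in plain Ruby (serialization aside) the two forms behave identically:

double = ->(x){ x * 2 }
[1, 2, 3].map(&double)        # => [2, 4, 6]
[1, 2, 3].map { |x| x * 2 }   # => [2, 4, 6]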
->(x){x*2} cannot be serialized (for now).
What about .aggregate(zero_value, seq_op, comb_op)?
# currently
seq = lambda{|x,y| x+y}
com = lambda{|x,y| x*y}
rdd.aggregate(1, seq, com)
What would the block be? seq_op, comb_op, or nothing?
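One purely illustrative convention (not something ruby-spark implements) would mirror the reduce_by_key(nil, 2){...} idea above: let a block stand in for seq_op and keep comb_op positional.

# hypothetical: the block replaces seq_op, comb_op stays an argument
com = lambda{|x,y| x*y}
rdd.aggregate(1, nil, com) {|x,y| x+y}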
Great gem. BTW, today I found an interesting way to pass numbers in by using string formatting:
v = 100
rdd.map("lambda{ |x| x / %d }" % v)
%s might be good for strings too; I have not tested it.
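If the %s idea works, the quoting would presumably have to live inside the lambda source itself; an untested sketch:

# untested assumption: %s inserts raw characters, so quote inside the template
prefix = "id_"
rdd.map("lambda{ |x| '%s' + x.to_s }" % prefix)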