ruby-spark

Block or Proc?

Open ondra-m opened this issue 9 years ago • 3 comments

What is the better way to define a function on an RDD?

# as Proc
rdd.map(lambda{|x| x*2})

# as block
rdd.map {|x| x*2}

Which method should be supported?

As Proc:

  • the same way as in Python
  • currently implemented

As block:

  • what about aggregate(zero_value, seq_op, comb_op)?
  • that method needs two functions, but a call can take only one block

Both:

  • what about reduce_by_key(f, num_partitions=nil)?

  • to use a block together with num_partitions, you would have to pass nil for f:

      rdd.reduce_by_key(nil, 2){|x,y| x+y}
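
To make the awkward call above concrete, here is a minimal local sketch of a method that takes the function either as the first argument or as a block. `PairSketch` is a hypothetical stand-in built on a plain Array, not ruby-spark's implementation, and it accepts `num_partitions` only to mirror the signature.

```ruby
# Hypothetical stand-in for an RDD of [key, value] pairs; not ruby-spark's class.
class PairSketch
  def initialize(pairs)
    @pairs = pairs
  end

  # The function may arrive as the first argument or as a block.
  # num_partitions is ignored here; it only mirrors the signature under discussion.
  def reduce_by_key(f = nil, num_partitions = nil, &block)
    func = f || block
    raise ArgumentError, "pass a Proc or a block" if func.nil?

    @pairs.each_with_object({}) do |(key, value), acc|
      acc[key] = acc.key?(key) ? func.call(acc[key], value) : value
    end
  end
end

pairs = PairSketch.new([["a", 1], ["b", 2], ["a", 3]])

# Proc form: the arguments read naturally.
from_proc = pairs.reduce_by_key(lambda { |x, y| x + y }, 2)

# Block form: the first slot has to be filled with nil to reach num_partitions.
from_block = pairs.reduce_by_key(nil, 2) { |x, y| x + y }

# Both calls sum the values per key: "a" maps to 4 and "b" maps to 2.
```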
    

ondra-m, Apr 12 '15 05:04

Ideally all variants would be supported; you can easily check whether a block was passed with block_given?.
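
A minimal sketch of that dispatch, assuming a hypothetical `LocalSketch` class that wraps a plain Array (it is not ruby-spark's RDD, and the error messages are made up for illustration):

```ruby
# Hypothetical stand-in for an RDD backed by an Array; not ruby-spark's class.
class LocalSketch
  def initialize(items)
    @items = items
  end

  # Accepts either a callable argument or a block, but not both.
  def map(func = nil, &block)
    if func && block_given?
      raise ArgumentError, "pass either a Proc or a block, not both"
    elsif func
      LocalSketch.new(@items.map { |x| func.call(x) })
    elsif block_given?
      LocalSketch.new(@items.map(&block))
    else
      raise ArgumentError, "no function given"
    end
  end

  def to_a
    @items.dup
  end
end

rdd = LocalSketch.new([1, 2, 3])
rdd.map(lambda { |x| x * 2 }).to_a  # => [2, 4, 6]
rdd.map { |x| x * 2 }.to_a          # => [2, 4, 6]
```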

Moreover, the Ruby 1.9 lambda syntax could be supported:

rdd.map(&->(x){x*2})

which is just syntactic sugar, but much easier to write.
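
For reference, the `&` in that call hands the lambda over in the block slot, so the receiving method sees it as a block. The sketch below uses plain Ruby only; `describe_block` is a hypothetical helper written for this illustration.

```ruby
double = ->(x) { x * 2 }

# `&` passes the lambda where a block is expected.
[1, 2, 3].map(&double)  # => [2, 4, 6]

# Hypothetical helper that reports what the receiving method sees.
def describe_block(&block)
  { given: block_given?, lambda: block.lambda? }
end

from_lambda = describe_block(&double)      # given is true, lambda is true
from_block  = describe_block { |x| x * 2 } # given is true, lambda is false
```

So a method that checks block_given? cannot tell the two call styles apart by that check alone, although `lambda?` still reveals that the block started life as a lambda.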

deric, Apr 12 '15 07:04

  1. ->(x){x*2} cannot be serialized (for now)

  2. What about .aggregate(zero_value, seq_op, comb_op)?

      # currently
      seq = lambda{|x,y| x+y}
      com = lambda{|x,y| x*y}
    
      rdd.aggregate(1, seq, com)
    

    Which one would be the block: seq_op, comb_op, or neither?
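
To show why two separate callables are involved, here is a local sketch of aggregate, loosely modelled on how Spark describes the operation. `aggregate_sketch` and the nested-array "partitions" are assumptions made for illustration, not ruby-spark code.

```ruby
# Hypothetical local model: each inner array plays the role of one partition.
def aggregate_sketch(partitions, zero_value, seq_op, comb_op)
  # seq_op folds the elements of each partition, starting from zero_value.
  partials = partitions.map do |part|
    part.reduce(zero_value) { |acc, x| seq_op.call(acc, x) }
  end
  # comb_op merges the per-partition results, again starting from zero_value.
  partials.reduce(zero_value) { |acc, partial| comb_op.call(acc, partial) }
end

seq = lambda { |x, y| x + y }
com = lambda { |x, y| x * y }

# Partials are 1+1+2 = 4 and 1+3+4 = 8; combined: 1 * 4 * 8.
aggregate_sketch([[1, 2], [3, 4]], 1, seq, com)  # => 32
```

Since a Ruby call carries at most one block, a block-based variant could replace only one of the two functions, and the other would still have to be passed as a Proc.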

ondra-m, Apr 12 '15 08:04

Great gem. BTW, today I found an interesting way to pass numbers into a function, using format:

v=100
RDD.map( "lambda{ |x| x / %d }" % v )

%s might work for strings too; I have not tested it.
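
A quick local check of what those format strings produce, in plain Ruby only. It does not exercise ruby-spark; `eval` is used here just to confirm that the generated text is valid Ruby.

```ruby
v = 100
source = "lambda{ |x| x / %d }" % v
source                  # => "lambda{ |x| x / 100 }"
eval(source).call(500)  # => 5

name = "spark"
# %s inserts the raw characters, so the result would be read as an identifier.
unquoted = "lambda{ |x| x + %s }" % name  # text: lambda{ |x| x + spark }
# %p inserts name.inspect, which is a quoted string literal.
quoted = "lambda{ |x| x + %p }" % name    # text: lambda{ |x| x + "spark" }
eval(quoted).call("ruby-")                # => "ruby-spark"
```

So for string values, `%p` (or interpolating `value.inspect`) looks like the safer choice than a bare `%s`.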

xjlin0, Aug 23 '15 19:08