ruby-spark
Block or Proc?
What is the better way to define a function on an RDD?
# as Proc
rdd.map(lambda{|x| x*2})
# as block
rdd.map {|x| x*2}
Which variant should be supported?
As Proc:
- the same way as in Python (PySpark: rdd.map(lambda x: x * 2))
- currently implemented
As block:
- what about aggregate(zero_value, seq_op, comb_op)?
- the method needs 2 functions, but a call can pass at most one block
Both:
- what about reduce_by_key(f, num_partitions=nil)?
- if you would like to use a block and num_partitions:
rdd.reduce_by_key(nil, 2){|x,y| x+y}
Supporting all variants would be optimal; you can easily decide whether a block was passed with block_given?.
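For illustration, a minimal sketch of that dispatch (DemoRDD is just a stand-in here, not the real Spark::RDD):

# A block takes priority; otherwise fall back to an explicit Proc/lambda.
class DemoRDD
  def initialize(data)
    @data = data
  end
  attr_reader :data

  def map(f = nil, &block)
    func = block_given? ? block : f
    raise ArgumentError, 'pass a Proc or a block' unless func
    DemoRDD.new(@data.map(&func))
  end
end

rdd = DemoRDD.new([1, 2, 3])
rdd.map { |x| x * 2 }.data           # => [2, 4, 6]
rdd.map(lambda { |x| x * 2 }).data   # => [2, 4, 6]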
Moreover, the 1.9 lambda syntax could be supported:
rdd.map(&->(x){x*2})
which is only syntactic sugar, but much easier to write.
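To spell out the sugar: the & prefix converts the lambda into an ordinary block argument, so in plain Ruby (serialization aside) the two forms behave identically:

double = ->(x){ x * 2 }
[1, 2, 3].map(&double)        # => [2, 4, 6]
[1, 2, 3].map { |x| x * 2 }   # => [2, 4, 6]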
->(x){x*2} cannot be serialized (for now).
What about .aggregate(zero_value, seq_op, comb_op)?
# currently
seq = lambda{|x,y| x+y}
com = lambda{|x,y| x*y}
rdd.aggregate(1, seq, com)
What would the block be? seq_op, comb_op, or nothing?
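One purely illustrative convention (not something ruby-spark implements) would mirror the reduce_by_key(nil, 2){...} idea above: let a block stand in for seq_op and keep comb_op positional.

# hypothetical: the block replaces seq_op, comb_op stays an argument
com = lambda{|x,y| x*y}
rdd.aggregate(1, nil, com) {|x,y| x+y}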
Great gem. BTW, today I found an interesting way to pass numbers in by using string formatting:
v = 100
rdd.map("lambda{ |x| x / %d }" % v)
%s might be good for strings too; I have not tested it.
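If the %s idea works, the quoting would presumably have to live inside the lambda source itself; an untested sketch:

# untested assumption: %s inserts raw characters, so quote inside the template
prefix = "id_"
rdd.map("lambda{ |x| '%s' + x.to_s }" % prefix)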