
Not making use of multiple cores

Open gnilrets opened this issue 9 years ago • 4 comments

I've got a file with 8M records and I'm trying to split it up into words and do a word count. Here's my code. When I run it, I see 4 new Ruby processes start up on my machine, but only one of them goes to 100% CPU. The others just sit there idle. I don't think it's parallelizing properly. Am I missing a configuration setting somewhere?

require 'ruby-spark'
Spark.config do
  set_app_name 'RubySpark'
  set_master 'local[*]'
  set 'spark.ruby.serializer', 'oj'
  set 'spark.ruby.serializer.batch_size', 2048
end
Spark.start
sc = Spark.sc

tfile = sc.text_file('work/Contact.csv')
words = tfile.flat_map('lambda { |x| x.downcase.gsub(/[^a-z]/, " ").split(" ")}')
words.count

gnilrets avatar Dec 23 '15 23:12 gnilrets
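
The lambda passed to flat_map in the post above can be checked in plain Ruby, without Spark, to confirm that the per-line tokenization itself behaves as intended. This is only a minimal sketch: the sample lines are hypothetical stand-ins for rows of work/Contact.csv, and the per-word tally at the end is an illustration that goes beyond the total count in the original code.

```ruby
# Same block as the string handed to flat_map in the post, as a plain Ruby lambda.
tokenize = lambda { |x| x.downcase.gsub(/[^a-z]/, " ").split(" ") }

# Hypothetical sample lines; the real file contents are not shown in the thread.
sample_lines = [
  "Jane Doe,jane.doe@example.com",
  "John O'Neil,42"
]

# Array#flat_map mirrors what the RDD flat_map is expected to do per record:
# apply the lambda to each line and flatten the resulting arrays of words.
words = sample_lines.flat_map(&tokenize)

# Total number of words, the plain-Ruby analogue of words.count in the post.
total = words.size

# Per-word tally (illustrative only; the original code counts total words).
word_counts = words.each_with_object(Hash.new(0)) { |w, h| h[w] += 1 }

puts total
puts word_counts.inspect
```

This only exercises the per-line logic on a single core. It says nothing about how ruby-spark distributes partitions across its worker processes, which is the actual question in this issue.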

  1. How big is work/Contact.csv?
  2. What is output of sc.default_parallelism?

ondra-m avatar Dec 24 '15 18:12 ondra-m

  1. work/Contact.csv is 5GB with over 8M rows.
  2. sc.default_parallelism = 4

gnilrets avatar Dec 24 '15 20:12 gnilrets

I am seeing the same issue with MRI.

I can't try JRuby, since the JRuby version of ruby-spark seems to be broken at the moment.

bak1an avatar Aug 28 '18 12:08 bak1an

Sorry, but I currently don't have time to maintain this library.

Pull requests are welcome.

ondra-m avatar Sep 03 '18 05:09 ondra-m