ruby-spark Not making use of multiple cores

Open gnilrets opened this issue 9 years ago • 4 comments

I have a file with 8M records that I'm trying to split into words for a word count. When I run the code below, I see 4 new Ruby processes start up on my machine, but only one of them shoots to 100% CPU. The others just sit idle, so I don't think the job is being parallelized properly. Am I missing a configuration setting somewhere?

require 'ruby-spark'
Spark.config do
  set_app_name 'RubySpark'
  set_master 'local[*]'
  set 'spark.ruby.serializer', 'oj'
  set 'spark.ruby.serializer.batch_size', 2048
end
Spark.start
sc = Spark.sc

tfile = sc.text_file('work/Contact.csv')
words = tfile.flat_map('lambda { |x| x.downcase.gsub(/[^a-z]/, " ").split(" ")}')
words.count