ruby-spark
Not making use of multiple cores
I've got a file with 8M records and I'm trying to split it up into words and do a word count. Here's my code. When I run it, I see 4 new Ruby processes start up on my machine, but only one of them goes to 100% CPU. The others just sit there idle. I don't think it's parallelizing properly. Am I missing a configuration setting somewhere?
require 'ruby-spark'

Spark.config do
  set_app_name 'RubySpark'
  set_master 'local[*]'
  set 'spark.ruby.serializer', 'oj'
  set 'spark.ruby.serializer.batch_size', 2048
end

Spark.start
sc = Spark.sc

tfile = sc.text_file('work/Contact.csv')
words = tfile.flat_map('lambda { |x| x.downcase.gsub(/[^a-z]/, " ").split(" ")}')
words.count
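For reference, here is a minimal plain-Ruby sketch of what the lambda passed to flat_map does to each line, with no Spark involved. The sample rows are hypothetical stand-ins for lines of work/Contact.csv, and the per-word tally is an assumption about what "word count" is meant to produce; the code above only returns the total number of tokens.

```ruby
# Same tokenization as the lambda in the snippet above, run in plain Ruby.
tokenize = lambda { |x| x.downcase.gsub(/[^a-z]/, " ").split(" ") }

# Hypothetical sample rows (not taken from the real file).
sample_lines = [
  "Smith,John,john.smith@example.com",
  "Doe,Jane,JANE@example.com"
]

# Array#flat_map mirrors what the RDD flat_map is asked to do per line.
words = sample_lines.flat_map(&tokenize)

# Total number of tokens, which is what words.count reports.
total = words.size

# Per-word frequencies, in case an actual word count is the goal.
counts = words.each_with_object(Hash.new(0)) { |w, h| h[w] += 1 }

puts total
puts counts.inspect
```

This is only a way to check the tokenization on a few rows before running the full 5GB job; it says nothing about how the work is distributed across processes.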
- How big is work/Contact.csv?
- What is the output of sc.default_parallelism?
1. work/Contact.csv is 5GB with over 8M rows.
2. sc.default_parallelism returns 4.
I am seeing the same issue with MRI.
I can't try JRuby, since the JRuby version of ruby-spark seems to be broken at the moment.
Sorry, but I currently don't have time to maintain this library.
Pull requests are welcome.