mango
use SparkSQL with Mango / move to Spark 2.x
I want to try Mango with the ADAM Hive partitioning PR: https://github.com/bigdatagenomics/adam/pull/1620
This is going to require more changes than just bumping the Scala and Spark versions. The error I now get during compile is:
/Users/jpaschall/ADAM/jp_adam_hive/mango/mango/mango-core/src/main/scala/org/bdgenomics/mango/models/AlignmentRecordMaterialization.scala:217: error: not found: value predicate
[ERROR] sc.loadParquetAlignments(fp, predicate = pred, projection = Some(proj))
[ERROR] ^
[ERROR] /Users/jpaschall/ADAM/jp_adam_hive/mango/mango/mango-core/src/main/scala/org/bdgenomics/mango/models/AlignmentRecordMaterialization.scala:217: error: not found: value projection
[ERROR] sc.loadParquetAlignments(fp, predicate = pred, projection = Some(proj))
because the parameter name changed from predicate to optPredicate.
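For illustration, here is a minimal sketch of what the call-site fix looks like. It uses a hypothetical stand-in for the loader so it compiles without ADAM on the classpath; the stub's parameter types are simplified placeholders, and the name optProjection is an assumption inferred from the second "not found: value projection" error, not something confirmed in this thread.

```scala
// Hypothetical stand-in for the ADAM loader, only to show the named-argument rename.
// optPredicate is the name given in this thread; optProjection is an assumption.
object RenamedParams {
  def loadParquetAlignments(
    pathName: String,
    optPredicate: Option[String] = None,
    optProjection: Option[Seq[String]] = None): String = {
    val p = optPredicate.getOrElse("none")
    val f = optProjection.map(_.mkString(",")).getOrElse("all")
    s"path=$pathName predicate=$p projection=$f"
  }

  def main(args: Array[String]): Unit = {
    val fp = "alignments.adam"                         // placeholder path
    val pred: Option[String] = Some("start >= 1000")   // placeholder predicate
    val proj = Seq("contigName", "start", "end")       // placeholder projection

    // Old call site (fails with "not found: value predicate"):
    //   loadParquetAlignments(fp, predicate = pred, projection = Some(proj))
    // Updated call site using the renamed parameters:
    println(loadParquetAlignments(fp, optPredicate = pred, optProjection = Some(proj)))
  }
}
```

Because Scala resolves named arguments against the parameter names in the method signature, any rename upstream breaks every call site that passes those arguments by name, which is why the error surfaces at compile time rather than at runtime.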
I can start trying to resolve problems like the one above one by one, but if anyone has tried to make these updates already or has suggestions, just let me know here.
Thanks @jpdna for bringing this up! Moving to Spark 2.x seems like a necessary change. Is there a reason you are interested in SparkSQL? In what ways do you see Mango using it?
I want to use the Hive-style partitioning of Parquet files from my recent PR, bigdatagenomics/adam#1620, which gives a very substantial improvement (from 25 seconds to 1 second in one test) in lookup of a 1 MB region from a whole-genome alignment file. That PR needs the recent SQL/Dataset functionality and requires Spark 2.1.x or higher.
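As a rough sketch of the general technique (not the code in the PR), Hive-style partitioning in Spark 2.x writes one directory per partition value, so a filter on the partition columns only touches the matching directories. The column names contigName and positionBin, the 1,000,000-base bin size, and the sample rows below are all assumptions chosen for illustration.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

object HivePartitionSketch {
  // Writes a tiny partitioned dataset and returns the number of rows in one region bin.
  def run(): Long = {
    val spark = SparkSession.builder()
      .appName("hive-partition-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val binSize = 1000000L // assumed bin width in bases
    val reads = Seq(
      ("chr1", 150L, "readA"),
      ("chr1", 2500000L, "readB"),
      ("chr2", 42L, "readC")
    ).toDF("contigName", "start", "readName")
      // Derive the bin each read starts in; the cast truncates toward zero.
      .withColumn("positionBin", ($"start" / binSize).cast("long"))

    val out = Files.createTempDirectory("partitioned").resolve("reads.parquet").toString

    // Produces directories like contigName=chr1/positionBin=2/ under `out`.
    reads.write.partitionBy("contigName", "positionBin").parquet(out)

    // Filtering on partition columns lets Spark skip non-matching directories.
    val n = spark.read.parquet(out)
      .filter($"contigName" === "chr1" && $"positionBin" === 2L)
      .count()

    spark.stop()
    n
  }

  def main(args: Array[String]): Unit = println(run())
}
```

A region query then maps its start and end coordinates to the bins it overlaps and filters on those, rather than scanning the whole file.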
This sounds awesome! As a starting point, Jenkins is already integrated to test Spark 2. As in ADAM, the Spark 2 scripts are found here: https://github.com/bigdatagenomics/mango/tree/master/scripts
@jpdna It seems like most of the version updates and SQL imports you need are addressed in https://github.com/bigdatagenomics/mango/pull/307/files. Besides these version changes, what else is required on the Mango side to get these changes in?