use SparkSQL with Mango / move to Spark 2.x

Open jpdna opened this issue 7 years ago • 4 comments

I want to try Mango with the ADAM Hive partitioning PR: https://github.com/bigdatagenomics/adam/pull/1620

This is going to require more changes than just bumping the Scala and Spark versions. The error I get now during compile is:

/Users/jpaschall/ADAM/jp_adam_hive/mango/mango/mango-core/src/main/scala/org/bdgenomics/mango/models/AlignmentRecordMaterialization.scala:217: error: not found: value predicate
[ERROR]       sc.loadParquetAlignments(fp, predicate = pred, projection = Some(proj))
[ERROR]                                    ^
[ERROR] /Users/jpaschall/ADAM/jp_adam_hive/mango/mango/mango-core/src/main/scala/org/bdgenomics/mango/models/AlignmentRecordMaterialization.scala:217: error: not found: value projection
[ERROR]       sc.loadParquetAlignments(fp, predicate = pred, projection = Some(proj))

This happens because the parameter name changed from predicate to optPredicate.
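For readers hitting the same error, here is a minimal sketch of why a named-argument call breaks when a parameter is renamed. The `loadParquetAlignments` below is a hypothetical stand-in, not ADAM's actual method: the `String`-based types are placeholders, `optPredicate` is the name reported above, and `optProjection` is an assumption made by analogy that should be checked against the ADAM version in use.

```scala
object RenamedParameterSketch {
  // Hypothetical stand-in for the loader; only the parameter names matter here.
  // `optPredicate` is the name reported in this issue; `optProjection` is assumed.
  def loadParquetAlignments(
    pathName: String,
    optPredicate: Option[String] = None,
    optProjection: Option[Seq[String]] = None): String = {
    val pred = optPredicate.getOrElse("none")
    val proj = optProjection.map(_.mkString(",")).getOrElse("all")
    s"path=$pathName predicate=$pred projection=$proj"
  }

  def main(args: Array[String]): Unit = {
    val fp = "reads.adam" // placeholder path
    val pred: Option[String] = Some("start >= 1000")
    val proj = Seq("contigName", "start")

    // Old call site: named arguments must match the declared parameter names,
    // so this no longer compiles ("not found: value predicate"):
    //   loadParquetAlignments(fp, predicate = pred, projection = Some(proj))

    // Updated call site using the new names.
    println(loadParquetAlignments(fp, optPredicate = pred, optProjection = Some(proj)))
  }
}
```

Passing the arguments positionally would also sidestep the rename, but keeping the names makes the call site fail loudly if the signature changes again.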

I can start trying to resolve problems like the one above one by one, but if anyone has tried to make these updates already or has suggestions, just let me know here.

jpdna avatar Jul 27 '17 22:07 jpdna

Thanks @jpdna for bringing this up! Moving to Spark 2.x seems like a necessary change. Is there a reason you are interested in SparkSQL? What do you see Mango using it for?

akmorrow13 avatar Jul 28 '17 03:07 akmorrow13

I want to use the Hive-style partitioning of Parquet files from my recent PR, bigdatagenomics/adam#1620, which gives a very substantial improvement (from 25 seconds to 1 second in one test) in lookup of a 1 MB region from a whole-genome alignment file. That PR relies on the recent SQL/Dataset functionality and needs Spark 2.1.x or higher.
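As a rough illustration of the general technique (not the code in that PR), the sketch below writes a small Dataset with Hive-style partition directories and then filters on the partition columns so Spark can skip directories that cannot match. The `Read` case class, the column names `contigName` and `positionBin`, and the 1 Mb bin size are all assumptions chosen for the example; it needs a Spark 2.x `spark-sql` dependency on the classpath.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

object PartitionedLookupSketch {
  // Hypothetical record: a read start plus a coarse position bin used as a partition key.
  case class Read(contigName: String, start: Long, positionBin: Long)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("partitioned-lookup-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val binSize = 1000000L // assumed bin width of 1 Mb
    val reads = Seq(("chr1", 1500L), ("chr1", 2500000L), ("chr2", 10L))
      .map { case (contig, start) => Read(contig, start, start / binSize) }
      .toDS()

    // Writes directories such as contigName=chr1/positionBin=2/part-*.parquet
    val path = Files.createTempDirectory("reads").resolve("parquet").toString
    reads.write.partitionBy("contigName", "positionBin").parquet(path)

    // Filtering on the partition columns lets Spark prune non-matching directories;
    // the filter on `start` then narrows the rows within the selected bin.
    val hits = spark.read.parquet(path)
      .filter($"contigName" === "chr1" && $"positionBin" === 2L)
      .filter($"start" >= 2000000L && $"start" < 3000000L)

    hits.show()
    spark.stop()
  }
}
```

A region query that spans several bins would filter on a range of `positionBin` values instead of a single one.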

jpdna avatar Jul 28 '17 10:07 jpdna

This sounds awesome! As a starting point, Jenkins is already integrated to test Spark 2. Just like ADAM, the Spark 2 scripts are found here https://github.com/bigdatagenomics/mango/tree/master/scripts.

akmorrow13 avatar Jul 28 '17 16:07 akmorrow13

@jpdna It seems like most of the version updates and SQL imports you need are addressed in https://github.com/bigdatagenomics/mango/pull/307/files . Besides these version changes, what else is required on the Mango side to get these changes in?

akmorrow13 avatar Sep 07 '17 16:09 akmorrow13