popstrat

Newer versions of Spark, ADAM, and Sparkling Water for "Genomic Analysis Using ADAM, Spark and Deep Learning", for people who want to reproduce the test

Open car2008 opened this issue 8 years ago • 1 comment

I have some advice for people who want to reproduce the test from "Genomic Analysis Using ADAM, Spark and Deep Learning" using newer versions of the tools:

car2008 • Aug 26 '16 05:08

Hi @nfergu, I have some advice on "Genomic Analysis Using ADAM, Spark and Deep Learning" for people who want to reproduce the test, so I'm posting all the changes here in the hope that they are helpful to others. First, in the pom.xml file:

  • Spark version 1.6.1 replacing 1.2.0
  • ADAM version 0.19.0 replacing 0.16.0
  • Sparkling Water version 1.6.5 replacing 1.2.5
  • H2O version 3.8.2.6 replacing 3.0.0.8 (only the version number needs changing here; H2O does not need to be installed separately once Sparkling Water is installed)
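For reference, these versions can be kept together as properties in the pom.xml; a minimal sketch (apart from adam.version, which the dependency snippets below reference, the property names are just illustrative assumptions):

<properties>
    <spark.version>1.6.1</spark.version>
    <adam.version>0.19.0</adam.version>
    <sparkling-water.version>1.6.5</sparkling-water.version>
    <h2o.version>3.8.2.6</h2o.version>
</properties>

The ADAM dependency declarations also change: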
<dependency>
    <groupId>org.bdgenomics.adam</groupId>
    <artifactId>adam-core</artifactId>
    <version>${adam.version}</version>
</dependency>
<dependency>
    <groupId>org.bdgenomics.adam</groupId>
    <artifactId>adam-apis</artifactId>
    <version>${adam.version}</version>
</dependency>

is modified to

<dependency>
    <groupId>org.bdgenomics.adam</groupId>
    <artifactId>adam-core_2.10</artifactId>
    <version>${adam.version}</version>
</dependency>
<dependency>
    <groupId>org.bdgenomics.adam</groupId>
    <artifactId>adam-apis_2.10</artifactId>
    <version>${adam.version}</version>
</dependency>
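The _2.10 suffix is needed because the newer ADAM releases publish their artifacts per Scala version. Any other Scala-versioned dependency in the pom wants the same suffix; for example, a sketch using the standard Spark coordinates and the spark.version property assumed above:

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.10</artifactId>
    <version>${spark.version}</version>
</dependency>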

Then, in the code:

val header = StructType(Array(StructField("Region", StringType)) ++
      sortedVariantsBySampleId.first()._2.map(variant => {StructField(variant.variantId.toString, IntegerType)}))

is modified to

val header = DataTypes.createStructType(Array(DataTypes.createStructField("Region", DataTypes.StringType, false)) ++
      sortedVariantsBySampleId.first()._2.map(variant => DataTypes.createStructField(variant.variantId.toString, DataTypes.IntegerType, false)))
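
By the way, the Scala-style types API also still works in Spark 1.6 if the nullable flag is given explicitly; a minimal equivalent sketch, assuming import org.apache.spark.sql.types._ is in scope:

// Same header, built with the Scala case-class API instead of the DataTypes factory
val header = StructType(Array(StructField("Region", StringType, nullable = false)) ++
      sortedVariantsBySampleId.first()._2.map(variant => StructField(variant.variantId.toString, IntegerType, nullable = false)))

Next, the conversion to an H2O frame:
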
// Create the SchemaRDD from the header and rows and convert the SchemaRDD into a H2O dataframe
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    val schemaRDD = sqlContext.applySchema(rowRDD, header)
    val h2oContext = new H2OContext(sc).start()
    import h2oContext._
    val dataFrame = h2oContext.toDataFrame(schemaRDD)

is modified to

// Create the SchemaRDD from the header and rows and convert the SchemaRDD into a H2O dataframe
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    //val dataFrame=sqlContext.createDataFrame(rowRDD, header)
    val schemaRDD = sqlContext.applySchema(rowRDD, header)
    val h2oContext = new H2OContext(sc).start()
    import h2oContext._ 
    val dataFrame1 = h2oContext.asH2OFrame(schemaRDD)
    val dataFrame = H2OFrameSupport.allStringVecToCategorical(dataFrame1)
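
As the commented-out line suggests, applySchema is deprecated since Spark 1.3 in favour of createDataFrame, so that line can equally be written as (a sketch under that assumption):

    // createDataFrame replaces the deprecated applySchema
    val schemaRDD = sqlContext.createDataFrame(rowRDD, header)

The extra allStringVecToCategorical step is needed because the newer H2O expects classification response columns to be categorical (enum) rather than plain strings. Next, the frame splitting:
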
// Split the dataframe into 50% training, 30% test, and 20% validation data
    val frameSplitter = new FrameSplitter(dataFrame, Array(.5, .3), Array("training", "test", "validation").map(Key.make), null)

is modified to

// Split the dataframe into 50% training, 30% test, and 20% validation data
   val frameSplitter = new FrameSplitter(dataFrame, Array(.5, .3), Array("training", "test", "validation").map(Key.make[Frame](_)), null)
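
The explicit [Frame] type parameter is needed because Key.make is generic in newer H2O. Retrieving the splits is unchanged; for completeness, a sketch of that step, which produces the training and validation frames used below:

    // Run the splitter and collect the resulting frames
    water.H2O.submitTask(frameSplitter)
    val splits = frameSplitter.getResult
    val training = splits(0)
    val validation = splits(2)

Then the model parameters:
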
// Set the parameters for our deep learning model.
    val deepLearningParameters = new DeepLearningParameters()
    deepLearningParameters._train = training
    deepLearningParameters._valid = validation

is modified to

// Set the parameters for our deep learning model.
    val deepLearningParameters = new DeepLearningParameters()
    deepLearningParameters._train = training._key
    deepLearningParameters._valid = validation._key
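
The ._key accessor is needed because _train and _valid are typed as Key[Frame] in newer H2O instead of taking the frames themselves. The remaining parameters can be set as before, for example (the response column name comes from the header built above; the epoch count is just an illustrative value):

    deepLearningParameters._response_column = "Region"
    deepLearningParameters._epochs = 10

Then the scoring call:
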
// Score the model against the entire dataset (training, test, and validation data)
    // This causes the confusion matrix to be printed
    deepLearningModel.score(dataFrame)('predict)

is modified to

// Score the model against the entire dataset (training, test, and validation data)
    // This causes the confusion matrix to be printed
    deepLearningModel.score(dataFrame)
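
In the newer H2O, score takes just the frame and returns the prediction Frame directly, so the ('predict) selector is no longer needed. If you want the prediction column on its own, a sketch (the first column of the returned frame is conventionally named "predict"):

    val predictions = deepLearningModel.score(dataFrame)
    // Pull out the predicted-class column by name
    val predictColumn = predictions.vec("predict")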

Finally, add the following imports:

import org.apache.spark.sql.types.DataTypes
import hex._
import water.fvec._
import water.support._
import _root_.hex.Distribution.Family
import _root_.hex.deeplearning.DeepLearningModel
import _root_.hex.tree.gbm.GBMModel
import _root_.hex.{Model, ModelMetricsBinomial}

OK, that's all. I have tested these changes successfully; it would be even better if you have other advice. Thank you again!

car2008 • Aug 26 '16 05:08