Mango Build Error
Hello,
I am new to genomics projects. I am running Mango and encountering build errors. Any help is greatly appreciated.
I have the following setup:
Package Versions:
- Python 2.7.5
- java version "1.8.0_171"
- Scala code runner version 2.11.12
- Hadoop 3.1.0
- Spark 2.3.1
- npm 3.10.10
My .bashrc file entries
export JAVA_HOME=/usr
export SPARK_HOME=/opt/spark/spark-2.3.1-bin-hadoop2.7
export PATH=$PATH:$SPARK_HOME/bin
export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.10.7-src.zip:$PYTHONPATH
ASSEMBLY_DIR=/home/hadoop/mango/mango-assembly/target
ASSEMBLY_JAR="$(ls -1 "$ASSEMBLY_DIR" | grep "^mango-assembly[0-9A-Za-z\_\.-]*\.jar$" | grep -v javadoc | grep -v sources || true)"
export PYSPARK_SUBMIT_ARGS="--jars ${ASSEMBLY_DIR}/${ASSEMBLY_JAR} --driver-class-path ${ASSEMBLY_DIR}/${ASSEMBLY_JAR} pyspark-shell"
Command: mvn package -P python
BUILD ERROR
bdgenomics/mango/test/alignment_test.py::AlignmentTest::test_coverage_distribution FAILED
bdgenomics/mango/test/alignment_test.py::AlignmentTest::test_fragment_distribution FAILED
bdgenomics/mango/test/alignment_test.py::AlignmentTest::test_indel_distribution FAILED
bdgenomics/mango/test/alignment_test.py::AlignmentTest::test_indel_distribution_maximal_bin_size FAILED
bdgenomics/mango/test/alignment_test.py::AlignmentTest::test_indel_distribution_no_elements FAILED
bdgenomics/mango/test/alignment_test.py::AlignmentTest::test_mapq_distribution FAILED
bdgenomics/mango/test/alignment_test.py::AlignmentTest::test_visualize_alignments FAILED
bdgenomics/mango/test/coverage_test.py::CoverageTest::test_coverage_distribution FAILED
bdgenomics/mango/test/coverage_test.py::CoverageTest::test_example_coverage FAILED
bdgenomics/mango/test/distribution_test.py::DistributionTest::test_cumulative_count_distribution FAILED
bdgenomics/mango/test/distribution_test.py::DistributionTest::test_fail_on_invalid_sample FAILED
bdgenomics/mango/test/distribution_test.py::DistributionTest::test_normalized_count_distribution FAILED
bdgenomics/mango/test/distribution_test.py::DistributionTest::test_sampling FAILED
bdgenomics/mango/test/feature_test.py::FeatureTest::test_visualize_features FAILED
bdgenomics/mango/test/notebook_test.py::NotebookTest::test_alignment_example FAILED
bdgenomics/mango/test/notebook_test.py::NotebookTest::test_coverage_example FAILED
bdgenomics/mango/test/notebook_test.py::NotebookTest::test_example FAILED
bdgenomics/mango/test/variant_test.py::VariantTest::test_visualize_variants FAILED
=================================== FAILURES ===================================
___________________ AlignmentTest.test_coverage_distribution ___________________
bdgenomics/mango/test/__init__.py:65: in setUp
self.ss = SparkSession.builder.master('local[4]').appName(class_name).getOrCreate()
/opt/spark/spark-2.3.1-bin-hadoop2.7/python/pyspark/sql/session.py:173: in getOrCreate
sc = SparkContext.getOrCreate(sparkConf)
/opt/spark/spark-2.3.1-bin-hadoop2.7/python/pyspark/context.py:343: in getOrCreate
SparkContext(conf=conf or SparkConf())
/opt/spark/spark-2.3.1-bin-hadoop2.7/python/pyspark/context.py:115: in __init__
SparkContext._ensure_initialized(self, gateway=gateway, conf=conf)
/opt/spark/spark-2.3.1-bin-hadoop2.7/python/pyspark/context.py:292: in _ensure_initialized
SparkContext._gateway = gateway or launch_gateway(conf)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
conf = <pyspark.conf.SparkConf object at 0x7f8fcdb7fd10>
def launch_gateway(conf=None):
"""
launch jvm gateway
:param conf: spark configuration passed to spark-submit
:return:
"""
if "PYSPARK_GATEWAY_PORT" in os.environ:
gateway_port = int(os.environ["PYSPARK_GATEWAY_PORT"])
gateway_secret = os.environ["PYSPARK_GATEWAY_SECRET"]
else:
SPARK_HOME = _find_spark_home()
# Launch the Py4j gateway using Spark's run command so that we pick up the
# proper classpath and settings from spark-env.sh
on_windows = platform.system() == "Windows"
script = "./bin/spark-submit.cmd" if on_windows else "./bin/spark-submit"
command = [os.path.join(SPARK_HOME, script)]
if conf:
for k, v in conf.getAll():
command += ['--conf', '%s=%s' % (k, v)]
submit_args = os.environ.get("PYSPARK_SUBMIT_ARGS", "pyspark-shell")
if os.environ.get("SPARK_TESTING"):
submit_args = ' '.join([
"--conf spark.ui.enabled=false",
submit_args
])
command = command + shlex.split(submit_args)
# Create a temporary directory where the gateway server should write the connection
# information.
conn_info_dir = tempfile.mkdtemp()
try:
fd, conn_info_file = tempfile.mkstemp(dir=conn_info_dir)
os.close(fd)
os.unlink(conn_info_file)
env = dict(os.environ)
env["_PYSPARK_DRIVER_CONN_INFO_PATH"] = conn_info_file
# Launch the Java gateway.
# We open a pipe to stdin so that the Java gateway can die when the pipe is broken
if not on_windows:
# Don't send ctrl-c / SIGINT to the Java gateway:
def preexec_func():
signal.signal(signal.SIGINT, signal.SIG_IGN)
proc = Popen(command, stdin=PIPE, preexec_fn=preexec_func, env=env)
else:
# preexec_fn not supported on Windows
proc = Popen(command, stdin=PIPE, env=env)
# Wait for the file to appear, or for the process to exit, whichever happens first.
while not proc.poll() and not os.path.isfile(conn_info_file):
time.sleep(0.1)
if not os.path.isfile(conn_info_file):
raise Exception("Java gateway process exited before sending its port number")
Exception: Java gateway process exited before sending its port number
/opt/spark/spark-2.3.1-bin-hadoop2.7/python/pyspark/java_gateway.py:93: Exception
----------------------------- Captured stderr call -----------------------------
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/hadoop/mango/mango-assembly/target/mango-assembly-0.0.2-SNAPSHOT.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/spark/spark-2.3.1-bin-hadoop2.7/jars/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
2018-10-12 11:20:03 ERROR SparkUncaughtExceptionHandler:91 - Uncaught exception in thread Thread[main,5,main]
java.util.NoSuchElementException: key not found: _PYSPARK_DRIVER_CALLBACK_HOST
at scala.collection.MapLike$class.default(MapLike.scala:228)
at scala.collection.AbstractMap.default(Map.scala:59)
at scala.collection.MapLike$class.apply(MapLike.scala:141)
at scala.collection.AbstractMap.apply(Map.scala:59)
at org.apache.spark.api.python.PythonGatewayServer$$anonfun$main$1.apply$mcV$sp(PythonGatewayServer.scala:50)
at org.apache.spark.util.Utils$.tryOrExit(Utils.scala:1262)
at org.apache.spark.api.python.PythonGatewayServer$.main(PythonGatewayServer.scala:37)
at org.apache.spark.api.python.PythonGatewayServer.main(PythonGatewayServer.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:775)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:119)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Hi @ssabnis ! Have you run make prepare from the mango-python directory? Also, are you running in a virtual environment?
@akmorrow13 thanks for the quick reply. I did run make prepare, and I am not running in a virtual environment.
I think this is a Spark versioning issue. You are using Spark 2.3.1, but Mango is pre-built for Spark 2.2.1. More specifically, Spark 2.3.1 uses a newer version of py4j (0.10.7) that removed _PYSPARK_DRIVER_CALLBACK_HOST, whereas Spark 2.2.1 uses py4j 0.10.4. To fix this issue, try updating the Mango pom to match your installed Hadoop and Spark versions and recompile.
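As a rough sketch (assuming the Mango pom exposes spark.version and hadoop.version properties; adjust the values to whatever you actually run), the change would look something like:

<hadoop.version>3.1.0</hadoop.version>
<spark.version>2.3.1</spark.version>

followed by a rebuild:

mvn clean package -P python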
@akmorrow13 thank you. I also came across ./scripts/move_to_spark2.sh: is it required to run that in order to use Spark 2.3.1? I will update the pom and test again.
@akmorrow13 looks like Hadoop 3.1.0, Spark 2.3.1, and Parquet 1.8.2 have an issue; I get a different error now.
java.lang.NoSuchMethodError: org.apache.parquet.column.statistics.Statistics.getBuilderForReading(Lorg/apache/parquet/schema/PrimitiveType$PrimitiveTypeName;)Lorg/apache/parquet/column/statistics/Statistics$Builder;
at org.apache.parquet.format.converter.ParquetMetadataConverter.fromParquetStatisticsInternal(ParquetMetadataConverter.java:340)
at org.apache.parquet.format.converter.ParquetMetadataConverter.fromParquetStatistics(ParquetMetadataConverter.java:365)
at org.apache.parquet.format.converter.ParquetMetadataConverter.fromParquetMetadata(ParquetMetadataConverter.java:821)
at org.apache.parquet.format.converter.ParquetMetadataConverter.readParquetMetadata(ParquetMetadataConverter.java:798)
at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:484)
at org.apache.parquet.hadoop.ParquetFileReader.<init>(ParquetFileReader.java:568)
at org.apache.parquet.hadoop.ParquetFileReader.open(ParquetFileReader.java:492)
at org.apache.parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:166)
at org.apache.parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:147)
at org.apache.spark.rdd.NewHadoopRDD$$anon$1.liftedTree1$1(NewHadoopRDD.scala:189)
at org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:186)
at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:141)
at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:70)
Spark 2.3.1 uses Parquet 1.10.0 (https://github.com/apache/spark/blob/master/pom.xml#L132), so you would have to change the Mango pom to this version as well.
Just a warning: Mango has not yet been tested with these newer versions.
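For example, assuming the Mango pom has a parquet.version property, the change would be:

<parquet.version>1.10.0</parquet.version>

and then rebuild with mvn clean package -P python.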
@akmorrow13 you are right; now I get a BROTLI codec error.
java.lang.NoSuchFieldError: BROTLI
at org.apache.parquet.hadoop.metadata.CompressionCodecName.<clinit>(CompressionCodecName.java:31)
at org.apache.parquet.format.converter.ParquetMetadataConverter.fromParquetMetadata(ParquetMetadataConverter.java:821)
at org.apache.parquet.format.converter.ParquetMetadataConverter.readParquetMetadata(ParquetMetadataConverter.java:798)
at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:484)
at org.apache.parquet.hadoop.ParquetFileReader.<init>(ParquetFileReader.java:568)
at org.apache.parquet.hadoop.ParquetFileReader.open(ParquetFileReader.java:492)
at org.apache.parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:166)
at org.apache.parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:147)
at org.apache.spark.rdd.NewHadoopRDD$$anon$1.liftedTree1$1(NewHadoopRDD.scala:189)
at org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:186)
at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:141)
at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:70)
pom.xml entries
<properties>
<adam.version>0.24.0</adam.version>
<avro.version>1.8.1</avro.version>
<bdg-formats.version>0.11.3</bdg-formats.version>
<bdg-utils.version>0.2.13</bdg-utils.version>
<convert.version>0.3.0</convert.version>
<java.version>1.8</java.version>
<jetty.version>9.2.17.v20160517</jetty.version>
<ga4gh.version>0.6.0a10</ga4gh.version>
<hadoop.version>3.1.0</hadoop.version>
<hadoop-bam.version>7.9.2</hadoop-bam.version>
<htsjdk.version>2.9.1</htsjdk.version>
<parquet.version>1.10.0</parquet.version>
<scala.version>2.11.12</scala.version>
<scala.version.prefix>2.11</scala.version.prefix>
<scalatra.version>2.4.1</scalatra.version>
<spark.version>2.3.1</spark.version>
<spark.version.prefix>-spark2_</spark.version.prefix>
<snappy.version>1.0.5</snappy.version>
<scoverage.plugin.version>1.1.1</scoverage.plugin.version>
<protobuf.version>3.0.0-beta-3</protobuf.version>
</properties>
@akmorrow13 I have a BAM file in HDFS that I need to visualize using Mango. Any suggestions to get past this issue and make the UI work with Mango?
@akmorrow13 thanks for the help. I changed the Spark version to 2.2.1 and reconfigured, but I still get the FAILED tests. Any clue? I am attaching the build output file. Thanks. output.err2.zip
@ssabnis can you please post the errors directly in GitHub? It is easiest for debugging and issue documentation.
self._jvm.org.bdgenomics.adam.rdd.ADAMContext.ADAMContextFromSession(ss._jsparkSession)
E TypeError: 'JavaPackage' object is not callable
This generally means that Python cannot find the jar file.
Make sure you have correctly set:
ASSEMBLY_DIR=/home/hadoop/mango/mango-assembly/target
ASSEMBLY_JAR="$(ls -1 "$ASSEMBLY_DIR" | grep "^mango-assembly[0-9A-Za-z\_\.-]*\.jar$" | grep -v javadoc | grep -v sources || true)"
export PYSPARK_SUBMIT_ARGS="--jars ${ASSEMBLY_DIR}/${ASSEMBLY_JAR} --driver-class-path ${ASSEMBLY_DIR}/${ASSEMBLY_JAR} pyspark-shell"
And that echo $ASSEMBLY_DIR/$ASSEMBLY_JAR correctly points to the compiled jar.
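As a quick sanity check (a sketch reusing the variables set above), confirm the jar actually exists before launching pyspark:

echo "$ASSEMBLY_DIR/$ASSEMBLY_JAR"
ls -lh "$ASSEMBLY_DIR/$ASSEMBLY_JAR"   # should show something like mango-assembly-0.0.2-SNAPSHOT.jar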
@akmorrow13 I am able to compile now. I forgot to run mvn clean package before the Python build. All good now. Thanks.
Is there a Mango submit command to use a BAM on the local HDFS/Spark setup that I have? Any reference will help.
What are the steps and tools to visualize genome BAM files?
Please take a look at our readthedocs. Under usage and examples, there are both Python and browser-based tools that allow visualization of BAM files.
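For the browser route, the general shape (a sketch only; the flag names and paths here are assumptions, so follow the readthedocs usage section for the exact invocation) is to point mango-submit at a reference plus your reads:

./bin/mango-submit hdfs:///data/hg19.2bit \
    -reads hdfs:///data/sample.bam    # hypothetical HDFS paths; check readthedocs for the current flags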
@akmorrow13 one last question you may be able to help with: is there a large genome dataset that I can use with Mango to visualize? Any references will help.
@ssabnis one free dataset that you can access is the 1000 Genomes dataset. If you are running on AWS, it is hosted there. You can see Mango's AWS notebook tutorial, which accesses these files. Instructions for running on AWS can be found on readthedocs.