Mango Build Error
Hello,
I am new to genomics projects. I am running Mango and encountering build errors. Any help is greatly appreciated.
I have the following setup:
Package Versions:
- Python 2.7.5
- java version "1.8.0_171"
- Scala code runner version 2.11.12
- Hadoop 3.1.0
- Spark 2.3.1
- npm 3.10.10
My .bashrc file entries
export JAVA_HOME=/usr
export SPARK_HOME=/opt/spark/spark-2.3.1-bin-hadoop2.7
export PATH=$PATH:$SPARK_HOME/bin
export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.10.7-src.zip:$PYTHONPATH
ASSEMBLY_DIR=/home/hadoop/mango/mango-assembly/target
ASSEMBLY_JAR="$(ls -1 "$ASSEMBLY_DIR" | grep "^mango-assembly[0-9A-Za-z\_\.-]*\.jar$" | grep -v javadoc | grep -v sources || true)"
export PYSPARK_SUBMIT_ARGS="--jars ${ASSEMBLY_DIR}/${ASSEMBLY_JAR} --driver-class-path ${ASSEMBLY_DIR}/${ASSEMBLY_JAR} pyspark-shell"
Command: mvn package -P python
BUILD ERROR
bdgenomics/mango/test/alignment_test.py::AlignmentTest::test_coverage_distribution FAILED
bdgenomics/mango/test/alignment_test.py::AlignmentTest::test_fragment_distribution FAILED
bdgenomics/mango/test/alignment_test.py::AlignmentTest::test_indel_distribution FAILED
bdgenomics/mango/test/alignment_test.py::AlignmentTest::test_indel_distribution_maximal_bin_size FAILED
bdgenomics/mango/test/alignment_test.py::AlignmentTest::test_indel_distribution_no_elements FAILED
bdgenomics/mango/test/alignment_test.py::AlignmentTest::test_mapq_distribution FAILED
bdgenomics/mango/test/alignment_test.py::AlignmentTest::test_visualize_alignments FAILED
bdgenomics/mango/test/coverage_test.py::CoverageTest::test_coverage_distribution FAILED
bdgenomics/mango/test/coverage_test.py::CoverageTest::test_example_coverage FAILED
bdgenomics/mango/test/distribution_test.py::DistributionTest::test_cumulative_count_distribution FAILED
bdgenomics/mango/test/distribution_test.py::DistributionTest::test_fail_on_invalid_sample FAILED
bdgenomics/mango/test/distribution_test.py::DistributionTest::test_normalized_count_distribution FAILED
bdgenomics/mango/test/distribution_test.py::DistributionTest::test_sampling FAILED
bdgenomics/mango/test/feature_test.py::FeatureTest::test_visualize_features FAILED
bdgenomics/mango/test/notebook_test.py::NotebookTest::test_alignment_example FAILED
bdgenomics/mango/test/notebook_test.py::NotebookTest::test_coverage_example FAILED
bdgenomics/mango/test/notebook_test.py::NotebookTest::test_example FAILED
bdgenomics/mango/test/variant_test.py::VariantTest::test_visualize_variants FAILED
=================================== FAILURES ===================================
___________________ AlignmentTest.test_coverage_distribution ___________________
bdgenomics/mango/test/__init__.py:65: in setUp
self.ss = SparkSession.builder.master('local[4]').appName(class_name).getOrCreate()
/opt/spark/spark-2.3.1-bin-hadoop2.7/python/pyspark/sql/session.py:173: in getOrCreate
sc = SparkContext.getOrCreate(sparkConf)
/opt/spark/spark-2.3.1-bin-hadoop2.7/python/pyspark/context.py:343: in getOrCreate
SparkContext(conf=conf or SparkConf())
/opt/spark/spark-2.3.1-bin-hadoop2.7/python/pyspark/context.py:115: in __init__
SparkContext._ensure_initialized(self, gateway=gateway, conf=conf)
/opt/spark/spark-2.3.1-bin-hadoop2.7/python/pyspark/context.py:292: in _ensure_initialized
SparkContext._gateway = gateway or launch_gateway(conf)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
conf = <pyspark.conf.SparkConf object at 0x7f8fcdb7fd10>
def launch_gateway(conf=None):
"""
launch jvm gateway
:param conf: spark configuration passed to spark-submit
:return:
"""
if "PYSPARK_GATEWAY_PORT" in os.environ:
gateway_port = int(os.environ["PYSPARK_GATEWAY_PORT"])
gateway_secret = os.environ["PYSPARK_GATEWAY_SECRET"]
else:
SPARK_HOME = _find_spark_home()
# Launch the Py4j gateway using Spark's run command so that we pick up the
# proper classpath and settings from spark-env.sh
on_windows = platform.system() == "Windows"
script = "./bin/spark-submit.cmd" if on_windows else "./bin/spark-submit"
command = [os.path.join(SPARK_HOME, script)]
if conf:
for k, v in conf.getAll():
command += ['--conf', '%s=%s' % (k, v)]
submit_args = os.environ.get("PYSPARK_SUBMIT_ARGS", "pyspark-shell")
if os.environ.get("SPARK_TESTING"):
submit_args = ' '.join([
"--conf spark.ui.enabled=false",
submit_args
])
command = command + shlex.split(submit_args)
# Create a temporary directory where the gateway server should write the connection
# information.
conn_info_dir = tempfile.mkdtemp()
try:
fd, conn_info_file = tempfile.mkstemp(dir=conn_info_dir)
os.close(fd)
os.unlink(conn_info_file)
env = dict(os.environ)
env["_PYSPARK_DRIVER_CONN_INFO_PATH"] = conn_info_file
# Launch the Java gateway.
# We open a pipe to stdin so that the Java gateway can die when the pipe is broken
if not on_windows:
# Don't send ctrl-c / SIGINT to the Java gateway:
def preexec_func():
signal.signal(signal.SIGINT, signal.SIG_IGN)
proc = Popen(command, stdin=PIPE, preexec_fn=preexec_func, env=env)
else:
# preexec_fn not supported on Windows
proc = Popen(command, stdin=PIPE, env=env)
# Wait for the file to appear, or for the process to exit, whichever happens first.
while not proc.poll() and not os.path.isfile(conn_info_file):
time.sleep(0.1)
if not os.path.isfile(conn_info_file):
raise Exception("Java gateway process exited before sending its port number")
Exception: Java gateway process exited before sending its port number
/opt/spark/spark-2.3.1-bin-hadoop2.7/python/pyspark/java_gateway.py:93: Exception
----------------------------- Captured stderr call -----------------------------
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/hadoop/mango/mango-assembly/target/mango-assembly-0.0.2-SNAPSHOT.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/spark/spark-2.3.1-bin-hadoop2.7/jars/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
2018-10-12 11:20:03 ERROR SparkUncaughtExceptionHandler:91 - Uncaught exception in thread Thread[main,5,main]
java.util.NoSuchElementException: key not found: _PYSPARK_DRIVER_CALLBACK_HOST
at scala.collection.MapLike$class.default(MapLike.scala:228)
at scala.collection.AbstractMap.default(Map.scala:59)
at scala.collection.MapLike$class.apply(MapLike.scala:141)
at scala.collection.AbstractMap.apply(Map.scala:59)
at org.apache.spark.api.python.PythonGatewayServer$$anonfun$main$1.apply$mcV$sp(PythonGatewayServer.scala:50)
at org.apache.spark.util.Utils$.tryOrExit(Utils.scala:1262)
at org.apache.spark.api.python.PythonGatewayServer$.main(PythonGatewayServer.scala:37)
at org.apache.spark.api.python.PythonGatewayServer.main(PythonGatewayServer.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:775)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:119)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Hi @ssabnis ! Have you run make prepare from the mango-python directory? Also, are you running in a virtual environment?
@akmorrow13 thanks for the quick reply. I did run make prepare, and I am not running in a virtual environment.
I think this is a Spark versioning issue. You are using Spark 2.3.1, but Mango is pre-built for Spark 2.2.1. More specifically, Spark 2.3.1 uses a newer version of py4j (0.10.7) that removed _PYSPARK_DRIVER_CALLBACK_HOST, whereas Spark 2.2.1 uses py4j 0.10.4. To fix this issue, try updating the Mango pom to match your installed Hadoop and Spark versions and recompile.
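As a rough sketch (assuming the Mango pom exposes spark.version and hadoop.version properties; adjust the values to whatever you actually run), the change would look something like:

<hadoop.version>3.1.0</hadoop.version>
<spark.version>2.3.1</spark.version>

followed by a rebuild:

mvn clean package -P python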
@akmorrow13 thank you. I also came across ./scripts/move_to_spark2.sh: is it required to run that in order to use Spark 2.3.1? I will update the pom and test again.
@akmorrow13 looks like Hadoop 3.1.0, Spark 2.3.1, and Parquet 1.8.2 have an issue; I get a different error now.
java.lang.NoSuchMethodError: org.apache.parquet.column.statistics.Statistics.getBuilderForReading(Lorg/apache/parquet/schema/PrimitiveType$PrimitiveTypeName;)Lorg/apache/parquet/column/statistics/Statistics$Builder;
at org.apache.parquet.format.converter.ParquetMetadataConverter.fromParquetStatisticsInternal(ParquetMetadataConverter.java:340)
at org.apache.parquet.format.converter.ParquetMetadataConverter.fromParquetStatistics(ParquetMetadataConverter.java:365)
at org.apache.parquet.format.converter.ParquetMetadataConverter.fromParquetMetadata(ParquetMetadataConverter.java:821)
at org.apache.parquet.format.converter.ParquetMetadataConverter.readParquetMetadata(ParquetMetadataConverter.java:798)
at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:484)
at org.apache.parquet.hadoop.ParquetFileReader.<init>(ParquetFileReader.java:568)
at org.apache.parquet.hadoop.ParquetFileReader.open(ParquetFileReader.java:492)
at org.apache.parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:166)
at org.apache.parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:147)
at org.apache.spark.rdd.NewHadoopRDD$$anon$1.liftedTree1$1(NewHadoopRDD.scala:189)
at org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:186)
at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:141)
at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:70)
Spark 2.3.1 uses Parquet 1.10.0 (https://github.com/apache/spark/blob/master/pom.xml#L132), so you would have to change the Mango pom to this version as well.
Just a warning: Mango has not yet been tested with these newer versions.
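For example, assuming the Mango pom has a parquet.version property, the change would be:

<parquet.version>1.10.0</parquet.version>

and then rebuild with mvn clean package -P python.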
@akmorrow13 you are right; now I get a BROTLI codec error.
java.lang.NoSuchFieldError: BROTLI
at org.apache.parquet.hadoop.metadata.CompressionCodecName.<clinit>(CompressionCodecName.java:31)
at org.apache.parquet.format.converter.ParquetMetadataConverter.fromParquetMetadata(ParquetMetadataConverter.java:821)
at org.apache.parquet.format.converter.ParquetMetadataConverter.readParquetMetadata(ParquetMetadataConverter.java:798)
at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:484)
at org.apache.parquet.hadoop.ParquetFileReader.<init>(ParquetFileReader.java:568)
at org.apache.parquet.hadoop.ParquetFileReader.open(ParquetFileReader.java:492)
at org.apache.parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:166)
at org.apache.parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:147)
at org.apache.spark.rdd.NewHadoopRDD$$anon$1.liftedTree1$1(NewHadoopRDD.scala:189)
at org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:186)
at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:141)
at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:70)
pom.xml entries
<properties>
<adam.version>0.24.0</adam.version>
<avro.version>1.8.1</avro.version>
<bdg-formats.version>0.11.3</bdg-formats.version>
<bdg-utils.version>0.2.13</bdg-utils.version>
<convert.version>0.3.0</convert.version>
<java.version>1.8</java.version>
<jetty.version>9.2.17.v20160517</jetty.version>
<ga4gh.version>0.6.0a10</ga4gh.version>
<hadoop.version>3.1.0</hadoop.version>
<hadoop-bam.version>7.9.2</hadoop-bam.version>
<htsjdk.version>2.9.1</htsjdk.version>
<parquet.version>1.10.0</parquet.version>
<scala.version>2.11.12</scala.version>
<scala.version.prefix>2.11</scala.version.prefix>
<scalatra.version>2.4.1</scalatra.version>
<spark.version>2.3.1</spark.version>
<spark.version.prefix>-spark2_</spark.version.prefix>
<snappy.version>1.0.5</snappy.version>
<scoverage.plugin.version>1.1.1</scoverage.plugin.version>
<protobuf.version>3.0.0-beta-3</protobuf.version>
</properties>
@akmorrow13 I have a BAM file in HDFS that I need to visualize using Mango. Any suggestions to get past this issue and make the UI work with Mango?
@akmorrow13 thanks for the help. I changed the Spark version to 2.2.1 and reconfigured, but I still get the FAILED tests. Any clue? I am attaching the build output file. Thanks. output.err2.zip
@ssabnis can you please post the errors directly in GitHub? It is easiest for debugging and issue documentation.
self._jvm.org.bdgenomics.adam.rdd.ADAMContext.ADAMContextFromSession(ss._jsparkSession)
E TypeError: 'JavaPackage' object is not callable
This generally means that Python cannot find the jar file.
Make sure you have correctly set:
ASSEMBLY_DIR=/home/hadoop/mango/mango-assembly/target
ASSEMBLY_JAR="$(ls -1 "$ASSEMBLY_DIR" | grep "^mango-assembly[0-9A-Za-z\_\.-]*\.jar$" | grep -v javadoc | grep -v sources || true)"
export PYSPARK_SUBMIT_ARGS="--jars ${ASSEMBLY_DIR}/${ASSEMBLY_JAR} --driver-class-path ${ASSEMBLY_DIR}/${ASSEMBLY_JAR} pyspark-shell"
And that echo $ASSEMBLY_DIR/$ASSEMBLY_JAR correctly points to the compiled jar.
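As a quick sanity check (a sketch reusing the variables set above), confirm the jar actually exists before launching pyspark:

echo "$ASSEMBLY_DIR/$ASSEMBLY_JAR"
ls -lh "$ASSEMBLY_DIR/$ASSEMBLY_JAR"   # should show something like mango-assembly-0.0.2-SNAPSHOT.jar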
@akmorrow13 I am able to compile now. I forgot to run mvn clean package before the Python build. All good now. Thanks.
Is there a Mango submit command to use a BAM on the local HDFS/Spark setup that I have? Any reference will help.
What are the steps and tools to visualize genome BAM files?
Please take a look at our readthedocs. Under usage and examples, there are both Python and browser-based tools that allow visualization of BAM files.
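For the browser route, the general shape (a sketch only; the flag names and paths here are assumptions, so follow the readthedocs usage section for the exact invocation) is to point mango-submit at a reference plus your reads:

./bin/mango-submit hdfs:///data/hg19.2bit \
    -reads hdfs:///data/sample.bam    # hypothetical HDFS paths; check readthedocs for the current flags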
@akmorrow13 one last question you may be able to help with: is there a large genome dataset that I can use with Mango to visualize? Any references will help.
@ssabnis one free dataset that you can access is the 1000 Genomes dataset. If you are running on AWS, it is hosted there. You can see Mango's AWS notebook tutorial, which accesses these files. Instructions for running on AWS can be found on readthedocs.