Mango cannot load fasta files with pipes in the contigName
If I run mango-submit passing a .fasta file as reference:
/users/bbooth/src/mango/bin/mango-submit /data/seqdata/analysis/fakereads/S288C_reference_genome_R64-2-1_20150113/S288C_reference_sequence_R64-2-1_20150113.fsa.fasta -features /data/seqdata/analysis/fakereads//data/seqdata/analysis/fakereads/S288C_reference_genome_R64-2-1_20150113/saccharomyces_cerevisiae_R64-2-1_20150113.genes.gff3
The fasta file has chromosome names with pipes, e.g.:
ref|NC_001141|
ref|NC_001136|
ref|NC_001135|
ref|NC_001144|
..., etc.
Then I get this error:
Command body threw exception:
java.lang.AssertionError: assertion failed: SequenceRecord.name is null or empty
Exception in thread "main" java.lang.AssertionError: assertion failed: SequenceRecord.name is null or empty
at scala.Predef$.assert(Predef.scala:170)
at org.bdgenomics.adam.models.SequenceRecord.<init>(SequenceDictionary.scala:287)
at org.bdgenomics.adam.models.SequenceRecord$.apply(SequenceDictionary.scala:403)
at org.bdgenomics.adam.util.ReferenceContigMap$$anonfun$1.apply(ReferenceContigMap.scala:51)
at org.bdgenomics.adam.util.ReferenceContigMap$$anonfun$1.apply(ReferenceContigMap.scala:50)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.immutable.Map$Map1.foreach(Map.scala:116)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.AbstractTraversable.map(Traversable.scala:104)
at org.bdgenomics.adam.util.ReferenceContigMap.<init>(ReferenceContigMap.scala:50)
at org.bdgenomics.adam.util.ReferenceContigMap$.apply(ReferenceContigMap.scala:107)
at org.bdgenomics.adam.rdd.ADAMContext$$anonfun$loadReferenceFile$1.apply(ADAMContext.scala:3010)
at org.bdgenomics.adam.rdd.ADAMContext$$anonfun$loadReferenceFile$1.apply(ADAMContext.scala:3007)
at scala.Option.fold(Option.scala:158)
at org.apache.spark.rdd.Timer.time(Timer.scala:48)
at org.bdgenomics.adam.rdd.ADAMContext.loadReferenceFile(ADAMContext.scala:3005)
at org.bdgenomics.mango.models.AnnotationMaterialization.<init>(AnnotationMaterialization.scala:42)
at org.bdgenomics.mango.cli.VizReads.initAnnotations(VizReads.scala:638)
at org.bdgenomics.mango.cli.VizReads.run(VizReads.scala:586)
at org.bdgenomics.utils.cli.BDGSparkCommand$class.run(BDGCommand.scala:55)
at org.bdgenomics.mango.cli.VizReads.run(VizReads.scala:579)
at org.bdgenomics.utils.cli.BDGCommandCompanion$class.main(BDGCommand.scala:33)
at org.bdgenomics.mango.cli.VizReads$.main(VizReads.scala:69)
at org.bdgenomics.mango.cli.VizReads.main(VizReads.scala)
This is due to the following lines in FastaConverter.parseDescriptionLine:
// is this description metadata or not? if it is metadata, it will contain "|"
if (split._1.contains('|')) {
(None, Some(dL.stripPrefix(">").trim))
If a pipe character appears in the contig name, the NucleotideFragment gets no name, only a description that contains the name. This seems counterintuitive.
Without a contigName, mango cannot handle the record and fails with the SequenceRecord.name assertion shown above. A fasta record should always get a contigName, even if the name contains a pipe character.
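To illustrate the behavior I would expect, here is a minimal sketch, not ADAM's actual code. It assumes the contig name is simply the first whitespace-delimited token of the description line, pipes included, and that anything after it is the description. The object name `DescriptionLineSketch` and the method `parse` are hypothetical names made up for this example.

```scala
// Hypothetical sketch of the expected parsing; this is NOT ADAM's
// FastaConverter implementation, and all names here are illustrative.
object DescriptionLineSketch {

  /**
   * Splits a FASTA description line into (contig name, optional description).
   *
   * Assumption: the contig name is the first whitespace-delimited token,
   * regardless of whether it contains '|' characters.
   */
  def parse(descriptionLine: String): (String, Option[String]) = {
    // Drop the leading '>' and surrounding whitespace.
    val body = descriptionLine.stripPrefix(">").trim

    // Split on the first run of whitespace only, so the description
    // keeps its internal spacing.
    val parts = body.split("\\s+", 2)

    val name = parts(0)
    val description = if (parts.length > 1) Some(parts(1)) else None
    (name, description)
  }
}
```

With this approach, a header such as `>ref|NC_001141|` always produces a non-empty name, so the `SequenceRecord.name` assertion would not be triggered by pipes alone.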
Converting the fasta file to two-bit format works as a workaround for this case.
Hi @benwbooth, thanks for the catch! This looks like a bug in ADAM's FastaConverter, not Mango. Can you make an issue there so we can track it?
In general, twoBit files are a little nicer to work with for the browser, since they are smaller and more responsive.