guacamole
guacamole copied to clipboard
Mutect implementation work
An initial rough version of the MuTect algorithm, mostly ported over from work on Avocado.
Thanks for the PR @jstjohn! One of @timodonnell, @arahuja, or I will have a look this week.
Any interest in a quick call to discuss your interest in Guacamole? Shoot me an email (address on my profile) if so!
Thanks for the contribution @jstjohn . Looks awesome. Couple of quick comments (we'll add more as people look at this in more depth):
- I think it'd be helpful to document what subset of features of Mutect 1 this implements, and any known differences between these implementations. For example, is filtering by the panel of normals supported (or is that what the noisy vcf is for?)? I think this could just be a big header comment in
SomaticMutectLikeCaller.scala
, or potentially added as docs. It would also be useful to indicate (maybe on the README) about how mature we can expect this code to be, e.g. how many full datasets has it been run on and compared against mutect calls? - I think a rebase is needed -- this PR seems to include some unrelated commits from me and Arun.
Code style
- We've been trying to follow the convention of avoiding abbreviations, except for extremely common things like
num
. Renaming some of your names (e.g.pu -> pileupElement
) would help stay consistent with this. - We've also been trying to keep a correct header comment with every non-trivial method.
Also, I tried running:
$ scripts/guacamole somatic-mutect-like \
--normal-reads src/test/resources/synth1.normal.100k-200k.withmd.bam \
--tumor-reads src/test/resources/synth1.tumor.100k-200k.withmd.bam \
--out /tmp/out.vcf
And I got this serialization error:
--> [0.02 sec. later]: Loci partitioning: 1:0-249250620=0,10:0-135534746=0,11:0-135006515=0,12:0-133851894=0,13:0-115169877=0,14:0-107349539=0,15:0-102531391=0,16:0-90354752=0,17:0-81195209=0,18:0-78077247=0,19:0-59128982=0,2:0-243199372=0,20:0-63025519=0,21:0-48129894=0,22:0-51304565=0,3:0-198022429=0,4:0-191154275=0,5:0-180915259=0,6:0-171115066=0,7:0-159138662=0,8:0-146364021=0,9:0-141213430=0,GL000191.1:0-106432=0,GL000192.1:0-547495=0,GL000193.1:0-189788=0,GL000194.1:0-191468=0,GL000195.1:0-182895=0,GL000196.1:0-38913=0,GL000197.1:0-37174=0, [...]
Exception in thread "main" org.apache.spark.SparkException: Task not serializable
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:304)
at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:294)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:122)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2055)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1.apply(RDD.scala:742)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1.apply(RDD.scala:741)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
at org.apache.spark.rdd.RDD.mapPartitionsWithIndex(RDD.scala:741)
at org.hammerlab.guacamole.DistributedUtil$.windowTaskFlatMapMultipleRDDs(DistributedUtil.scala:628)
at org.hammerlab.guacamole.DistributedUtil$.windowFlatMapWithState(DistributedUtil.scala:395)
at org.hammerlab.guacamole.DistributedUtil$.pileupFlatMapTwoRDDs(DistributedUtil.scala:323)
at org.hammerlab.guacamole.commands.SomaticMutectLike$Caller$.run(SomaticMutectLikeCaller.scala:192)
at org.hammerlab.guacamole.commands.SomaticMutectLike$Caller$.run(SomaticMutectLikeCaller.scala:154)
at org.hammerlab.guacamole.SparkCommand.run(Command.scala:55)
at org.hammerlab.guacamole.Command.run(Command.scala:46)
at org.hammerlab.guacamole.Guacamole$.main(Guacamole.scala:71)
at org.hammerlab.guacamole.Guacamole.main(Guacamole.scala)
Caused by: java.io.NotSerializableException: org.hammerlab.guacamole.commands.SomaticMutectLike$Arguments
Serialization stack:
- object not serializable (class: org.hammerlab.guacamole.commands.SomaticMutectLike$Arguments, value: org.hammerlab.guacamole.commands.SomaticMutectLike$Arguments@3611153f)
- field (class: org.hammerlab.guacamole.commands.SomaticMutectLike$Caller$$anonfun$6, name: args$1, type: class org.hammerlab.guacamole.commands.SomaticMutectLike$Arguments)
- object (class org.hammerlab.guacamole.commands.SomaticMutectLike$Caller$$anonfun$6, <function2>)
- field (class: org.hammerlab.guacamole.DistributedUtil$$anonfun$pileupFlatMapTwoRDDs$1, name: function$3, type: interface scala.Function2)
- object (class org.hammerlab.guacamole.DistributedUtil$$anonfun$pileupFlatMapTwoRDDs$1, <function2>)
- field (class: org.hammerlab.guacamole.DistributedUtil$$anonfun$windowFlatMapWithState$1, name: function$2, type: interface scala.Function2)
- object (class org.hammerlab.guacamole.DistributedUtil$$anonfun$windowFlatMapWithState$1, <function3>)
- field (class: org.hammerlab.guacamole.DistributedUtil$$anonfun$windowTaskFlatMapMultipleRDDs$3, name: function$5, type: interface scala.Function3)
- object (class org.hammerlab.guacamole.DistributedUtil$$anonfun$windowTaskFlatMapMultipleRDDs$3, <function2>)
at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:47)
at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:101)
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:301)
... 18 more
12.95 real 28.89 user 2.31 sys
Do you get that as well? Our commandline parsing library (args4j) is kind of a hassle, and I haven't found a way to make the parsed arguments be serializable. The solution to this is either to copy the args over to vals when they are referenced in a Spark serialized closure, or else copy all of the args over to case class (which is serializable).
The solution to this is either to copy the args over to vals when they are referenced in a Spark serialized closure, or else copy all of the args over to case class (which is serializable).
Adding a with Serializable
to Arguments
would fix the ser error too, right?
The serializable issue seems to be fixed with the suggested with Serializable
addition to Arguments
. I have also gone through and updated many of the function names/added docs to the functions. I uploaded a README to the docs
folder outlining the MuTect algorithm and the differences I am aware of.
Review status: 0 of 17 files reviewed at latest revision, 3 unresolved discussions.
src/main/scala/org/hammerlab/guacamole/commands/SomaticMutectLikeCaller.scala, line 303 [r3] (raw file): Done.
src/main/scala/org/hammerlab/guacamole/commands/SomaticMutectLikeCaller.scala, line 479 [r3] (raw file): Done.
Comments from the review on Reviewable.io