
Mutect implementation work

Open jstjohn opened this issue 8 years ago • 4 comments

An initial rough version of the MuTect algorithm, mostly ported over from work on Avocado.



jstjohn avatar Mar 13 '16 03:03 jstjohn

Thanks for the PR @jstjohn! One of @timodonnell, @arahuja, or I will have a look this week.

Any interest in a quick call to discuss your interest in Guacamole? Shoot me an email (address on my profile) if so!

ryan-williams avatar Mar 13 '16 17:03 ryan-williams

Thanks for the contribution @jstjohn . Looks awesome. Couple of quick comments (we'll add more as people look at this in more depth):

  • I think it'd be helpful to document which subset of MuTect 1's features this implements, and any known differences between the two implementations. For example, is filtering by the panel of normals supported (or is that what the noisy VCF is for)? This could just be a big header comment in SomaticMutectLikeCaller.scala, or potentially separate docs. It would also be useful to indicate (maybe in the README) how mature we can expect this code to be, e.g. how many full datasets it has been run on and compared against MuTect calls.
  • I think a rebase is needed -- this PR seems to include some unrelated commits from me and Arun.

Code style

  • We've been trying to follow the convention of avoiding abbreviations, except for extremely common ones like num. Renaming some of your variables (e.g. pu -> pileupElement) would help keep this consistent.
  • We've also been trying to keep a correct header comment with every non-trivial method.
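As an illustration of both conventions (PileupElement here is a minimal stand-in, not the actual Guacamole type, and supportsVariantAllele is an invented method):

```scala
// Illustrative stand-in for Guacamole's pileup element type.
case class PileupElement(allele: String, referenceAllele: String)

/** Returns true if this pileup element supports a non-reference allele. */
def supportsVariantAllele(pileupElement: PileupElement): Boolean =
  // Full name `pileupElement` rather than an abbreviation like `pu`.
  pileupElement.allele != pileupElement.referenceAllele
```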

Also, I tried running:

$ scripts/guacamole somatic-mutect-like \
  --normal-reads src/test/resources/synth1.normal.100k-200k.withmd.bam \
  --tumor-reads src/test/resources/synth1.tumor.100k-200k.withmd.bam \
  --out /tmp/out.vcf

And I got this serialization error:

--> [0.02 sec. later]: Loci partitioning: 1:0-249250620=0,10:0-135534746=0,11:0-135006515=0,12:0-133851894=0,13:0-115169877=0,14:0-107349539=0,15:0-102531391=0,16:0-90354752=0,17:0-81195209=0,18:0-78077247=0,19:0-59128982=0,2:0-243199372=0,20:0-63025519=0,21:0-48129894=0,22:0-51304565=0,3:0-198022429=0,4:0-191154275=0,5:0-180915259=0,6:0-171115066=0,7:0-159138662=0,8:0-146364021=0,9:0-141213430=0,GL000191.1:0-106432=0,GL000192.1:0-547495=0,GL000193.1:0-189788=0,GL000194.1:0-191468=0,GL000195.1:0-182895=0,GL000196.1:0-38913=0,GL000197.1:0-37174=0, [...]
Exception in thread "main" org.apache.spark.SparkException: Task not serializable
    at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:304)
    at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:294)
    at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:122)
    at org.apache.spark.SparkContext.clean(SparkContext.scala:2055)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1.apply(RDD.scala:742)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1.apply(RDD.scala:741)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
    at org.apache.spark.rdd.RDD.mapPartitionsWithIndex(RDD.scala:741)
    at org.hammerlab.guacamole.DistributedUtil$.windowTaskFlatMapMultipleRDDs(DistributedUtil.scala:628)
    at org.hammerlab.guacamole.DistributedUtil$.windowFlatMapWithState(DistributedUtil.scala:395)
    at org.hammerlab.guacamole.DistributedUtil$.pileupFlatMapTwoRDDs(DistributedUtil.scala:323)
    at org.hammerlab.guacamole.commands.SomaticMutectLike$Caller$.run(SomaticMutectLikeCaller.scala:192)
    at org.hammerlab.guacamole.commands.SomaticMutectLike$Caller$.run(SomaticMutectLikeCaller.scala:154)
    at org.hammerlab.guacamole.SparkCommand.run(Command.scala:55)
    at org.hammerlab.guacamole.Command.run(Command.scala:46)
    at org.hammerlab.guacamole.Guacamole$.main(Guacamole.scala:71)
    at org.hammerlab.guacamole.Guacamole.main(Guacamole.scala)
Caused by: java.io.NotSerializableException: org.hammerlab.guacamole.commands.SomaticMutectLike$Arguments
Serialization stack:
    - object not serializable (class: org.hammerlab.guacamole.commands.SomaticMutectLike$Arguments, value: org.hammerlab.guacamole.commands.SomaticMutectLike$Arguments@3611153f)
    - field (class: org.hammerlab.guacamole.commands.SomaticMutectLike$Caller$$anonfun$6, name: args$1, type: class org.hammerlab.guacamole.commands.SomaticMutectLike$Arguments)
    - object (class org.hammerlab.guacamole.commands.SomaticMutectLike$Caller$$anonfun$6, <function2>)
    - field (class: org.hammerlab.guacamole.DistributedUtil$$anonfun$pileupFlatMapTwoRDDs$1, name: function$3, type: interface scala.Function2)
    - object (class org.hammerlab.guacamole.DistributedUtil$$anonfun$pileupFlatMapTwoRDDs$1, <function2>)
    - field (class: org.hammerlab.guacamole.DistributedUtil$$anonfun$windowFlatMapWithState$1, name: function$2, type: interface scala.Function2)
    - object (class org.hammerlab.guacamole.DistributedUtil$$anonfun$windowFlatMapWithState$1, <function3>)
    - field (class: org.hammerlab.guacamole.DistributedUtil$$anonfun$windowTaskFlatMapMultipleRDDs$3, name: function$5, type: interface scala.Function3)
    - object (class org.hammerlab.guacamole.DistributedUtil$$anonfun$windowTaskFlatMapMultipleRDDs$3, <function2>)
    at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
    at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:47)
    at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:101)
    at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:301)
    ... 18 more
       12.95 real        28.89 user         2.31 sys

Do you get that as well? Our command-line parsing library (args4j) is kind of a hassle, and I haven't found a way to make the parsed arguments serializable. The solution is either to copy the args over to vals where they are referenced in a Spark-serialized closure, or to copy all of the args into a case class (which is serializable).
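The first option looks roughly like this; Arguments, oddsThreshold, and isJavaSerializable below are hypothetical stand-ins, using plain Java serialization (what Spark's default closure serializer uses) to show the difference:

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

// Stand-in for the args4j-backed Arguments class; deliberately NOT Serializable.
class Arguments { var oddsThreshold: Int = 90 }

// Java-serialize an object the way Spark's closure cleaner would.
def isJavaSerializable(obj: AnyRef): Boolean =
  try {
    new ObjectOutputStream(new ByteArrayOutputStream()).writeObject(obj)
    true
  } catch { case _: NotSerializableException => false }

val args = new Arguments

// Referencing args.oddsThreshold inside the closure captures the whole
// (non-serializable) Arguments object, which is what triggers the error above.
val badClosure: Int => Boolean = score => score > args.oddsThreshold

// Copying the field to a local val first means the closure captures only
// the Int, so it serializes fine.
val oddsThreshold = args.oddsThreshold
val goodClosure: Int => Boolean = score => score > oddsThreshold
```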

timodonnell avatar Mar 15 '16 15:03 timodonnell

The solution is either to copy the args over to vals where they are referenced in a Spark-serialized closure, or to copy all of the args into a case class (which is serializable).

Adding a with Serializable to Arguments would fix the ser error too, right?
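A quick sanity check of that approach, with hypothetical stand-ins for the real args4j-annotated Arguments class (WithoutMixin models it as-is, WithMixin models it after adding the mixin):

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

// Hypothetical stand-ins for the Arguments class, before and after the fix.
class WithoutMixin { var oddsThreshold: Int = 90 }
class WithMixin extends Serializable { var oddsThreshold: Int = 90 }

// Java-serialize an object the way Spark's default closure serializer would.
def isJavaSerializable(obj: AnyRef): Boolean =
  try {
    new ObjectOutputStream(new ByteArrayOutputStream()).writeObject(obj)
    true
  } catch { case _: NotSerializableException => false }
```

Serializing a WithoutMixin instance throws NotSerializableException, while a WithMixin instance round-trips cleanly, so the mixin alone should clear the error, provided every field of Arguments is itself serializable.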

ryan-williams avatar Mar 15 '16 15:03 ryan-williams

The serialization issue seems to be fixed with the suggested with Serializable addition to Arguments. I have also gone through, updated many of the function names, and added docs to the functions. I uploaded a README to the docs folder outlining the MuTect algorithm and the differences I am aware of.


Review status: 0 of 17 files reviewed at latest revision, 3 unresolved discussions.


src/main/scala/org/hammerlab/guacamole/commands/SomaticMutectLikeCaller.scala, line 303 [r3] (raw file): Done.


src/main/scala/org/hammerlab/guacamole/commands/SomaticMutectLikeCaller.scala, line 479 [r3] (raw file): Done.



jstjohn avatar Mar 15 '16 19:03 jstjohn