deepTools
Galaxy plotEnrichment wrapper (arguably) does not handle collections properly
Relevant File: /galaxy/wrapper/plotEnrichment.xml
Issue:
The Galaxy deeptools plotEnrichment tool takes two files, a BAM file and a BED file, and finds reads in the BAM file in the regions specified by the BED file.
When you have multiple BAMs and associated BEDs in collections (e.g. reads from ChIP-seq and the associated peaks called by MACS2) and try to plot enrichment, the Galaxy deeptools plotEnrichment tool plots enrichment for every combination of BAMs and BEDs. Instead, we would prefer that it zip the BAMs and BEDs and plot enrichment only for the pair at each index (the i-th BAM with the i-th BED).
Details:
currently:
BAM_collection = [1.bam, 2.bam]
BED_collection = [1.bed, 2.bed]
plotEnrichment(BAM_collection, BED_collection)
=> [plot(1.bam, 1.bed), plot(1.bam, 2.bed), plot(2.bam, 1.bed), plot(2.bam, 2.bed)]
desired:
BAM_collection = [1.bam, 2.bam]
BED_collection = [1.bed, 2.bed]
plotEnrichment(BAM_collection, BED_collection)
=> [plot(1.bam, 1.bed), plot(2.bam, 2.bed)]
As our group moves to larger collections of experiments, this is causing a lot of extra computing (imagine a 30-element collection causing 900 jobs instead of 30).
If the maintainers of this repository don't believe that the desired behavior should be the default, then we propose adding an option for it in the Advanced Options.
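To make the difference concrete, here is a minimal Python sketch of the two pairing strategies and the job counts they imply. It only models the pairing logic; the function names `cross_product_jobs` and `zipped_jobs` are hypothetical and are not part of deepTools or Galaxy.

```python
from itertools import product


def cross_product_jobs(bams, beds):
    """Current behavior as described above: one job per (BAM, BED) combination."""
    return list(product(bams, beds))


def zipped_jobs(bams, beds):
    """Requested behavior: pair the i-th BAM with the i-th BED."""
    if len(bams) != len(beds):
        raise ValueError("collections must have the same length to be zipped")
    return list(zip(bams, beds))


bams = ["1.bam", "2.bam"]
beds = ["1.bed", "2.bed"]
print(cross_product_jobs(bams, beds))  # four (BAM, BED) pairs
print(zipped_jobs(bams, beds))         # two (BAM, BED) pairs

# Job counts for a 30-element collection (hypothetical file names).
n = 30
big_bams = [f"{i}.bam" for i in range(1, n + 1)]
big_beds = [f"{i}.bed" for i in range(1, n + 1)]
print(len(cross_product_jobs(big_bams, big_beds)))  # 900
print(len(zipped_jobs(big_bams, big_beds)))         # 30
```

The zipped variant scales linearly with collection size, while the cross product scales quadratically, which is where the 900-versus-30 figure comes from.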
(cc @iamjli)
@bgruening Is it possible to iterate over joint collections like this in Galaxy wrappers?
It is possible via the API as far as I remember, but not via the UI. @jmchilton would know better.
I should note we're also having this same issue with deeptools' plotHeatmap, though it has been less of an issue so far. But we would love to have an option in plotHeatmap to zip collections in this same manner.
Doing this with plotHeatmap would be a bad idea, since the plots would incorrectly appear to be comparable. I would suggest creating separate heatmaps in that case.
@dpryan79 the "correct" behavior is ambiguous here of course. What we want to see is a heatmap of all our samples for one set of regions. So we pass a collection of bams and a single bed, and we get a collection of identical plots (each containing every bam), each generated separately.
current behavior
BAM_collection = [1.bam, 2.bam, 3.bam]
BED_file = regions.bed
plotHeatmap(BAM_collection, BED_file)
=> [plot([1.bam, 2.bam, 3.bam], regions.bed),
plot([1.bam, 2.bam, 3.bam], regions.bed),
plot([1.bam, 2.bam, 3.bam], regions.bed)]
plotHeatmap is expensive, so you can imagine this is painful for large collections.
desired behavior
BAM_collection = [1.bam, 2.bam, 3.bam]
BED_file = regions.bed
plotHeatmap(BAM_collection, BED_file)
=> [plot([1.bam, 2.bam, 3.bam], regions.bed)]
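As a rough illustration of the redundancy, the following Python sketch models the reported behavior (one job per collection element, each still receiving every BAM) next to the desired single job. The function names `mapped_over_jobs` and `single_job` are hypothetical; this is a model of the job lists shown above, not of how Galaxy or the wrapper is actually implemented.

```python
def mapped_over_jobs(bams, bed):
    """Reported behavior: one job per collection element, each given all BAMs."""
    return [(tuple(bams), bed) for _ in bams]


def single_job(bams, bed):
    """Desired behavior: a single job that receives all BAMs at once."""
    return [(tuple(bams), bed)]


bams = ["1.bam", "2.bam", "3.bam"]
current = mapped_over_jobs(bams, "regions.bed")
desired = single_job(bams, "regions.bed")

print(len(current))               # 3 jobs
print(len(set(current)))          # 1 distinct job: the other two are duplicates
print(desired == [current[0]])    # True: the desired job is just one of them
```

In this model every job after the first is wasted work, so the cost of the current behavior grows with the size of the BAM collection.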
Ah, gotcha. You should be able to select a collection as input for the score files in computeMatrix and get exactly that when you then run plotHeatmap on the output. Otherwise yes I should change that if possible.
So it is a tool issue then? Do I need to weigh in still 😆? If I do...
https://github.com/galaxyproject/galaxy/issues/4623 might have some more context on the Galaxy issues.
If you pass two collections to a tool that takes a single data parameter, Galaxy will zip the collections. But this tool has multiple="true" data parameters, so I guess what you do with multiple collections is sort of up to the tool author? The tool has indicated it wants all the files, so the tool can zip, cross product, concatenate - whatever I suppose. If you want Galaxy to do the zipping and to get multiple jobs, I think you have to wrap the inputs in conditionals and have a case that consumes a single data parameter until #4623 is resolved.
If I'm missing something and there are Galaxy issues other than #4623 involved here, let me know. Otherwise, I do really need to resolve that issue: it is on the collections roadmap (https://github.com/galaxyproject/galaxy/projects/7) and is the very top item on @jmchilton's collection Plan for 2018 card.
You should be able to select a collection as input for the score files in computeMatrix and get exactly that when you then run plotHeatmap on the output. Otherwise yes I should change that if possible.
@dpryan79 this might be an issue on our end, but we've tried a couple different techniques for computeMatrix and still get this result for plotHeatmap. If there's something in particular we should try for computeMatrix, let me know. As mentioned though, plotEnrichment is the bigger issue for us right now.
@jmchilton I'm not sure this is strictly a Galaxy issue. From what I know, I believe it is an issue with the deeptools Galaxy wrapper.
the tool can zip, cross product, concatenate - whatever I suppose.
What I'm asking for is that we define the appropriate default behavior for plotEnrichment and plotHeatmap when each of these tools is passed any combination of collections and individual files for its inputs, and then potentially also provide the other reasonable options we can think of as flags in the Galaxy tool wrappers.
I claim that the current behavior of plotHeatmap is not a good default:
BAM_collection = [1.bam, 2.bam, 3.bam]
BED_file = regions.bed
plotHeatmap(BAM_collection, BED_file)
=> [plot([1.bam, 2.bam, 3.bam], regions.bed),
plot([1.bam, 2.bam, 3.bam], regions.bed),
plot([1.bam, 2.bam, 3.bam], regions.bed)]
The current behavior for plotEnrichment might be a good default
BAM_collection = [1.bam, 2.bam]
BED_collection = [1.bed, 2.bed]
plotEnrichment(BAM_collection, BED_collection)
=> [plot(1.bam, 1.bed), plot(1.bam, 2.bed), plot(2.bam, 1.bed), plot(2.bam, 2.bed)]
But we would also like the option to have this pattern:
BAM_collection = [1.bam, 2.bam]
BED_collection = [1.bed, 2.bed]
plotEnrichment(BAM_collection, BED_collection)
=> [plot(1.bam, 1.bed), plot(2.bam, 2.bed)]
@dpryan79 it seems like you wrote the plotEnrichment galaxy wrapper. Does this sound reasonable / feasible to you?
@dpryan79 this issue is causing us ever-more painful over-computing. I'd love if we could resolve this sometime in the next couple weeks.
I'm back from vacation now so I'll hopefully have time to implement something like this in the near future.
@dpryan79 Any progress?
@zfrenchee Sorry no, I've been busy with other things. I'm hoping to handle something in this regard for the 3.2 release, which has no ETA at this point.
@dpryan79 this would still be very valuable to us.