methylseq icon indicating copy to clipboard operation
methylseq copied to clipboard

dsl2 port

Open phue opened this issue 3 years ago • 15 comments

Continuation of https://github.com/nf-core/methylseq/pull/182 (now from a dedicated branch)

:warning: Breaking change:

The pipeline now requires a sample sheet to be passed to the pipeline with --input:

sample fastq_1 fastq_2 genome

See an example here

The genome column is optional, --genome continues to work as well

Known issues:

  • [x] ~workflow summary shows NAs~ -> (fixed by @ewels in https://github.com/nf-core/methylseq/pull/206)
  • [x] ~MultiQC report is missing some data~ -> (fixed in 0fd1614)
  • [x] ~-profile test_full needs a samplesheet.csv~ -> (fixed in https://github.com/nf-core/test-datasets/pull/232)

phue avatar Mar 24 '21 18:03 phue

workflow summary shows NAs

Maybe you mean as described in https://github.com/nf-core/tools/issues/971 ? In which case, upstream fix is required.

ewels avatar Mar 25 '21 21:03 ewels

Thanks for this update @phue ! It looks great 🎉 I did try this and found some errors/fixes

  1. Gives an error when few samples are single-end and others paired-end. Would be great if it can detect presence of second pair.
  2. Also, the output bam files in deduplicated folder inside bismark has multiple files with names sample_id_1_val_1_bismark_bt2_pe.deduplicated.bam and sample_id.deduplicated.sorted.bam . It is bit confusing and would be great if you can only put the final bam files in.
  3. There seems to be some error in $multicore variable in modules/local/software/bismark/align/main.nf .. it seems to resolve when I remove that.

Cheers, R

rbpisupati avatar Mar 26 '21 14:03 rbpisupati

Hi @rbpisupati,

thanks alot for testing this.

1. Gives an error when few samples are single-end and others paired-end. Would be great if it can detect presence of second pair.

Would you be able to share some more information on this? Where does the error occur exactly?

2. Also, the output bam files in `deduplicated` folder inside `bismark` has multiple files with names `sample_id_1_val_1_bismark_bt2_pe.deduplicated.bam` and `sample_id.deduplicated.sorted.bam` . It is bit confusing and would be great if you can only put the final bam files in.

Fixed in 5ad41fe (it copied both the sorted and the unsorted bam into the results directory. It should be enough to keep the former)

3. There seems to be some error in $multicore variable in `modules/local/software/bismark/align/main.nf` .. it seems to resolve when I remove that.

Fixed in df97fa1

phue avatar Mar 30 '21 07:03 phue

If I remember correctly, it is trying to do paired-end mapping with filename null since file path isnt present in the samplesheet.csv.

rbpisupati avatar Mar 30 '21 11:03 rbpisupati

If I remember correctly, it is trying to do paired-end mapping with filename null since file path isnt present in the samplesheet.csv.

I tried to reproduce it with this samplesheet but it works as expected. How does your samplesheet look like?

phue avatar Apr 03 '21 18:04 phue

nf-core lint overall result: Passed :white_check_mark:

Posted for pipeline commit 3785c0b

+| ✅ 158 tests passed       |+

:white_check_mark: Tests passed:

Run details

  • nf-core/tools version 2.6
  • Run at 2022-11-03 23:43:18

github-actions[bot] avatar Apr 06 '21 17:04 github-actions[bot]

Ack, I tried to bring this DSL2 branch back up to date with dev, but now tests are failing and I have no idea why 😰

@phue / anyone else have some time to look at this?

ewels avatar Jun 24 '21 12:06 ewels

ok I started trying to bring this up to date in https://github.com/nf-core/methylseq/pull/222

I did it as a PR to the dsl2 branch so that I / others can see the changes I've made to the original work from @phue. Hopefully once I get that working we can then move through the remaining PRs to a release relatively quickly.

ewels avatar Nov 04 '21 16:11 ewels

I think (hope) this is ready to merge into dev. I did make one substantial plumbing change - for the bismark workflow, I elected to sort the bam immediately after mapping. This mirrors the bwa-meth workflow and seems to make everything a bit more consistent. Also, this ensures that preseq receives sorted bam

njspix avatar May 20 '22 12:05 njspix

Hi @njspix,

thanks a bunch for picking this up and pushing it forward, great job!

Though I have to say I'm not particularly happy with the fact that the "one-sample-one-genome"-functionality that I put in to address https://github.com/nf-core/methylseq/issues/181 was lost along the way, I think we had agreed to disagree on that one (x-ref https://github.com/nf-core/modules/pull/129#issuecomment-774755102 https://github.com/nf-core/methylseq/pull/222#issuecomment-961871739).

I have a more serious concern however, about the changes made to the bismark workflow: If I remember correctly, the bismark tooling requires the bam file to be name-sorted, i.e bismark-methylation-extractor requires paired-end reads to consecutively follow each other in the input bam.

Did you test this on full-sized paired-end data (ideally with and without deduplication) ?

phue avatar May 25 '22 11:05 phue

@phue Thanks for reviewing! I am sorry about the per-sample genome functionality, I figured it was probably best to at least start with just basic functionality for a first release. I would be happy to help add in that feature later (I had some ideas about how to structure the channels so we wouldn't have to patch the aligner modules).

Thanks for raising the concern about bismark name sorting. I'll check and make sure.

njspix avatar May 25 '22 13:05 njspix

Yup, Bismark needs BAMs to be sorted by name: https://github.com/FelixKrueger/Bismark/tree/master/Docs#iii-bismark-deduplication-step

I also am partly to blame for dropping the per-sample functionality. I was pushing to not have local copies and made some headway with building new tooling to support patched versions of central modules. But life happened and I never finished it (maybe something else that I can delegate to @mirpedrol at some point!).

Let's keep it as a high priority to add in ASAP via either my or @njspix method. But for now the top priority must be DSL2 conversion and release I think 👍🏻

ewels avatar May 25 '22 13:05 ewels

Don't get me wrong I'm not blaming anyone here, I think what happened is that it got decided by popular vote a while ago. So fine with me, although I would have liked to see that functionality.

Maybe it will come up in other pipelines as well at some point and the necessary changes will make it to nf-core modules so it can be easily added back here (without the need for local modules to implement it)

phue avatar May 25 '22 14:05 phue

If we're making a wishlist, I'd love to handle different kits in the same workflow run! :grin: I think the samplesheet opens up for this in the future.

On the topic of #181, maybe we could add the additional fasta functionality in rnaseq. I cat'd them together manually ahead of time to make a reference.

edmundmiller avatar May 25 '22 14:05 edmundmiller

Yup, Bismark needs BAMs to be sorted by name: https://github.com/FelixKrueger/Bismark/tree/master/Docs#iii-bismark-deduplication-step

The previous commits correct this, and also a plumbing issue that was causing preseq to fail.

I also added in the trim_nexseq flag just for funsies...

njspix avatar Jul 07 '22 21:07 njspix

Let's do it! :shipit:

lightspeed

ewels avatar Nov 03 '22 23:11 ewels