Human and mouse read disambiguation for PDX samples?
Description of feature
Hi,
Thank you for creating this awesome pipeline. I'm wondering if there are any modules within sarek that disambiguate mouse and human reads for PDX samples? For example, something like the disambiguate tool from AstraZeneca:
https://github.com/AstraZeneca-NGS/disambiguate
Thanks for your time and help.
Best, Asher
Hey!
Is this related to a similar preprocessing step as requested here: https://github.com/nf-core/sarek/issues/1144 ?
So far we have refrained from expanding the scope of sarek even further to keep the pipeline maintainable. If it is a single tool, I am slightly more inclined to have it added. What else would be necessary to make this work in the current workflow?
Hi there,
This is related to the preprocessing step referenced in #1144.
Totally makes sense, I'm sure it takes a lot of time and effort to maintain. I was corresponding with @SPPearce about this on Slack (link here), and he has written a subworkflow for this: https://nf-co.re/subworkflows/fastq_align_bamcmp_bwa. It relies on three tools: (i) bwa to align to both references, (ii) bamcmp to keep reads that align to the first genome, and (iii) samtools to sort.
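To make the filtering idea in step (ii) concrete, here is a toy Python sketch of how a read could be assigned once it has been aligned to both references. This is only my illustration of the general disambiguation logic, not bamcmp's actual implementation: the `ReadScores` class, the category names, and the "higher alignment score wins" rule are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ReadScores:
    """Hypothetical container: one read's alignment score per reference.

    A score of None means the read did not align to that reference.
    """
    name: str
    human: Optional[int]
    mouse: Optional[int]


def classify(read: ReadScores) -> str:
    """Assign a read to a category by comparing its two alignment scores.

    Assumed rule for this sketch: the reference with the higher score wins,
    and equal scores are reported as ambiguous.
    """
    if read.human is None and read.mouse is None:
        return "unaligned"
    if read.mouse is None:
        return "human_only"
    if read.human is None:
        return "mouse_only"
    if read.human > read.mouse:
        return "human_better"
    if read.mouse > read.human:
        return "mouse_better"
    return "ambiguous"


def keep_for_human_analysis(read: ReadScores) -> bool:
    """Keep reads that align only to human, or better to human than to mouse."""
    return classify(read) in {"human_only", "human_better"}


# Made-up example reads, purely for illustration.
reads = [
    ReadScores("r1", human=60, mouse=None),
    ReadScores("r2", human=40, mouse=58),
    ReadScores("r3", human=55, mouse=55),
    ReadScores("r4", human=59, mouse=31),
]
kept = [r.name for r in reads if keep_for_human_analysis(r)]
```

Whether ambiguous reads (equal scores) should be kept or dropped is a design choice; this sketch drops them, which is the conservative option for variant calling on the human genome.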
I haven't tested it out yet, but I think that, to integrate this for PDX or other samples with contamination, this subworkflow would be run in lieu of the fastq_align_bwamem_mem2_dragmap_sentieon and bam_merge_index_samtools subworkflows. It could be enabled by an optional flag for these types of samples.
I would also be happy to try writing this in the next couple of months, but I am thus far a Nextflow novice :)
Thanks for your time and help!
Best, Asher
I do think we could do with this ability in some way, whether via bamcmp or something else. One suggestion was a completely separate pipeline for this kind of filtering, generating BAM (or FASTQ) files which can then go into many different pipelines.
Hello,
Is there any update on this issue?
I work in a bioinformatics facility and we use sarek for standard genomic analyses on human samples. We have some projects with PDX samples, and so far we have been using BBSplit before running sarek to filter out mouse reads. It would be really useful if the tool were incorporated into sarek. Even though I'm only an nf-core user and I've never developed a pipeline, I don't think it would be very complicated to incorporate it as a single tool with one or two parameters, similar to how it is already implemented in the nf-core/rnaseq pipeline, which we also use routinely.
I'm happy to try to incorporate it myself following the contribution guidelines if this is something that could be interesting for other users.
Thanks and best,
Alba
I support this, it is a commonly requested addition. I think BBSplit works on FASTQ files, right? So it would be relatively straightforward to implement in the same way as in rnaseq.
Hey! If it is a single tool, I am inclined to agree to adding it. We meet on Mondays to discuss ongoing dev work and talk about development in #sarek_dev. You are welcome to join if you want to give it a try :)
Great, I'll give it a try. I just joined the #sarek_dev channel!
Wanted to add that I will be excited to use this feature if included! BBSplit would be great. I have also used xengsort (https://gitlab.com/genomeinformatics/xengsort), which was recommended to me on the #sarek_dev channel a while back, and it worked similarly well.
Cheers, Asher
PS Also happy to help out with any dev if needed, or at the very least I would be happy to beta test the feature :)
Are you all joining the hackathon? This would be a nice project, I think. And we would be around to help review.
I would be happy to join if I'm available, and others from my team may be keen to join as well. When is the hackathon? Sorry if I've missed this on the Slack channel.
https://nf-co.re/events/2025/hackathon-march-2025
It's merged! 🚀
This is wonderful, thanks Friederike!