rnaseq Check for unexpected species contaminants

Especially with low-input non-human samples, a useful QC step would be to screen for unexpected species (e.g. human, E. coli).

https://github.com/csawye01/nf-core-demultiplex/blob/d336df6dc08caed44505e286f9a47e386986d50c/main.nf#L53
https://www.bioinformatics.babraham.ac.uk/projects/fastq_screen/

Aug 23 '19 15:08 cutsort

Would using sourmash (https://github.com/dib-lab/sourmash) help with reducing database size as it subsamples the k-mers from the database using MinHash?
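To illustrate why MinHash subsampling shrinks a database, here is a minimal pure-Python sketch of a bottom-k MinHash. It is not sourmash's implementation or API; the function names, the BLAKE2 hash, and the default `k` and `num` values are all assumptions chosen for illustration.

```python
import hashlib


def kmers(seq, k):
    """Yield every overlapping k-mer of a sequence."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def hash_kmer(kmer):
    """Map a k-mer to a stable 64-bit integer (hash choice is an assumption)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def bottom_sketch(seq, k=21, num=500):
    """Keep only the `num` smallest distinct k-mer hashes of a sequence."""
    hashes = {hash_kmer(km) for km in kmers(seq, k)}
    return sorted(hashes)[:num]


def jaccard_estimate(sketch_a, sketch_b, num=500):
    """Estimate Jaccard similarity from two bottom-k sketches."""
    # Take the `num` smallest hashes of the union, then count how many
    # of those are present in both sketches.
    merged = sorted(set(sketch_a) | set(sketch_b))[:num]
    if not merged:
        return 0.0
    shared = set(sketch_a) & set(sketch_b)
    return sum(1 for h in merged if h in shared) / len(merged)
```

Whatever the genome size, each reference is stored as at most `num` integers, which is where the reduction in database size comes from.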

Aug 23 '19 15:08 olgabot

I suspect a k-mer based approach is likely to be the most efficient way of screening for contaminants, since it lets you pass a relatively unbiased and large database of organisms.

@olgabot Have there been any comparisons made between kraken2 and sourmash? They seem to have similar applications. kraken2 is written in C++ and is quite rapid, whereas sourmash is written in Python with some optimisation, I'm assuming?

Aug 23 '19 15:08 drpatelh

I use kraken2 in the nf-core/bacass pipeline to check for potential contamination prior to doing the assembly. That works quite nicely, though you still need to specify a database, of course.
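As a rough sketch of how such a contamination check could be summarised, the snippet below scans a Kraken2 report and lists species that are not expected in the sample. It assumes the default six-column, tab-delimited report (percentage, clade reads, direct reads, rank code, taxid, name); the function name, the `expected_taxids` argument, and the 1% threshold are hypothetical and are not taken from nf-core/bacass.

```python
def flag_unexpected_species(report_lines, expected_taxids, min_percent=1.0):
    """Return (name, taxid, percent) for unexpected species-level hits.

    Assumes the default six-column Kraken2 report layout.
    """
    flagged = []
    for line in report_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 6:
            continue  # skip blank or malformed lines
        percent, _clade, _direct, rank, taxid, name = fields[:6]
        if rank != "S":
            continue  # only consider species-level rows
        if int(taxid) in expected_taxids:
            continue  # the organism we meant to sequence
        if float(percent) >= min_percent:
            flagged.append((name.strip(), int(taxid), float(percent)))
    # Most abundant unexpected species first.
    return sorted(flagged, key=lambda hit: hit[2], reverse=True)
```

A caller would pass the lines of the report file plus the NCBI taxids of the expected organisms, for example `flag_unexpected_species(open(path), {562})` for an E. coli sample, where `path` is a placeholder for the report location.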

Sep 04 '19 09:09 apeltzer

Some information about Sourmash: https://github.com/dib-lab/sourmash/issues/725

And their paper: https://f1000research.com/articles/8-1006

Sep 05 '19 18:09 olgabot

If (in addition to rRNA removal as suggested in #227) an optional step were added that would remove all reads from a particular species, e.g. human, then this pipeline might also be able to efficiently analyze metatranscriptomics from human samples.

Sep 18 '19 14:09 d4straub

Very interesting! What would you describe as the best way to do host removal?


Sep 20 '19 01:09 olgabot

I work on metagenomics of environmental samples rather than human samples, so my ideas should be taken with a little caution.

The simplest solution would be to map reads to the host genome (with the mapper of choice) and forward all unmapped reads for analysis. However, non-host sequences that are similar to the host could be lost in the process. This could be minimized by using strategies such as KRAKEN2 on relevant references (e.g. human + bacteria, removing all reads that are annotated as human) or DIAMOND (e.g. on the whole of Ensembl).
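The Kraken2 variant of this idea can be sketched as follows: collect the IDs of reads classified as the host, then drop those reads from the FASTQ. This is a minimal illustration, not a tested pipeline step. It assumes Kraken2's default per-read output (tab-delimited: C/U flag, read ID, taxid, length, LCA mapping) produced without `--use-names`, single-end uncompressed four-line FASTQ records, and hypothetical function names. Only reads assigned exactly to `host_taxid` are removed, so reads assigned to an ancestor taxon are kept.

```python
def host_read_ids(kraken_lines, host_taxid=9606):
    """Collect IDs of reads that Kraken2 classified as the host taxid."""
    ids = set()
    for line in kraken_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3 or fields[0] != "C":
            continue  # skip malformed lines and unclassified reads
        if fields[2].strip() == str(host_taxid):
            ids.add(fields[1])
    return ids


def drop_reads(fastq_lines, ids_to_drop):
    """Return FASTQ lines with the given read IDs removed.

    Assumes strict four-line records (header, sequence, '+', qualities).
    """
    kept = []
    record = []
    for line in fastq_lines:
        record.append(line.rstrip("\n"))
        if len(record) == 4:
            # The read ID is the header up to the first whitespace, minus '@'.
            read_id = record[0][1:].split()[0]
            if read_id not in ids_to_drop:
                kept.extend(record)
            record = []
    return kept
```

Matching on the exact taxid is the conservative choice here: it limits the loss of non-host sequences that merely resemble the host, at the cost of leaving some host reads in the data.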

Here is an example where contaminant reads were removed by bowtie mapping (in the tool KneadData) to focus on the endogenous E. coli strain.

Sep 20 '19 08:09 d4straub
