Check for unexpected species contaminants

Open cutsort opened this issue 6 years ago • 7 comments

Especially with low-input non-human samples, a useful QC step would be to screen for unexpected species (e.g. human, E. coli).

  • https://github.com/csawye01/nf-core-demultiplex/blob/d336df6dc08caed44505e286f9a47e386986d50c/main.nf#L53
  • https://www.bioinformatics.babraham.ac.uk/projects/fastq_screen/

cutsort avatar Aug 23 '19 15:08 cutsort

Would using sourmash (https://github.com/dib-lab/sourmash) help with reducing database size as it subsamples the k-mers from the database using MinHash?
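
To make the subsampling idea concrete, here is a rough sketch of a bottom-k MinHash in plain Python. It is not sourmash's implementation or API: the function names, the choice of BLAKE2b as the hash, and the defaults `k=21` and `num=500` are all assumptions for illustration, and reverse-complement canonicalisation of k-mers is left out. The point is only that each genome is reduced to its `num` smallest k-mer hashes, so the database stores a few hundred integers per organism rather than every k-mer.

```python
import hashlib


def kmers(seq, k):
    """Return the set of all k-mers in a sequence (no reverse-complement handling)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def hash_kmer(kmer):
    """Map a k-mer to a 64-bit integer (hash choice is an assumption, not sourmash's)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def bottom_sketch(seq, k=21, num=500):
    """Keep only the `num` smallest k-mer hashes: the subsampled sketch."""
    hashes = sorted({hash_kmer(km) for km in kmers(seq, k)})
    return hashes[:num]


def jaccard_estimate(sketch_a, sketch_b, num=500):
    """Estimate k-mer Jaccard similarity from two bottom-k sketches."""
    # Take the `num` smallest hashes of the union, then count how many
    # of those are present in both sketches.
    merged = sorted(set(sketch_a) | set(sketch_b))[:num]
    if not merged:
        return 0.0
    shared = set(sketch_a) & set(sketch_b)
    return sum(1 for h in merged if h in shared) / len(merged)
```

A sample sketch can then be compared against one sketch per candidate contaminant, and organisms with a non-trivial estimate flagged for a closer look.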

olgabot avatar Aug 23 '19 15:08 olgabot

I suspect a k-mer based approach is likely to be the most efficient way of screening for contaminants, since it lets you pass a relatively unbiased and large database of organisms.

@olgabot Have any comparisons been made between kraken2 and sourmash? They seem to have similar applications. kraken2 is written in C++ and is quite rapid, whereas sourmash is written in Python, with some optimisation I'm assuming?

drpatelh avatar Aug 23 '19 15:08 drpatelh

I use kraken2 in the nf-core/bacass pipeline to check for potential contamination prior to doing the assembly. That works quite nicely, though you do, of course, still need to specify a database.

apeltzer avatar Sep 04 '19 09:09 apeltzer

Some information about Sourmash: https://github.com/dib-lab/sourmash/issues/725

And their paper: https://f1000research.com/articles/8-1006

olgabot avatar Sep 05 '19 18:09 olgabot

If (in addition to rRNA removal as suggested in #227) an optional step were added that removed all reads from a particular species, e.g. human, then this pipeline might also be able to efficiently analyze metatranscriptomics data from human samples.

d4straub avatar Sep 18 '19 14:09 d4straub

Very interesting! What would you describe as the best way to do host removal?

olgabot avatar Sep 20 '19 01:09 olgabot

I am not involved in metagenomics of human samples myself, only of environmental samples, so my ideas should be taken with some caution.

The simplest solution would be to map reads against the host genome (with the mapper of your choice) and forward all unmapped reads for analysis. However, non-host sequences that are similar to the host could be lost in the process. This loss could be minimized by using strategies such as KRAKEN2 on relevant references (e.g. human + bacteria, then removing all reads that are annotated as human) or DIAMOND (e.g. on the whole of Ensembl).
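
As a rough illustration of the "remove all that are annotated as human" idea, the sketch below reads Kraken2's default per-read output (tab-separated: `C`/`U` status, read ID, taxonomy ID, length, LCA mappings) and drops the matching records from a FASTQ stream. The function names are hypothetical, and it assumes plain four-line FASTQ records, output produced without `--use-names`, and an exact match on taxid 9606; reads assigned to a higher rank (e.g. Primates) would pass through unless their taxids are added to `host_taxids`.

```python
HUMAN_TAXID = "9606"  # NCBI taxonomy ID for Homo sapiens


def host_read_ids(kraken_lines, host_taxids=frozenset({HUMAN_TAXID})):
    """Collect IDs of reads that Kraken2 classified as one of the host taxids."""
    ids = set()
    for line in kraken_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip malformed or empty lines
        status, read_id, taxid = fields[0], fields[1], fields[2]
        if status == "C" and taxid in host_taxids:
            ids.add(read_id)
    return ids


def filter_fastq(fastq_lines, drop_ids):
    """Yield FASTQ lines for every record whose read ID is not in drop_ids."""
    record = []
    for line in fastq_lines:
        record.append(line)
        if len(record) == 4:  # assumes unwrapped four-line records
            header = record[0][1:].split()
            read_id = header[0] if header else ""
            if read_id not in drop_ids:
                yield from record
            record = []
```

For paired-end data the same set of IDs would need to be applied to both mate files so that the pairs stay in sync.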

Here is an example where contaminant reads were removed by bowtie mapping (in the tool KneadData) to focus on the endogenous E. coli strain.

d4straub avatar Sep 20 '19 08:09 d4straub