taxprofiler
Consider default parameters for profilers
Description of feature
I don't have a lot of experience with all the tools so I can only speak for kraken2 and Bracken:
kraken2
- It can be disastrous not to set the --confidence parameter; see this discussion and other issues on the repo. So my vote is to use a rather strict value.
- A newly introduced parameter, --minimum-hit-groups, should also be set; see here.
Bracken
- Bracken has the parameter -t, which sets the minimum number of reads that kraken2 must have assigned to a taxon for it to be considered in the redistribution of reads. This is set to 10 by default, which is well within the typical number of reads that I see kraken2 assign as false positives.
Please add your points of view and expand for other tools where you have experience.
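To make the flags above concrete, here is a minimal Python sketch that assembles the two command lines. The flags (--confidence, --minimum-hit-groups, -t) are the ones discussed above; the database path, file names, read length, and numeric values are hypothetical placeholders for illustration only, not proposed defaults.

```python
import shlex


def kraken2_cmd(db, reads, report, confidence, min_hit_groups):
    """Build a kraken2 command line with explicit classification thresholds."""
    return [
        "kraken2",
        "--db", db,
        "--confidence", str(confidence),
        "--minimum-hit-groups", str(min_hit_groups),
        "--report", report,
        *reads,
    ]


def bracken_cmd(db, report, output, read_length, level, threshold):
    """Build a Bracken command line with an explicit minimum-read threshold (-t)."""
    return [
        "bracken",
        "-d", db,
        "-i", report,
        "-o", output,
        "-r", str(read_length),
        "-l", level,
        "-t", str(threshold),
    ]


if __name__ == "__main__":
    # Every value below is a placeholder, not a recommendation.
    k2 = kraken2_cmd("k2_db", ["sample.fastq"], "sample.kreport", 0.5, 3)
    br = bracken_cmd("k2_db", "sample.kreport", "sample.bracken", 150, "S", 50)
    print(shlex.join(k2))
    print(shlex.join(br))
```

Keeping the thresholds as explicit arguments makes it obvious in the logs which values a given run actually used.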
For MALT (the only one I really have much experience with), which is presumably only going to be used by people in aDNA, I would use the settings here:
The maximal E-value was set to 1.0. The maximal number of alignments for each query was set to 100. The minimal percent identity was set to 85. The number of threads was set to 32. The alignment type of MALT was set to Local in order to be comparable to the other programs.
But possibly increase % identity to 90
https://www.biorxiv.org/content/10.1101/050559v1.full
I do not have experience with KrakenUniq, but I came across this discussion in their GitHub repo:
https://github.com/fbreitwieser/krakenuniq/issues/112
Just saw this in our config:
shortread_qc_minlength = 15
Considering that the k-mer-based profilers use a k-mer length of around 35 by default, this is way too short. Maybe a default of 45 or so?
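The arithmetic behind this is simple: a read of length L contains L - k + 1 overlapping k-mers, so a read shorter than k gives a k-mer-based profiler nothing to look up. A small sketch, assuming k = 35 (the kraken2 default for nucleotide databases; other profilers and custom databases may use a different k):

```python
def kmers_per_read(read_length: int, k: int = 35) -> int:
    """Number of overlapping k-mers in a read; zero if the read is shorter than k."""
    return max(0, read_length - k + 1)


for length in (15, 35, 45):
    print(length, kmers_per_read(length))
# 15 -> 0, 35 -> 1, 45 -> 11
```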
I'm not sure about this. Some of the tools are alignment-based, and for those 15 could still be valid (I think I based 15 on the default of one of the tools, but unfortunately I can't remember which one now).
Regarding kaiju:
https://github.com/bioinformatics-centre/kaiju/issues/209#issuecomment-1020019707
The conclusion from your link to Kaiju would be to run it in non-greedy mode? Or at least play around with that to see what difference it makes.
I was thinking of playing around with it first and then deciding whether we should set a default parameter for kaiju.
For kraken2
Confidence
I have tested kraken2 with a set of confidence values (0, 0.1, 0.3, 0.5, 0.7, 0.9) for a validated dataset in which we know the true number of reads of the target virus.
The results indicate that confidence=0.1 is the best choice for samples with low concentration: confidence>0.1 usually fails to identify the targeted virus, while confidence=0 also does reasonably well at identification but with more false positives.
For viruses with high concentration, confidence values of 0, 0.1, 0.3, 0.5, and even 0.7 perform quite well. Lower confidence values assign more reads to the target virus, but the count is still lower than the true value.
So I would go for confidence=0.1 if the research question is just to identify certain viruses.
--minimum-hit-groups
The default value is 2 in kraken2 v2.1.2.
This value could vary quite a lot depending on the research question. If you have a highly diverse dataset or are interested in detecting rare taxa, you may want to use a lower value for --minimum-hit-groups to increase sensitivity. On the other hand, if you have a relatively simple dataset or are interested in detecting only highly abundant taxa, you may want to choose a higher value to increase specificity.
I think default=2 is a good option in general.
From Slack:
From defaults to databases: parameter and database choice dramatically impact the performance of metagenomic taxonomic classification tools
https://www.microbiologyresearch.org/content/journal/mgen/10.1099/mgen.0.000949
I'm not quite sure about the minimum hit groups right now but I just stumbled across an example read assignment that makes me even more sure that a low confidence value is a bad idea.
C A01136:446:H7J2YDSX5:4:2362:26078:14246 1173 101|101 18:46 22:8 18:7 22:5 1173:1 |:| 0:13 18:5 0:6 18:43
You can see here that the read is labeled as taxon 1173 although only a single k-mer is assigned to it, whereas the clear majority of k-mers belong to taxon 18.
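For anyone who wants to check such lines in bulk, here is a minimal sketch that tallies the k-mer column per taxon. It only sums the taxid:count pairs in the last field; it does not model the taxonomy, so it cannot reproduce kraken2's clade-based confidence score on its own.

```python
from collections import Counter


def tally_kmers(kmer_field: str) -> Counter:
    """Sum k-mer counts per taxon ID from the last column of kraken2 output.

    The field is a space-separated list of taxid:count pairs; "|:|" separates
    the two mates of a read pair, and taxid 0 marks unclassified k-mers.
    """
    counts = Counter()
    for token in kmer_field.split():
        if token == "|:|":
            continue
        taxid, n = token.split(":")
        counts[taxid] += int(n)
    return counts


field = "18:46 22:8 18:7 22:5 1173:1 |:| 0:13 18:5 0:6 18:43"
counts = tally_kmers(field)
total = sum(counts.values())
for taxid, n in counts.most_common():
    print(f"taxon {taxid}: {n}/{total} k-mers")
```

For the read above this gives 101 of 134 k-mers for taxon 18, 19 unclassified, 13 for taxon 22, and 1 for taxon 1173.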
I agree that --minlength = 15 is too short. A value of around 30-35 could be a good choice. It was set to 50 for Illumina data in nf-core/viralrecon. There are also a few cases that used 30, for instance: https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-017-1927-y https://www.diva-portal.org/smash/get/diva2:1607219/FULLTEXT01.pdf
I'm somewhat confused by your comment, since --minlength is yet another option that we haven't discussed so far.
I apologize for any confusion. In your previous comments, you mentioned the parameter shortread_qc_minlength = 15, which is used in both Fastp and AdapterRemoval to specify the minimum length of reads to be retained after quality control filtering. I have read some articles about how this value is chosen.
Okay, now I see which comment you're referring to 🙂