Consider default parameters for profilers

Open Midnighter opened this issue 2 years ago • 14 comments

Description of feature

I don't have a lot of experience with all the tools so I can only speak for kraken2 and Bracken:

kraken2

  • It can be disastrous to not set the --confidence parameter, see this discussion and other issues on the repo. So my vote is to use a rather strict value.
  • A newly introduced parameter --minimum-hit-groups should also be set, see here.

Bracken

  • Bracken has the parameter -t which sets the minimum number of reads that kraken2 must have assigned for the taxon to be considered in the redistribution of reads. This is set to 10 by default which is well within the typical numbers of reads that I see as false positives assigned by kraken2.
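
To make the three options above concrete, here is a minimal sketch that assembles the kraken2 and Bracken command lines with --confidence, --minimum-hit-groups and -t set explicitly. The helper names, file names and numeric values are hypothetical placeholders for illustration; the thread does not settle on specific numbers.

```python
def kraken2_cmd(db, reads, confidence, min_hit_groups, report, output):
    """Assemble a kraken2 call with both thresholds set explicitly."""
    cmd = [
        "kraken2", "--db", db,
        "--confidence", str(confidence),
        "--minimum-hit-groups", str(min_hit_groups),
        "--report", report,
        "--output", output,
    ]
    if len(reads) == 2:
        cmd.append("--paired")
    return cmd + list(reads)


def bracken_cmd(db, report, output, read_len, level, threshold):
    """Assemble a Bracken call; -t is the minimum read count per taxon."""
    return [
        "bracken", "-d", db, "-i", report, "-o", output,
        "-r", str(read_len), "-l", level, "-t", str(threshold),
    ]


# Placeholder values only (assumptions, not recommendations).
k2 = kraken2_cmd("k2_db", ["R1.fastq.gz", "R2.fastq.gz"], 0.5, 3,
                 "sample.kreport", "sample.kraken2")
br = bracken_cmd("k2_db", "sample.kreport", "sample.bracken", 150, "S", 50)
print(" ".join(k2))
print(" ".join(br))
```

The lists can be handed to `subprocess.run` as they are, which avoids shell quoting issues.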

Please add your points of view and expand for other tools where you have experience.

Midnighter avatar Oct 21 '22 09:10 Midnighter

For MALT (the only one I really have much experience with), which is presumably only going to be used by people in aDNA, I would use the settings here:

The maximal E-value was set to 1.0. The maximal number of alignments for each query was set to 100. The minimal percent identity was set to 85. The number of threads was set to 32. The alignment type of MALT was set to Local in order to be comparable to the other programs.

But possibly increase % identity to 90

https://www.biorxiv.org/content/10.1101/050559v1.full
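
As a rough sketch, the quoted settings could be written down as malt-run options like this. The short flag names are what I recall from the malt-run help text and should be checked against the installed version; the input files, index and mode are not given in the quote and are left out.

```python
# Settings from the quoted text, keyed by the (assumed) malt-run short flags.
malt_settings = {
    "-at": "Local",  # alignment type
    "-id": 85,       # minimal percent identity (possibly raise to 90)
    "-e": 1.0,       # maximal E-value
    "-mq": 100,      # maximal number of alignments per query
    "-t": 32,        # number of threads
}


def to_argv(settings):
    """Flatten a flag -> value mapping into a list of CLI arguments."""
    argv = []
    for flag, value in settings.items():
        argv.extend([flag, str(value)])
    return argv


malt_opts = to_argv(malt_settings)
print(" ".join(malt_opts))
```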

jfy133 avatar Oct 24 '22 14:10 jfy133

I do not have experience with KrakenUniq but I came across this discussion in their github repo

https://github.com/fbreitwieser/krakenuniq/issues/112

sofstam avatar Oct 25 '22 08:10 sofstam

Just saw this in our config:

    shortread_qc_minlength           = 15

Considering that the k-mer-based profilers use a k-mer length of around 35 by default, this is way too short. Maybe a default of 45 or so?
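
A quick sketch of the arithmetic behind this, assuming k = 35 as mentioned above: a read of length L contains L - k + 1 k-mers, so anything shorter than k contributes nothing to a k-mer-based profiler.

```python
def kmers_per_read(read_len, k=35):
    """Number of k-mers in a read; zero when the read is shorter than k."""
    return max(0, read_len - k + 1)


for min_len in (15, 35, 45):
    print(min_len, kmers_per_read(min_len))
# 15 0
# 35 1
# 45 11
```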

Midnighter avatar Nov 03 '22 10:11 Midnighter

I'm not sure about this. Some of the tools are alignment-based, and for those such short reads could still be valid (I think I based 15 off the default of one of the tools... but unfortunately I can't remember which one now...)

jfy133 avatar Nov 03 '22 11:11 jfy133

Regarding kaiju:

https://github.com/bioinformatics-centre/kaiju/issues/209#issuecomment-1020019707

sofstam avatar Nov 04 '22 11:11 sofstam

The conclusion from your link to Kaiju would be to run it in non-greedy mode? Or at least play around with that to see what difference it makes.

Midnighter avatar Nov 04 '22 11:11 Midnighter

I was thinking of playing around with it first and then seeing whether we should consider a default parameter for kaiju.

sofstam avatar Nov 04 '22 11:11 sofstam

For kraken2

Confidence

I have tested kraken2 with a set of confidence values (0, 0.1, 0.3, 0.5, 0.7, 0.9) for a validated dataset in which we know the true number of reads of the target virus.

The results indicate that confidence=0.1 would be the best choice for samples with a low concentration: confidence>0.1 usually fails to identify the target virus, while confidence=0 also identifies it reasonably well but with more false positives.

For viruses at high concentration, confidence values of 0, 0.1, 0.3, 0.5 and even 0.7 perform quite well. Lower confidence values assign more reads to the target viruses, but the counts are still lower than the true value.

So I would like to go for confidence=0.1 if the research question is just to identify certain viruses.
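
For readers unfamiliar with what the threshold does: as I understand the kraken2 manual, a label's score is the fraction of queried k-mers that map into the clade rooted at that label, and the label is moved up the tree until the score meets the threshold (or the read ends up unclassified). Below is a minimal sketch of that rule. The taxonomy and k-mer counts are entirely made up for illustration and are not from my test data.

```python
def clade_score(label, hits, parent):
    """Fraction of queried k-mers whose taxon lies in the clade rooted at label."""
    def in_clade(taxon):
        while taxon is not None:
            if taxon == label:
                return True
            taxon = parent.get(taxon)
        return False

    queried = sum(hits.values())  # taxon 0 = k-mers without a database hit
    inside = sum(n for t, n in hits.items() if t != 0 and in_clade(t))
    return inside / queried


def apply_confidence(label, hits, parent, threshold):
    """Move the label towards the root until its score meets the threshold."""
    while label is not None:
        if clade_score(label, hits, parent) >= threshold:
            return label
        label = parent.get(label)
    return 0  # unclassified


# Hypothetical taxonomy (child -> parent): 1 is the root, 30 a leaf.
parent = {1: None, 10: 1, 20: 10, 30: 20}
# Hypothetical k-mer hits for one read, initially labelled as taxon 30.
hits = {30: 1, 20: 4, 10: 10, 0: 5}

for threshold in (0.0, 0.1, 0.5, 0.9):
    print(threshold, apply_confidence(30, hits, parent, threshold))
# 0.0 30
# 0.1 20
# 0.5 10
# 0.9 0
```

So a higher threshold does not simply drop reads; it first pushes them to less specific ranks, which is one way a target virus can disappear from species-level results.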

--minimum-hit-groups

The default value is 2 in kraken2 v2.1.2.

This value could vary quite a lot based on the research question. If you have a highly diverse dataset or are interested in detecting rare taxa, you may want to use a lower value for --minimum-hit-groups to increase sensitivity. On the other hand, if you have a relatively simple dataset or are interested in detecting only highly abundant taxa, you may want to choose a higher value to increase specificity.

I think default=2 is a good option in general.

LilyAnderssonLee avatar Mar 01 '23 08:03 LilyAnderssonLee

From Slack: "From defaults to databases: parameter and database choice dramatically impact the performance of metagenomic taxonomic classification tools" https://www.microbiologyresearch.org/content/journal/mgen/10.1099/mgen.0.000949

sofstam avatar Mar 06 '23 08:03 sofstam

I'm not quite sure about the minimum hit groups right now but I just stumbled across an example read assignment that makes me even more sure that a low confidence value is a bad idea.

C	A01136:446:H7J2YDSX5:4:2362:26078:14246	1173	101|101	18:46 22:8 18:7 22:5 1173:1 |:| 0:13 18:5 0:6 18:43

You can see here that a single k-mer is assigned to taxon 1173 whereas the majority is clearly taxon 18.
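
A small sketch for tallying the k-mer hits in the line above, in case someone wants to check other reads the same way. It only counts direct hits per taxon; it does not compute kraken2's clade-based confidence score, which would need the taxonomy.

```python
from collections import Counter


def tally_kmer_hits(lca_field):
    """Sum k-mer counts per taxon from kraken2's 'taxid:count' field."""
    counts = Counter()
    for token in lca_field.split():
        if token == "|:|":  # separator between the two mates of a pair
            continue
        taxon, n = token.rsplit(":", 1)
        counts[taxon] += int(n)
    return counts


line = ("C\tA01136:446:H7J2YDSX5:4:2362:26078:14246\t1173\t101|101\t"
        "18:46 22:8 18:7 22:5 1173:1 |:| 0:13 18:5 0:6 18:43")
status, read_id, assigned, lengths, lca_field = line.split("\t")

counts = tally_kmer_hits(lca_field)
total = sum(counts.values())
print(dict(counts))  # {'18': 101, '22': 13, '1173': 1, '0': 19}
print(f"{counts[assigned]} of {total} k-mers hit taxon {assigned} directly")
# 1 of 134 k-mers hit taxon 1173 directly
```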

Midnighter avatar Mar 09 '23 01:03 Midnighter

I agree that --minlength = 15 is too short. A value of around 30-35 would be worth considering. It is set to 50 for Illumina data in nf-core/viralrecon, and there are also a few cases that use 30, for instance: https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-017-1927-y https://www.diva-portal.org/smash/get/diva2:1607219/FULLTEXT01.pdf

LilyAnderssonLee avatar Mar 16 '23 11:03 LilyAnderssonLee

I'm somewhat confused by your comment since --minlength is yet another option that we didn't discuss so far.

Midnighter avatar Mar 16 '23 12:03 Midnighter

I apologize for any confusion. In your previous comments, you mentioned the parameter shortread_qc_minlength = 15, which is used in both Fastp and AdapterRemoval to specify the minimum length of reads to be retained after quality control filtering. I have read some articles about how this value is chosen.

LilyAnderssonLee avatar Mar 16 '23 12:03 LilyAnderssonLee

Okay, now I see which comment you're referring to 🙂

Midnighter avatar Mar 16 '23 13:03 Midnighter