NGSpeciesID
Flexible clustering?
Is there a way to adjust clustering parameters?
For example, some OTU pipelines generate a different number of bins depending on whether you want 98% or 95% similarity clusters. I do not see options for this in NGSpeciesID. On that note, what is the percent similarity threshold used here?
Yes, there are several parameters to adjust (NGSpeciesID uses isONclust for the clustering step). However, NGSpeciesID is not built for separating or clustering sequences at an exact, pre-determined similarity threshold.
You can mimic very stringent clustering by setting a large --k, a lower --w, and high --mapped_threshold and --aligned_threshold. For example, --mapped_threshold and --aligned_threshold could be set to 0.9, --k to between 20 and 30, and --w to between k+10 and k+30, if you are working with sequences without many errors.
The above suggestion will not work when your sequences have error rates above roughly 2%, as is the case for ONT long reads.
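As a rough illustration of the suggestion above, here is a small Python sketch that assembles the clustering flags and checks that they stay inside the suggested ranges. The helper name `stringent_cluster_args`, the file name `reads.fastq` and the output folder `out_stringent` are made up for the example. The `--fastq` and `--outfolder` flags are assumed from the NGSpeciesID README, so check `NGSpeciesID --help` for your installed version before running the printed command.

```python
import shlex


def stringent_cluster_args(k=25, w_offset=20, threshold=0.9):
    """Build the stringent clustering flags as a list of CLI tokens.

    Hypothetical helper: it only encodes the ranges suggested in this thread
    (k in 20..30, w in k+10..k+30, both thresholds set to the same value).
    """
    if not 20 <= k <= 30:
        raise ValueError("suggested range for --k is 20 to 30")
    if not 10 <= w_offset <= 30:
        raise ValueError("suggested range for --w is k+10 to k+30")
    w = k + w_offset
    return [
        "--k", str(k),
        "--w", str(w),
        "--mapped_threshold", str(threshold),
        "--aligned_threshold", str(threshold),
    ]


# Input file and output folder are placeholders; --fastq and --outfolder
# are assumed from the README and should be checked against --help.
base = ["NGSpeciesID", "--fastq", "reads.fastq", "--outfolder", "out_stringent"]
command = base + stringent_cluster_args(k=25, w_offset=20)

# Print the command instead of running it, so it can be reviewed first.
print(shlex.join(command))
```

Printing rather than executing keeps the sketch safe to run on a machine without NGSpeciesID installed. Remember that these settings are only expected to help with low-error reads.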
FYI, all of these parameters relate to the clustering:
--k K Kmer size (default: 15)
--w W Window size (default: 50)
--min_shared MIN_SHARED
Minimum number of minimizers shared between read and cluster (default: 5)
--mapped_threshold MAPPED_THRESHOLD
Minimum mapped fraction of read to be included in cluster. The density of minimizers to classify a region as mapped depends on quality of the read. (default: 0.7)
--aligned_threshold ALIGNED_THRESHOLD
Minimum aligned fraction of read to be included in cluster. Aligned identity depends on the quality of the read. (default: 0.4)
--min_fraction MIN_FRACTION
Minimum fraction of minimizers shared compared to best hit, in order to continue mapping. (default: 0.8)
--min_prob_no_hits MIN_PROB_NO_HITS
Minimum probability for i consecutive minimizers to be different between read and representative and still considered as mapped region, under assumption that they come from the same transcript (depends on read quality). (default: 0.1)
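For intuition about what --k and --w control, here is a toy Python sketch of minimizer selection. It is not isONclust's implementation. It assumes that the window size is measured in bases (so each window holds w - k + 1 k-mers), that k-mers are ordered lexicographically, and that ties go to the leftmost k-mer. The function name `minimizers` and the example sequence are made up for illustration.

```python
def minimizers(seq, k, w):
    """Return sorted (position, kmer) pairs for a toy minimizer scheme.

    In every window of w bases, the lexicographically smallest k-mer is
    picked; ties are resolved in favour of the leftmost k-mer.
    """
    if w < k:
        raise ValueError("window size w must be at least k")
    kmers_per_window = w - k + 1
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    picked = set()
    for start in range(len(kmers) - kmers_per_window + 1):
        window = kmers[start:start + kmers_per_window]
        # min() returns the first index with the smallest k-mer.
        offset = min(range(len(window)), key=lambda j: window[j])
        picked.add((start + offset, window[offset]))
    return sorted(picked)


seq = "ACGTACGTGA"  # made-up example sequence
for w in (3, 5):
    chosen = minimizers(seq, k=3, w=w)
    print(f"k=3, w={w}: {len(chosen)} minimizers -> {chosen}")
```

In this toy scheme, a smaller window keeps more minimizers per read (denser sampling), while a larger k makes each minimizer more specific, which is in line with the advice to raise --k and lower --w for stringent clustering of low-error reads.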