NGSpeciesID

Flexible clustering?

Open omarkr8 opened this issue 2 years ago • 1 comment

Is there a way to adjust clustering parameters?

For example, some OTU pipelines generate a different number of bins depending on whether you want 98% or 95% similarity clusters. I do not see options for this in NGSpeciesID. On that note, what IS the percentage threshold used here?

omarkr8 Dec 07 '22 04:12

Yes, there are several parameters to adjust (NGSpeciesID uses isONclust for the clustering step). However, NGSpeciesID is not built to separate or cluster sequences at an exact, predetermined similarity threshold.

You can mimic very stringent clustering by setting a large --k, a lower --w, and high --mapped_threshold and --aligned_threshold. For example, if you are working with sequences that have few errors, --mapped_threshold and --aligned_threshold could both be set to 0.9, --k to a value between 20 and 30, and --w to a value between k+10 and k+30.

The above suggestion will not work when your sequences have error rates higher than around 2%, as is the case with ONT long reads.
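To make the suggested ranges concrete, here is a minimal Python sketch that assembles a command line with the stringent settings described above and rejects values outside those ranges. The helper name `stringent_args`, the file names, and the chosen midpoint values (k=25, w=k+20) are illustrative assumptions. The `--fastq` and `--outfolder` flags are assumed from the NGSpeciesID README; check `NGSpeciesID --help` for the options your installed version actually needs (a real run typically takes further flags not shown here).

```python
import shlex


def stringent_args(fastq, outfolder, k=25, w=None, mapped=0.9, aligned=0.9):
    """Build an NGSpeciesID argument list using the stringent ranges suggested above.

    Hypothetical helper: only --k, --w, --mapped_threshold and
    --aligned_threshold come from this thread; --fastq and --outfolder
    are assumed from the README.
    """
    if w is None:
        w = k + 20  # middle of the suggested k+10 .. k+30 range
    if not 20 <= k <= 30:
        raise ValueError("suggested range for --k is 20 to 30")
    if not k + 10 <= w <= k + 30:
        raise ValueError("suggested range for --w is k+10 to k+30")
    return [
        "NGSpeciesID",
        "--fastq", fastq,
        "--outfolder", outfolder,
        "--k", str(k),
        "--w", str(w),
        "--mapped_threshold", str(mapped),
        "--aligned_threshold", str(aligned),
    ]


cmd = stringent_args("reads.fastq", "out_stringent")
print(shlex.join(cmd))  # print the command for inspection before running it
```

The script only prints the command; it could be passed to `subprocess.run(cmd, check=True)` once the remaining options have been confirmed against the tool's help text.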

FYI, all of these parameters relate to the clustering:

  --k K                 Kmer size (default: 15)
  --w W                 Window size (default: 50)
  --min_shared MIN_SHARED
                        Minimum number of minimizers shared between read and cluster (default: 5)
  --mapped_threshold MAPPED_THRESHOLD
                        Minimum mapped fraction of read to be included in cluster. The density of minimizers to classify a region as mapped depends on quality of the read. (default: 0.7)
  --aligned_threshold ALIGNED_THRESHOLD
                        Minimum aligned fraction of read to be included in cluster. Aligned identity depends on the quality of the read. (default: 0.4)
  --min_fraction MIN_FRACTION
                        Minimum fraction of minimizers shared compared to best hit, in order to continue mapping. (default: 0.8)
  --min_prob_no_hits MIN_PROB_NO_HITS
                        Minimum probability for i consecutive minimizers to be different between read and representative and still considered as mapped region, under assumption that they come from the same transcript (depends on read quality). (default: 0.1)
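For readers unfamiliar with minimizers, the following Python sketch shows the general idea behind --k, --w and --min_shared. It is a textbook-style illustration, not isONclust's actual implementation: it assumes the window size is measured in bases (so each window holds w-k+1 k-mers), picks the lexicographically smallest k-mer per window, and ignores read quality entirely. The function names are hypothetical.

```python
def minimizers(seq, k, w):
    """Return sorted (position, kmer) pairs: the smallest k-mer in each window.

    Assumption: a window spans w bases, i.e. w - k + 1 consecutive k-mers.
    Ties are broken by taking the leftmost smallest k-mer.
    """
    if w < k:
        raise ValueError("window size w must be at least k")
    per_window = w - k + 1
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    picked = set()
    for start in range(len(kmers) - per_window + 1):
        window = kmers[start:start + per_window]
        offset = min(range(per_window), key=lambda j: window[j])
        picked.add((start + offset, window[offset]))
    return sorted(picked)


def shared_minimizers(a, b, k, w):
    """Count distinct minimizer k-mers that two sequences have in common."""
    mins_a = {kmer for _, kmer in minimizers(a, k, w)}
    mins_b = {kmer for _, kmer in minimizers(b, k, w)}
    return len(mins_a & mins_b)


print(minimizers("ACGTACGT", k=3, w=5))
# [(0, 'ACG'), (1, 'CGT'), (4, 'ACG')]
```

In this simplified picture, a single substitution changes every k-mer that overlaps it, so a larger k means each error can remove more matching minimizers, and a count like the one from `shared_minimizers` would be compared against a cutoff such as --min_shared.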

ksahlin Dec 07 '22 17:12