Interpretation of cutoff values

Open donovan-h-parks opened this issue 1 year ago • 2 comments

Hi,

Can you clarify what the cutoff values are for gc-content and tetra-freq and how these were established? My guess is that for gc-content the cutoff of 15.75 means that only contigs that deviate from the mean GC content by more than this value are flagged as contaminated. This seems like a very, very conservative value though (e.g., mean GC of 50% only flags contigs at <34.25% or >65.75%?).
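The interpretation guessed above can be sketched as follows. This only illustrates that guess and is not MAGpurify's actual implementation: the function names, the dictionary input, and the use of an unweighted mean across contigs are all assumptions made for the example.

```python
def gc_bounds(mean_gc, cutoff=15.75):
    """Return the (lower, upper) GC% bounds outside which a contig would be flagged."""
    return mean_gc - cutoff, mean_gc + cutoff


def flag_by_gc(contig_gc, cutoff=15.75):
    """Return the names of contigs whose GC% deviates from the mean by more than `cutoff`.

    `contig_gc` is a hypothetical mapping of contig name -> GC content in percent.
    The mean here is a simple unweighted mean across contigs (an assumption).
    """
    mean_gc = sum(contig_gc.values()) / len(contig_gc)
    return {name for name, gc in contig_gc.items() if abs(gc - mean_gc) > cutoff}


if __name__ == "__main__":
    # With a mean GC of 50%, only contigs below 34.25% or above 65.75% are flagged.
    print(gc_bounds(50.0))
    print(flag_by_gc({"c1": 50.0, "c2": 50.0, "c3": 50.0, "c4": 20.0}))
```

Under this reading, a contig has to sit more than 15.75 percentage points away from the genome's mean GC content before it is flagged, which is why the cutoff looks so conservative.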

I appreciate that the tetra-freq measure is more abstract, so I'm more interested in how the 0.06 default was established.

Thanks, Donovan

donovan-h-parks, Feb 27 '23 21:02

@apcamargo I'm quite interested to learn about these cutoffs as well. In v2, are these cutoffs the same or are they specific to each dataset?

adityabandla, Feb 15 '24 01:02

The cutoffs in v2 are based on a classification model that I trained on sets of simulated genomes. The cutoffs for v1 were decided based on the methodology described in this manuscript.

apcamargo, Feb 15 '24 16:02