MAGpurify
Interpretation of cutoff values
Hi,
Can you clarify what the cutoff values are for gc-content and tetra-freq, and how these were established? My guess is that for gc-content the cutoff of 15.75 means that only contigs that deviate from the mean GC content by more than this value are flagged as contaminated. This seems like a very, very conservative value though (e.g., a mean GC of 50% only flags contigs at <34.25% or >65.75%?).
I appreciate that the tetra-freq measure is more abstract, so I'm more interested in how the 0.06 default was established.
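To make my guess about gc-content concrete, here is a minimal sketch of the interpretation I have in mind. It is not MAGpurify's actual code: the function names are hypothetical, and using an unweighted mean of per-contig GC percentages is my own assumption (the tool may well weight by contig length or compute the reference GC differently).

```python
def gc_bounds(mean_gc, cutoff=15.75):
    """Return the (low, high) GC% range outside which a contig would be flagged.

    Assumes the cutoff is an absolute deviation in percentage points.
    """
    return (mean_gc - cutoff, mean_gc + cutoff)


def flag_gc_outliers(contig_gc, cutoff=15.75):
    """Return ids of contigs whose GC% deviates from the mean by more than cutoff.

    contig_gc maps a contig id to its GC content in percent.
    The mean is unweighted here, which is an assumption for illustration only.
    """
    mean_gc = sum(contig_gc.values()) / len(contig_gc)
    low, high = gc_bounds(mean_gc, cutoff)
    return [cid for cid, gc in contig_gc.items() if gc < low or gc > high]


if __name__ == "__main__":
    # With a mean GC of 50%, only contigs below 34.25% or above 65.75% are flagged.
    print(gc_bounds(50.0))
```

If this reading is right, a bin would need a contig more than 15.75 percentage points away from the mean before anything is flagged, which is why the default looks so conservative to me.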
Thanks, Donovan
@apcamargo I'm quite interested to learn about these cutoffs as well. In v2, are these cutoffs the same or are they specific to each dataset?
The cutoffs in v2 are based on a classification model that I trained on sets of simulated genomes. The cutoffs for v1 were decided based on the methodology described in this manuscript.