seqtk

cutN penalty to identify all Ns?

Open nikostr opened this issue 3 years ago • 0 comments

I'm interested in running cutN to identify all regions of Ns in my sequence. If I'm understanding the code correctly, a region of Ns is interrupted when the score becomes negative, where score = (number of Ns) - (number of non-Ns) * penalty. A penalty of zero gives a single region starting at the first N and running to the end of the sequence, and small penalties cause neighbouring regions of Ns to be merged, with the non-N sequence between them discarded. To get exact regions of Ns, the penalty therefore has to be larger than the number of contiguous Ns preceding the first non-N, so that a single non-N base is enough to drive the score negative. Am I understanding this correctly? Would it make sense to have a way of explicitly extracting all contiguous regions of Ns? This could perhaps be done by reserving certain penalty values (e.g. 0 or 1000000000), or by adding a flag to support this behavior.
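To make the request concrete, here is a minimal Python sketch of what "all contiguous regions of Ns" would mean. It is not seqtk code and does not reproduce cutN's scoring; the function names `n_runs` and `read_fasta` are hypothetical, and the output convention (0-based, half-open intervals, as in BED) is my own assumption.

```python
import re


def n_runs(seq, min_len=1):
    """Return 0-based, half-open (start, end) intervals of contiguous N/n runs.

    Hypothetical helper: runs shorter than min_len are dropped, and runs are
    never merged across non-N bases, so no penalty is involved.
    """
    return [
        (m.start(), m.end())
        for m in re.finditer(r"[Nn]+", seq)
        if m.end() - m.start() >= min_len
    ]


def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            parts = line[1:].split()
            name, chunks = (parts[0] if parts else ""), []
        elif line:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


if __name__ == "__main__":
    example = [">chr1 toy record", "ACNN", "NNGT", ">chr2", "NACN"]
    for name, seq in read_fasta(example):
        for start, end in n_runs(seq):
            # One BED-like line per run of Ns
            print(f"{name}\t{start}\t{end}")
```

With this definition every run is reported separately regardless of how long it is, which is the behavior I would hope a dedicated flag or reserved penalty value could give.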

nikostr, Jan 27 '22 13:01