Does rmdup remove by identical sequences or substring patterns?

Open sophiachen1 opened this issue 3 months ago • 4 comments

Hi, I have a protein FASTA file and used rmdup to remove duplicated sequences (I expected it to remove only completely identical sequences). However, in my output file I find that it also removed a few short sequences that are substrings of another sequence. Does rmdup also remove sequences by substring pattern?

Thanks, Sophia

sophiachen1, Oct 09 '25 20:10

Add --by-seq (-s); by default, rmdup compares records by ID rather than by sequence. See also the full usage: https://bioinf.shenwei.me/seqkit/usage/#rmdup

shenwei356, Oct 10 '25 01:10
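
For readers following the thread, the distinction being asked about can be sketched in Python. This is only an illustration of what removing duplicates by exact sequence means; it is not seqkit's implementation, and the function names `parse_fasta` and `dedup_by_exact_sequence` and the example records are made up for this sketch. Under exact matching, a short sequence that is merely a substring of a longer one would be expected to survive.

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def dedup_by_exact_sequence(records):
    """Keep the first record of each distinct sequence (whole-string match)."""
    seen = set()
    kept, removed = [], []
    for header, seq in records:
        if seq in seen:
            removed.append((header, seq))
        else:
            seen.add(seq)
            kept.append((header, seq))
    return kept, removed


# Hypothetical example: p2 duplicates p1 exactly; p3 is only a substring of p1.
example = """>p1
MKTAYIAKQR
>p2
MKTAYIAKQR
>p3
TAYIA
"""

kept, removed = dedup_by_exact_sequence(parse_fasta(example))
print([h for h, _ in kept])     # ['p1', 'p3']
print([h for h, _ in removed])  # ['p2']
```

Here p3 is kept because membership is tested against whole sequences, not substrings.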

Hi Wei, thanks for your quick response. I have added the -s option to my command and observed the removal of the short sequences. I am using the most recent version of seqkit.

sophiachen1, Oct 10 '25 01:10

Would you please send me the removed short sequence and the long one (or simply the whole file), either here or by email: [email protected]?

shenwei356, Oct 10 '25 02:10
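
One possible way to narrow down which records to send is sketched below in Python. It assumes record IDs are unique and that the input and output files have already been loaded as lists of (id, sequence) tuples; the function name `classify_removed` and the example data are hypothetical. For each record missing from the output, it reports whether the sequence is an exact duplicate of a kept sequence or only a substring of one.

```python
def classify_removed(input_records, output_records):
    """Label each record missing from the output (assumes unique IDs)."""
    kept_ids = {rid for rid, _ in output_records}
    kept_seqs = [seq for _, seq in output_records]
    kept_set = set(kept_seqs)

    report = []
    for rid, seq in input_records:
        if rid in kept_ids:
            continue  # record survived deduplication
        if seq in kept_set:
            reason = "exact duplicate"
        elif any(seq in other for other in kept_seqs):
            reason = "substring only"
        else:
            reason = "no match"
        report.append((rid, reason))
    return report


# Hypothetical data mimicking the situation described in this issue.
input_records = [("p1", "MKTAYIAKQR"), ("p2", "MKTAYIAKQR"), ("p3", "TAYIA")]
output_records = [("p1", "MKTAYIAKQR")]

report = classify_removed(input_records, output_records)
for rid, reason in report:
    print(rid, reason)
```

Records labelled "substring only" would be the ones worth attaching, together with the longer sequence that contains them.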

Any update?

shenwei356, Nov 19 '25 09:11