MMseqs2
Easy-linclust prints out representative sequences in duplicates
Expected Behavior
Here's the easy-linclust command I run: mmseqs easy-linclust ref.aa.fa.gz mmseq_clusters_50 tmp_dir --cov-mode 1 -c 0.8 --kmer-per-seq 80 --min-seq-id 0.5
In mmseq_clusters_50_rep_seq.fasta, I would expect every sequence and every sequence header to be unique.
Current Behavior
grep ">" mmseq_clusters_50_rep_seq.fasta | wc -l
7230144
grep ">" mmseq_clusters_50_rep_seq.fasta | sort | uniq | wc -l
7226281
The number of unique headers is lower than the number of representative sequences by 3,863 (7,230,144 vs. 7,226,281). It's not just the headers that are duplicated: the duplicated entries also carry the same sequence.
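To make the duplication easier to inspect, here is a minimal sketch of two helper functions. The names `dup_headers` and `dup_records` are hypothetical and are not part of MMseqs2. The first lists each repeated header with its count. The second counts records whose header and sequence both match an earlier record, joining wrapped sequence lines first. It assumes a plain FASTA file and keeps every record in memory, which may be heavy for a file of this size.

```shell
# List headers that occur more than once, prefixed by their count.
dup_headers() {
    grep '^>' "$1" | sort | uniq -c | awk '$1 > 1'
}

# Count surplus records: same header AND same sequence as an earlier record.
# Sequence lines are concatenated, so line wrapping does not affect the result.
dup_records() {
    awk '
        /^>/ { if (hdr != "") seen[hdr "\t" seq]++; hdr = $0; seq = ""; next }
             { seq = seq $0 }
        END  {
            if (hdr != "") seen[hdr "\t" seq]++
            n = 0
            for (k in seen) if (seen[k] > 1) n += seen[k] - 1
            print n
        }
    ' "$1"
}
```

Running `dup_headers mmseq_clusters_50_rep_seq.fasta | head` would show a few of the repeated headers, and `dup_records mmseq_clusters_50_rep_seq.fasta` would print how many records are exact repeats. If that number matches the gap between the two `wc -l` counts above, every repeated header also repeats its sequence.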
Sequence headers were generated with Prodigal and look like this: 1968UZ_k141_67882_1 # 76 # 315 # -1 # ID=11773_1;partial=00;start_type=ATG;rbs_motif=AAAA;rbs_spacer=7bp;gc_cont=0.425
Thank you very much for your help! Best, Mathieu