MMseqs2

Easy-linclust prints out representative sequences in duplicates

Open mgroussi opened this issue 2 years ago • 0 comments

Expected Behavior

Here's the easy-linclust command I ran:

mmseqs easy-linclust ref.aa.fa.gz mmseq_clusters_50 tmp_dir --cov-mode 1 -c 0.8 --kmer-per-seq 80 --min-seq-id 0.5

In mmseq_clusters_50_rep_seq.fasta, I would expect every sequence and every sequence header to be unique.

Current Behavior

grep ">" mmseq_clusters_50_rep_seq.fasta | wc -l
7230144

grep ">" mmseq_clusters_50_rep_seq.fasta | sort | uniq | wc -l
7226281

The number of unique headers (7226281) is lower than the number of representative sequences (7230144), a difference of 3863. It is not only the headers that are duplicated: the sequences under those headers are duplicated as well.
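To check whether records that share a header also share a sequence, something like the following minimal sketch could be used. The function names `read_fasta` and `duplicated_records` and the three-record example are made up for illustration; they are not part of MMseqs2.

```python
from collections import defaultdict


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)


def duplicated_records(lines):
    """Map each header seen more than once to the list of its sequences."""
    seen = defaultdict(list)
    for header, seq in read_fasta(lines):
        seen[header].append(seq)
    return {h: seqs for h, seqs in seen.items() if len(seqs) > 1}


# Made-up input: the first and third records share a header and a sequence.
example = [
    ">seqA # 1 # 9 # 1 # ID=1_1\n", "MKT\n",
    ">seqB # 1 # 9 # 1 # ID=2_1\n", "MKV\n",
    ">seqA # 1 # 9 # 1 # ID=1_1\n", "MKT\n",
]

dups = duplicated_records(example)
for header, seqs in dups.items():
    verdict = "identical" if len(set(seqs)) == 1 else "different"
    print(header, len(seqs), verdict)
```

For the real output, the same function can be fed an open file handle, for example `with open("mmseq_clusters_50_rep_seq.fasta") as fh: dups = duplicated_records(fh)`. This keeps every sequence in memory, which is an assumption that may not hold for very large files.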

Sequence headers were generated with Prodigal and look like this:

1968UZ_k141_67882_1 # 76 # 315 # -1 # ID=11773_1;partial=00;start_type=ATG;rbs_motif=AAAA;rbs_spacer=7bp;gc_cont=0.425
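For reference, a header of this shape can be split into its parts as in the rough sketch below. It assumes the fields are separated by " # " and the trailing attributes by ";", as in the example header. `parse_prodigal_header` and the output field names are hypothetical and not taken from Prodigal or MMseqs2.

```python
def parse_prodigal_header(header):
    """Split a Prodigal-style header into its ' # '-separated fields."""
    # Assumed layout: id # start # end # strand # key=value;key=value;...
    seq_id, start, end, strand, attrs = header.lstrip(">").split(" # ", 4)
    attributes = dict(
        item.split("=", 1) for item in attrs.strip().split(";") if item
    )
    return {
        "id": seq_id,
        "start": int(start),
        "end": int(end),
        "strand": int(strand),
        "attributes": attributes,
    }


parsed = parse_prodigal_header(
    "1968UZ_k141_67882_1 # 76 # 315 # -1 # ID=11773_1;partial=00;"
    "start_type=ATG;rbs_motif=AAAA;rbs_spacer=7bp;gc_cont=0.425"
)
print(parsed["id"], parsed["start"], parsed["end"], parsed["strand"])
```

Grouping records by `parsed["id"]` instead of by the full header line is one way to see whether the duplicates agree on the first token only or on the whole header.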

Thank you very much for your help! Best, Mathieu

mgroussi, Apr 15 '22 07:04