MMseqs2 icon indicating copy to clipboard operation
MMseqs2 copied to clipboard

filter clusters before MSA

Open fmaumusINRA opened this issue 3 years ago • 6 comments

Dear Soeding lab members, After sequence clustering, I am getting some very large clusters. I would like to obtain MSA for each cluster within reasonable computational time so I am trying to limit the size of each cluster before launching MSA. So I attempted to filter on tsv file and then to use tsv2db to reproduce a cluster db (while keeping groups with at least 5 sequences). I used the following commands. However the MSA only contains a single sequence repeated many many times suggesting that input is wrong. Do you have any suggestion to select only N members of a cluster for input to MSA? To select sequences in a meaningful way, Would there also be any option to select the most diverse members?

mmseqs easy-cluster --max-seq-len 15000 -c 0.5 FILE FILE.res /tmp/tmp_mmseqs/ awk '{ print $0 "\t" ++count[$1] }' FILE.res_cluster.tsv |awk '($3<50) {print $1 "\t" $2}' > FILE.res_cluster.tsv.max50 mmseqs tsv2db FILE.res_cluster.tsv.max50 FILE.res_cluster.tsv.max50.clus --output-dbtype 6 mmseqs createdb FILE FILE.db mmseqs createseqfiledb FILE.db FILE.res_cluster.tsv.max50.clus FILE.res_cluster.tsv.max50.clus.min5 --min-sequences 5 mmseqs apply FILE.res_cluster.tsv.max50.clus.min5 FILE.res_cluster.tsv.max50.clus.min5.msa -- mafft --adjustdirection -

Thanks a lot for your help (again)!

fmaumusINRA avatar May 24 '21 10:05 fmaumusINRA

We have the filterresult module since recently, which implements the HHblits alignment filtering algorithm, for slimming down MSAs by pairwise alignments. Here you want the --diff parameter to keep only the N most diverse sequences.

However you should stick to the non easy- modules to use this. E.g. something like:

mmseqs createdb FILE db
mmseqs cluster db clu tmp ...
mmseqs filterresult db db clu clu_filt --diff 100
mmseqs createseqfiledb db clu_filt fasDB
mmseqs apply fasDB mafftDB -- mafft ...

(This is assuming you are clustering proteins, not quite sure if filterresult works for nucleotide sequences).

milot-mirdita avatar May 27 '21 14:05 milot-mirdita

Thanks for your reply, I have tried the following commands but got "Segmentation fault" right away whenn launching filterresult. Any idea what could cause this?

mmseqs createdb FASTANUCLFILE FASTANUCLFILE.db mmseqs cluster FASTANUCLFILE.db FASTANUCLFILE.db.clu /tmp/tmp_mmseqs/ -c 0.5 -s 6 mmseqs filterresult FASTANUCLFILE.db FASTANUCLFILE.db FASTANUCLFILE.db.clu FASTANUCLFILE.db.clu_filt --diff 100 MMseqs Version: fb39ca1ee88313974f285740faa856ea68509193 Substitution matrix nucl:nucleotide.out,aa:blosum62.out Gap open cost nucl:5,aa:11 Gap extension cost nucl:2,aa:1 Compositional bias 1 Allow deletions false Maximum seq. id. threshold 0.9 Minimum seq. id. 0 Minimum score per column -20 Minimum coverage 0 Select N most diverse seqs 100 Threads 32 Compressed 0 Verbosity 3

Query database size: 19948 type: Nucleotide Target database size: 19948 type: Nucleotide Segmentation fault ] 0.00% 1 eta -

fmaumusINRA avatar May 27 '21 19:05 fmaumusINRA

I suspected that you are doing this based on a nucleotide search. I spend a few minutes trying to fix filterresult to support nucleotide input, but it's not that easy. I'll have to think about what to do.

milot-mirdita avatar May 28 '21 13:05 milot-mirdita

Thanks a lot, it's very kind of you to spare time trying to make this possible. Yea, nucleotides again! ;)

fmaumusINRA avatar May 28 '21 13:05 fmaumusINRA

Hi, I was wondering if by chance you had any opportunity to progress on that. I am still interested! Thanks

fmaumusINRA avatar Jun 18 '21 19:06 fmaumusINRA

We are currently focusing on getting (protein) profile-profile searches ready, which is taking most of our development time. I hope we can try to improve nucleotide searches after that, but it's still a longer term goal right now.

I don't really have any better suggestions, except try looking for a faster MSA tool. Any MMseqs2 based solution I can think of right now would only work for protein sequences.

milot-mirdita avatar Jun 24 '21 13:06 milot-mirdita