MMseqs2
MMseqs2 copied to clipboard
filter clusters before MSA
Dear Soeding lab members, After sequence clustering, I am getting some very large clusters. I would like to obtain MSA for each cluster within reasonable computational time so I am trying to limit the size of each cluster before launching MSA. So I attempted to filter on tsv file and then to use tsv2db to reproduce a cluster db (while keeping groups with at least 5 sequences). I used the following commands. However the MSA only contains a single sequence repeated many many times suggesting that input is wrong. Do you have any suggestion to select only N members of a cluster for input to MSA? To select sequences in a meaningful way, Would there also be any option to select the most diverse members?
mmseqs easy-cluster --max-seq-len 15000 -c 0.5 FILE FILE.res /tmp/tmp_mmseqs/ awk '{ print $0 "\t" ++count[$1] }' FILE.res_cluster.tsv |awk '($3<50) {print $1 "\t" $2}' > FILE.res_cluster.tsv.max50 mmseqs tsv2db FILE.res_cluster.tsv.max50 FILE.res_cluster.tsv.max50.clus --output-dbtype 6 mmseqs createdb FILE FILE.db mmseqs createseqfiledb FILE.db FILE.res_cluster.tsv.max50.clus FILE.res_cluster.tsv.max50.clus.min5 --min-sequences 5 mmseqs apply FILE.res_cluster.tsv.max50.clus.min5 FILE.res_cluster.tsv.max50.clus.min5.msa -- mafft --adjustdirection -
Thanks a lot for your help (again)!
We have the filterresult
module since recently, which implements the HHblits alignment filtering algorithm, for slimming down MSAs by pairwise alignments. Here you want the --diff
parameter to keep only the N most diverse sequences.
However you should stick to the non easy-
modules to use this.
E.g. something like:
mmseqs createdb FILE db
mmseqs cluster db clu tmp ...
mmseqs filterresult db db clu clu_filt --diff 100
mmseqs createseqfiledb db clu_filt fasDB
mmseqs apply fasDB mafftDB -- mafft ...
(This is assuming you are clustering proteins, not quite sure if filterresult
works for nucleotide sequences).
Thanks for your reply, I have tried the following commands but got "Segmentation fault" right away whenn launching filterresult. Any idea what could cause this?
mmseqs createdb FASTANUCLFILE FASTANUCLFILE.db mmseqs cluster FASTANUCLFILE.db FASTANUCLFILE.db.clu /tmp/tmp_mmseqs/ -c 0.5 -s 6 mmseqs filterresult FASTANUCLFILE.db FASTANUCLFILE.db FASTANUCLFILE.db.clu FASTANUCLFILE.db.clu_filt --diff 100 MMseqs Version: fb39ca1ee88313974f285740faa856ea68509193 Substitution matrix nucl:nucleotide.out,aa:blosum62.out Gap open cost nucl:5,aa:11 Gap extension cost nucl:2,aa:1 Compositional bias 1 Allow deletions false Maximum seq. id. threshold 0.9 Minimum seq. id. 0 Minimum score per column -20 Minimum coverage 0 Select N most diverse seqs 100 Threads 32 Compressed 0 Verbosity 3
Query database size: 19948 type: Nucleotide Target database size: 19948 type: Nucleotide Segmentation fault ] 0.00% 1 eta -
I suspected that you are doing this based on a nucleotide search. I spend a few minutes trying to fix filterresult
to support nucleotide input, but it's not that easy. I'll have to think about what to do.
Thanks a lot, it's very kind of you to spare time trying to make this possible. Yea, nucleotides again! ;)
Hi, I was wondering if by chance you had any opportunity to progress on that. I am still interested! Thanks
We are currently focusing on getting (protein) profile-profile searches ready, which is taking most of our development time. I hope we can try to improve nucleotide searches after that, but it's still a longer term goal right now.
I don't really have any better suggestions, except try looking for a faster MSA tool. Any MMseqs2 based solution I can think of right now would only work for protein sequences.