MMseqs2 too few hits

too few hits

Open fmaumusINRA opened this issue 3 years ago • 9 comments

Dear Soeding lab members, I was trying to use mmseqs for tblastn-like analysis. With a given aa query db and nt targetdb, I have results of blastall and blast+ giving approx 11k and 14k hits (evalue 0.001), respectively. Aiming at speeding this process, I wanted to use easy-search. At first, I was getting only about 500-700 hits with default settings and varying sensitivity. I then tried out the "--max-seqs" option --max-seqs INT Maximum results per query sequence allowed to pass the prefilter (affects sensitivity) [0.000] It seems that by default the max seq number is set to 300. When I switched this to --max-seqs 100000, I am now getting 2.5K hits. mmseqs easy-search prt.fa genome.fa out /tmp/ -s 7 --max-seqs 100000 That's much better but still far from blast. Would you have any suggestion to address this discrepancy?

Thanks a lot for your help !

May 17 '21 19:05 fmaumusINRA

We don't (/can't) generate similar k-mers for nucleotide searches, so the sensitivity parameter doesn't really affect anything in this case. You could try reducing the k-mer size a bit, that might result in more hits getting passed through the prefilter.

May 18 '21 12:05 milot-mirdita

Thanks a lot for your response. I tried reducing with "-k 5" which improved a bit and then got an error when trying "-k 4". But isn't it doing a BLASTP against translated ORFs anyhow?

May 18 '21 12:05 fmaumusINRA

Sorry you are right, I misread the initial thread. Maybe try reducing the --min-length of ORFs extracted from the genome the default 30aa (90 nucl) is quite long, that might be one issue. Could you share the two FASTA files, then I can take a look.

May 18 '21 12:05 milot-mirdita

My query is a reverse transcriptase (RT) domain. For instance

Athila41 LDAGVIYPISDSTWVSPVHCVPKKGGMTVVKNEKDELIPTRTITGHRMCIDYRKLNAASRKDHFPLPFIDQMLERLANHPYYCFLDGYSGFFQIPIHPNDQEKTTFTCPYGTFAYKRMPFGLCNAPATFQRCMTSIFSDLIEEMVEVFMDDFSVYGPSFSSCLLNLGRVLTRCEETNLVLNWEKCHFMVKEGIVLDHKISEKGIEVDKGKVEVMMQLQPPKTVKDIRSFLGHAGFYRRFIKDFA It is about 240 aa so I wonder if --min-length will help.

My target is a plant genome. The genome is too big to share (1Gb). It is from here https://www.ncbi.nlm.nih.gov/Traces/wgs/?val=APLD01. Thanks a lot for your attempts and suggestions!!!

May 18 '21 13:05 fmaumusINRA

--min-length would affect the genome in this case.

May 18 '21 13:05 milot-mirdita

I get over 37k hits, but I also split the genome db creation from the search:

mmseqs createdb APLD01.1.fsa_nt.gz APLD01.2.fsa_nt.gz ntdb
mmseqs easy-search asd.fa ntdb res.m8 tmp -s 7 --max-seqs 30000000

--min-length is not changing much though.

May 18 '21 13:05 milot-mirdita

OK, maybe I over simplified the question because I was actually counting merged hits with a minimum size of 520 nt (empirical, a bit less than RT size). I indeed get 37061 hits with mmseqs, 34268 remain after merging overlaping hits, and 1099 merged hits pass the size filter. With blast+, I get 8159 hits, 6262 merged, and 2947 merged hits pass the size filter. So yes, I get more raw hits with mmseqs but they are short :( Sorry for misleading at first !

May 18 '21 13:05 fmaumusINRA

The ORF extraction step might be a bit too naive to deal well with a plant. I guess BLAST is able to extract longer fragments for some reason, but I don't know how.

May 18 '21 14:05 milot-mirdita

Thanks a lot for your help, see you soon with another question, I am exploring still

May 18 '21 16:05 fmaumusINRA

MMseqs2 MMseqs2 copied to clipboard

too few hits

MMseqs2
MMseqs2 copied to clipboard