diamond --frameshift parameter value

Hey,

I am currently investigating how best to use diamond with long, error-prone reads. I am aware of the --long-reads option, but I wonder why the recommended value of -F is 15.

I ran a small analysis in which I took some RefSeq assemblies and cut them up into long reads. I then copied the reads and introduced errors so that this second read set resembles nanopore reads in terms of substitution, insertion and deletion rates (11% overall error rate). Mapping the reads to a reference database with diamond blastx using different frameshift parameters, I get the following results:

[Figure: diamond_frameshift]

Could you explain why -F 15 is recommended instead of -F 1? It seems like I am losing most hits using -F 15.
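
For reference, the read simulation described above could be sketched roughly as follows in Python. This is a minimal illustration, not the script used for the plot: the function names, the read length, and the split of the 11% error rate into substitutions, insertions and deletions are all assumptions.

```python
import random

BASES = "ACGT"

def chop_into_reads(genome, read_length):
    """Cut a sequence into consecutive, non-overlapping reads."""
    return [genome[i:i + read_length] for i in range(0, len(genome), read_length)]

def add_errors(read, sub_rate, ins_rate, del_rate, rng):
    """Return a noisy copy of `read` with per-base substitutions,
    insertions and deletions drawn at the given rates."""
    out = []
    for base in read:
        r = rng.random()
        if r < del_rate:
            pass  # deletion: drop this base
        elif r < del_rate + sub_rate:
            # substitution: replace with a different base
            out.append(rng.choice([b for b in BASES if b != base]))
        else:
            out.append(base)
        if rng.random() < ins_rate:
            # insertion: add a random base after this position
            out.append(rng.choice(BASES))
    return "".join(out)

# Hypothetical example: a random 10 kb "assembly" cut into 2 kb reads.
rng = random.Random(42)
genome = "".join(rng.choice(BASES) for _ in range(10_000))
clean_reads = chop_into_reads(genome, 2_000)

# Assumed split that adds up to roughly 11% overall error.
noisy_reads = [add_errors(r, sub_rate=0.04, ins_rate=0.03, del_rate=0.04, rng=rng)
               for r in clean_reads]
```

Insertions and deletions are what shift the reading frame, so their share of the total error rate is likely the part that interacts most with -F.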

Best, Stefan

Jan 27 '22 13:01 EbmeyerSt

I must admit that I never looked into this. The frameshift feature was modeled after the LAST aligner, and this is the penalty its author recommends. This is certainly an interesting observation and worth further study!

Jan 31 '22 10:01 bbuchfink

Ok, thank you for your reply! I will use -F 1 then until I read something opposing that. Thank you for developing and maintaining diamond, I use it a lot and it's incredibly helpful!

Jan 31 '22 10:01 EbmeyerSt

I have looked into this a bit more, as I thought my choice of --id, coupled with the high error rate, might have affected the results. So I redid the analysis with different percent identity cutoffs to the reference, and with a set of simulated reads with a lower error rate.

80% identity: [Figure: diamond_frameshift80]

70% identity: [Figure: diamond_frameshift70]

60% identity: [Figure: diamond_frameshift60]

So it seems that the percent identity cutoff, in combination with the error rate, influences the number of hits quite strongly: at high error rates, the higher the frameshift penalty, the fewer hits I get. At identity cutoffs above 70%, I will opt for -F 1 for long reads with higher error rates.
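
A comparison like the one above could be scripted roughly as follows. This is only a sketch: the file names are hypothetical, and it assumes (from my reading of the diamond documentation) that --long-reads is shorthand for --range-culling --top 10 -F 15, so those flags are spelled out here in order to vary -F on its own.

```python
import shlex

def diamond_blastx_cmd(query, db, out, frameshift, min_identity):
    """Build a diamond blastx command line for one -F / --id combination."""
    return [
        "diamond", "blastx",
        "--query", query,
        "--db", db,
        "--out", out,
        "--outfmt", "6",                    # BLAST tabular output
        "--frameshift", str(frameshift),    # same as -F
        "--range-culling", "--top", "10",   # assumed remainder of --long-reads
        "--id", str(min_identity),
    ]

def count_hits(lines):
    """Count alignments in tabular output (one non-empty line per hit)."""
    return sum(1 for line in lines if line.strip())

# Hypothetical sweep over the settings discussed in this thread.
for identity in (60, 70, 80):
    for frameshift in (1, 15):
        out = f"hits_F{frameshift}_id{identity}.tsv"
        cmd = diamond_blastx_cmd("noisy_reads.fasta", "reference.dmnd",
                                 out, frameshift, identity)
        print(shlex.join(cmd))
```

Each printed command can be run in a shell or passed to subprocess.run, and the resulting files compared with count_hits(open(path)).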

Feb 01 '22 13:02 EbmeyerSt