
difficult genome with repeated genes fails to self-align

Open stubrown opened this issue 2 years ago • 1 comments

We have found an edge case that causes a lot of trouble for Diamond. We are aligning many complete protein sets from different organisms in all-vs-all Diamond jobs, because we need to find both orthologs across genomes and paralogs within genomes. Trichomonas vaginalis is a small eukaryote with an absurdly large number of duplicated genes in its genome (70K genes, with more than 100 copies of some). This is real: it has been tested in many different ways on many strains.

The Diamond blastp job for the self-blast of all T. vaginalis proteins takes nearly forever and typically crashes on our cluster. A Diamond job with a different genome as query, aligned against the same T. vaginalis database, has no such problems and finishes in just a few seconds. It is clearly the multiple hits per query (and per seed) that are causing the trouble.

I have experimented with many different parameters. The only one that seems to make a big impact on the compute time, and that lets the job run to completion without error, is the 'shapes' parameter. If I set it to '--shapes 4', the job finishes in about 15 minutes with a total of 1,936,910 alignments. However, when experimenting with simpler comparisons in different genomes, I see that reducing the shapes parameter does in fact reduce the number of alignments discovered, i.e. the search is less sensitive.
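One way to put a number on that sensitivity loss is to compare the (query, subject) pairs reported by two runs. A minimal Python sketch, assuming both runs wrote tab-separated output in BLAST tabular format 6 (query ID in column 1, subject ID in column 2); the function names and the toy records are made up for illustration and are not real T. vaginalis results:

```python
import io


def hit_pairs(handle):
    """Collect (query, subject) pairs from tab-separated alignment output."""
    pairs = set()
    for line in handle:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        pairs.add((fields[0], fields[1]))
    return pairs


def sensitivity_loss(full_pairs, reduced_pairs):
    """Return the pairs missing from the reduced run and their fraction of the full run."""
    lost = full_pairs - reduced_pairs
    fraction = len(lost) / len(full_pairs) if full_pairs else 0.0
    return lost, fraction


# Toy records standing in for two output files (hypothetical data).
full_run = "\n".join([
    "g1\tg1\t100.0",
    "g1\tg2\t95.0",
    "g1\tg3\t40.0",
    "g2\tg1\t95.0",
])
reduced_run = "\n".join([
    "g1\tg1\t100.0",
    "g1\tg2\t95.0",
    "g2\tg1\t95.0",
])

full_pairs = hit_pairs(io.StringIO(full_run))
reduced_pairs = hit_pairs(io.StringIO(reduced_run))
lost, fraction = sensitivity_loss(full_pairs, reduced_pairs)
print(sorted(lost), fraction)
```

With real files, the `io.StringIO` objects would be replaced by `open(path)` handles for the default-shapes run and the '--shapes 4' run of the same comparison.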

Is there another way to get through this messy T. vaginalis self-blast situation with Diamond that retains more of the sensitivity?

stubrown, Nov 01 '23 22:11

I'm afraid there's no good solution for this; it's a known issue. You can try --freq-masking, which is described here: https://github.com/bbuchfink/diamond/wiki/Advanced-options
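As a rough sketch of how such a run could be scripted, the Python snippet below only assembles the command line and prints it; it does not run Diamond. The helper name and the file names (`tvag`, `tvag.faa`, `tvag_self.tsv`) are hypothetical, and it assumes a database was already built from the same protein file with `diamond makedb`. Whether --freq-masking alone is enough for this genome is untested here.

```python
import shlex


def diamond_self_blastp_cmd(db, query, out, shapes=None, freq_masking=False, threads=None):
    """Assemble a `diamond blastp` argument list for a self-alignment (hypothetical helper)."""
    cmd = ["diamond", "blastp", "--db", db, "--query", query, "--out", out]
    if shapes is not None:
        # Fewer seed shapes run faster but find fewer alignments.
        cmd += ["--shapes", str(shapes)]
    if freq_masking:
        # Option suggested in this thread; see the Advanced options wiki page.
        cmd.append("--freq-masking")
    if threads is not None:
        cmd += ["--threads", str(threads)]
    return cmd


# Hypothetical file names: the same protein set serves as database and query.
cmd = diamond_self_blastp_cmd("tvag", "tvag.faa", "tvag_self.tsv", freq_masking=True)
print(shlex.join(cmd))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` once the paths point at real files.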

If you only need clustering, I would recommend using the clustering workflow instead of search.

bbuchfink, Nov 07 '23 11:11