Improve Performance: Currently Highly Resource-Intensive

Open VerisimilitudeX opened this issue 3 years ago • 1 comments

From @LimesKey:

After some more testing, one drawback of the program is that it's incredibly resource-intensive. On a high-performance computer with an 8-core CPU, 16GB of physical RAM, and over 150GB of virtual memory on an NVMe SSD, it used 140GB of combined physical and virtual memory and about 80% CPU utilization.

This was only for creating a database from a 3.2GB sequenced DNA file (uncompressed). I'm not sure whether we can provide the created database file to users who download the program, or whether they have to create it themselves for use with their FASTA file (DNA file). There's probably a way to make it not use virtual memory and RAM and to limit it to 20% of the CPU cores, but even then it would take at least half an hour, and the minimum requirement would be a 200GB+ SSD.

In the real world, though, we can trim a lot of data from the user's input file so that the whole file doesn't have to be scanned and compared, which would make the run much shorter.

VerisimilitudeX avatar Oct 21 '22 15:10 VerisimilitudeX

There are certainly many ways to make the DIAMOND program a lot less resource-intensive, such as setting the memory limit to only 10% of available RAM and storing data that would otherwise occupy physical RAM in a virtual memory file, similar to how page files work but outside the page file itself. I haven't found the program to be too CPU-intensive (about 60% of my 8 logical cores), and I know there's built-in functionality to limit cores and threads.

I do want this program to be runnable on a $300 laptop in under 20 minutes; that's the main goal. If it's too difficult to implement (I'll do some more research myself), we/I can choose another program.

LimesKey avatar Oct 21 '22 16:10 LimesKey
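For anyone wanting to turn the "10% of available RAM" idea into concrete settings, here is a rough sketch. It assumes the rule of thumb from the DIAMOND documentation that memory use is roughly six times the `--block-size` value in GB; the function name `diamond_limits` and its parameters are made up for illustration and are not part of DIAMOND or DNAnalyzer.

```python
import os

# Rule of thumb from the DIAMOND documentation: memory use in GB is roughly
# six times the --block-size value. Treat this as an estimate, not a guarantee.
MEMORY_PER_BLOCK_UNIT_GB = 6.0


def diamond_limits(total_ram_gb, ram_fraction=0.10, cpu_fraction=0.5, cpu_count=None):
    """Return a (block_size, threads) pair for a given RAM and CPU budget.

    Hypothetical helper: the names and default fractions are assumptions.
    """
    if cpu_count is None:
        cpu_count = os.cpu_count() or 1
    budget_gb = total_ram_gb * ram_fraction
    block_size = round(budget_gb / MEMORY_PER_BLOCK_UNIT_GB, 2)
    threads = max(1, int(cpu_count * cpu_fraction))
    return block_size, threads


if __name__ == "__main__":
    # 16GB machine with 8 logical cores, as in the report above.
    print(diamond_limits(16, cpu_count=8))
```

The two values would then be passed to DIAMOND as `--block-size` and `--threads`. A very small block size trades memory for run time, so the budget probably needs tuning on real hardware.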

Not planning to use DIAMOND anymore.

LimesKey avatar Nov 02 '22 05:11 LimesKey

This feature would still be implemented by removing the headers from the FASTA files the user provides.

VerisimilitudeX avatar Nov 02 '22 15:11 VerisimilitudeX
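A minimal sketch of what stripping headers from a FASTA file could look like, assuming the standard convention that header lines start with `>`. The function names are hypothetical and this is not the project's actual implementation.

```python
def strip_fasta_headers(lines):
    """Yield only sequence lines, dropping '>' header lines and blank lines."""
    for line in lines:
        line = line.strip()
        if line and not line.startswith(">"):
            yield line


def strip_fasta_file(src_path, dst_path):
    """Copy a FASTA file to dst_path without its header lines (streaming)."""
    with open(src_path) as src, open(dst_path, "w") as dst:
        for sequence_line in strip_fasta_headers(src):
            dst.write(sequence_line + "\n")


if __name__ == "__main__":
    sample = [">chr1 example header\n", "ACGT\n", "\n", "TTGA\n"]
    print(list(strip_fasta_headers(sample)))
```

Note that dropping headers also drops the boundaries between records, so this only makes sense if the downstream step does not need to know which sequence a line came from.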

I am working on this: cutting down and processing the user's input file and converting it to .fab and .fabc (compressed) for the database files. I'm just waiting for a sample file export from 23andMe to figure out which parts the program will not use.

LimesKey avatar Nov 02 '22 22:11 LimesKey

Sounds good. Can't you use any FASTA file, though, not just one from 23andMe?

VerisimilitudeX avatar Nov 03 '22 01:11 VerisimilitudeX

I want to use the most common input files as a template. I thought the 23andMe file would be the most commonly used, so targeting it would let the program be better optimized for these types of files.

LimesKey avatar Nov 03 '22 01:11 LimesKey

They're mostly the same. We shouldn't depend on the style of one company in particular. You should use data from the Human Genome Project, since the data we provide under assets/dna/* is in that format; we can plan to accommodate other formats later.

VerisimilitudeX avatar Nov 03 '22 01:11 VerisimilitudeX

@Verisimilitude11 will be creating an issue in the DIAMOND repo to discuss this; hopefully we can solve it.

LimesKey avatar Nov 07 '22 06:11 LimesKey

If we can shrink both the input file and the database file, there will be a massive improvement in DIAMOND's performance. If we can also tweak DIAMOND so that it reads and analyzes a binary file, that would be great.

LimesKey avatar Nov 07 '22 06:11 LimesKey
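To illustrate how much a binary encoding could shrink a nucleotide file, here is a sketch of 2-bit packing, which stores four bases per byte. This is an assumed encoding chosen for illustration only: the thread does not specify the .fab format, and nothing here implies DIAMOND can read such a file. The function names are hypothetical.

```python
# Hypothetical 2-bit codes; any fixed assignment of the four bases would work.
_CODES = {"A": 0, "C": 1, "G": 2, "T": 3}
_BASES = "ACGT"


def pack_bases(seq):
    """Pack an A/C/G/T string into bytes, four bases per byte."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | _CODES[base]
        # Left-align a short final chunk so decoding can always read 4 slots.
        byte <<= 2 * (4 - len(chunk))
        out.append(byte)
    return bytes(out)


def unpack_bases(data, length):
    """Decode bytes from pack_bases back into a string of `length` bases."""
    bases = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            bases.append(_BASES[(byte >> shift) & 3])
    return "".join(bases[:length])


if __name__ == "__main__":
    packed = pack_bases("ACGTAC")
    print(len(packed), unpack_bases(packed, 6))
```

The original length has to be stored alongside the packed bytes, because the last byte may contain padding. Real files also contain `N` and other ambiguity codes, which this sketch rejects with a `KeyError`; a practical format would need a way to represent them.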

I did a little bit of tweaking and found that ./diamond blastx -q Brassica_cretica.gz --db plant-proteins.dmnd -o out.tsv --block-size 0.6 --threads 3 --fast uses about 50% of my CPU, and 3GB of RAM.

LimesKey avatar Jun 09 '23 12:06 LimesKey

> I did a little bit of tweaking and found that ./diamond blastx -q Brassica_cretica.gz --db plant-proteins.dmnd -o out.tsv --block-size 0.6 --threads 3 --fast uses about 50% of my CPU, and 3GB of RAM.

That's pretty good actually. It's efficient enough to run on GitHub Codespaces, which is something I'm aiming for.

VerisimilitudeX avatar Jun 09 '23 16:06 VerisimilitudeX