DRAM
Is it possible to resume the database preparation step?
I am running database setup without KEGG on a university cluster with 1Tb of memory. The process has been running for over 15 hours, it seems new files are still being produced. This is not consistent with the documentation that the database prep step takes up to 5 hours with 512Gb of memory. So I am wondering if there is anything unexpected here?
Here is a list of files currently in my dram_db directory
CAZyDB.07302020.fam-activities.txt Pfam-A.hmm.dat.gz amg_database.20210622.tsv database_files dbCAN-HMMdb-V9.txt dbCAN-HMMdb-V9.txt.h3f dbCAN-HMMdb-V9.txt.h3i dbCAN-HMMdb-V9.txt.h3m dbCAN-HMMdb-V9.txt.h3p description_db.sqlite description_db.sqlite-journal etc_mdoule_database.20210622.tsv function_heatmap_form.20210622.tsv genome_summary_form.20210622.tsv kofam_ko_list.tsv kofam_profiles.hmm kofam_profiles.hmm.h3f kofam_profiles.hmm.h3i kofam_profiles.hmm.h3m kofam_profiles.hmm.h3p module_step_form.20210622.tsv peptidases.20210622.mmsdb peptidases.20210622.mmsdb.dbtype peptidases.20210622.mmsdb.idx peptidases.20210622.mmsdb.idx.dbtype peptidases.20210622.mmsdb.idx.index peptidases.20210622.mmsdb.index peptidases.20210622.mmsdb.lookup peptidases.20210622.mmsdb.source peptidases.20210622.mmsdb_h peptidases.20210622.mmsdb_h.dbtype peptidases.20210622.mmsdb_h.index pfam.mmspro pfam.mmspro.dbtype pfam.mmspro.idx pfam.mmspro.idx.dbtype pfam.mmspro.idx.index pfam.mmspro.index pfam.mmspro_h pfam.mmspro_h.dbtype pfam.mmspro_h.index refseq_viral.20210622.mmsdb refseq_viral.20210622.mmsdb.dbtype refseq_viral.20210622.mmsdb.idx refseq_viral.20210622.mmsdb.idx.dbtype refseq_viral.20210622.mmsdb.idx.index refseq_viral.20210622.mmsdb.index refseq_viral.20210622.mmsdb.lookup refseq_viral.20210622.mmsdb.source refseq_viral.20210622.mmsdb_h refseq_viral.20210622.mmsdb_h.dbtype refseq_viral.20210622.mmsdb_h.index uniref90.20210622.mmsdb uniref90.20210622.mmsdb.dbtype uniref90.20210622.mmsdb.idx uniref90.20210622.mmsdb.idx.dbtype uniref90.20210622.mmsdb.idx.index uniref90.20210622.mmsdb.index uniref90.20210622.mmsdb.lookup uniref90.20210622.mmsdb.source uniref90.20210622.mmsdb_h uniref90.20210622.mmsdb_h.dbtype uniref90.20210622.mmsdb_h.index vog_annotations_latest.tsv.gz vog_latest_hmms.txt vog_latest_hmms.txt.h3f vog_latest_hmms.txt.h3i vog_latest_hmms.txt.h3m vog_latest_hmms.txt.h3p
Any insight on how much longer it is going to take? I assigned 24 hours to this job, but I am not sure it is going to finish on time. In case the job runs out of time, is it possible to resume this process so I don't have to start all over again … (which I had done multiple times…. it is a pain….)
Thank you!
Rui
Just an update: database preparation didn't finish within 24 hours. It seems the only file that keeps getting bigger is one called "description_db.sqlite", currently 14.85 GB in size. I suspect the long running time is related to the creation of this file. I am wondering how big this file is supposed to be?
Hi Rui,
I am also facing the same scenario. It has been more than 12 hours for me, and currently the only file being written is description_db.sqlite. Its current size is 15.6 GB. I don't know what size it is supposed to be.
I think that adding UniRef to the database is what is causing the increase in time. Earlier I built the database without UniRef, and it completed within a few hours.
Ankit
Hi Ankit,
Thanks for the info! I am gonna give it a shot without UniRef then. I created another job on the cluster and assigned 60 hours; I will let you know if it ever finishes.
Best, Rui
Thanks for helping out @ankit4035. Building the description database is often the longest step of setup. It is very dependent on your disk write speed, and the final size is around 20 GB.
You can check whether the rest of the database locations were set by running DRAM-setup.py print_config. If they are set, then you can run DRAM-setup.py update_description_db. If they are not, then you can use the DRAM-setup.py set_database_locations command to tell DRAM where to find the databases that were downloaded and set up during your first attempt, and then run DRAM-setup.py update_description_db.
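In case it helps, here is a minimal sketch of that recovery workflow (the subcommand names are the ones mentioned above; any flags needed by set_database_locations should be taken from its --help output rather than from this sketch):

# 1. See which database locations DRAM already has recorded
DRAM-setup.py print_config

# 2a. If all locations are set, only the description database needs rebuilding
DRAM-setup.py update_description_db

# 2b. If locations are missing, first point DRAM at the files from the earlier attempt
DRAM-setup.py set_database_locations --help    # lists the per-database location flags
DRAM-setup.py update_description_db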
We have plans to add the ability to resume setup after it fails, but this is not yet possible.
Thank you @shafferm for your help! I have an idea now of how much longer it's gonna take to finish building the database.
@ankit4035 Just an update: I assigned 502 GB of memory and 32 cores for the full database building process (including UniRef), and the description_db.sqlite file ended up at 17.89 GB after two days of running. So with the above settings, 3 days should be sufficient for the whole process. Make sure you use the --thread flag when running DRAM-setup.py.
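For anyone repeating this on a cluster, a hedged sketch of the kind of invocation described above (prepare_databases and --output_dir come from the DRAM setup documentation; the thread-count flag spelling and the output path are assumptions to confirm with DRAM-setup.py prepare_databases --help):

# Full database build (including UniRef) using 32 cores; expect the
# description_db.sqlite step alone to run for many hours and reach roughly 20 GB
DRAM-setup.py prepare_databases --output_dir ~/dram_db --threads 32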
FYI, this will be possible with dram2, depending on how you use it; for example, it will be possible to use dram2 as a Snakemake module.