ColabFold required DBs for GPU server setup

Main question: what must be downloaded to the local disk for a gpuserver-mode setup?

Regarding the optional GPU server for enhanced performance:

I'm confused about what is loaded into GPU vRAM compared to what sits under the local /path/to/db on disk. GPU vRAM is nowhere near the size of the local SSD, so what am I missing?

GPU vRAM is ~15GB for colabfold_envdb_202108_db and uniref30_2302_db (mmseqs gpuserver ./colabfold_envdb_202108_db ...)

Local disk usage is ~800GB for both DBs:

(.venv) [ec2-user@ip-10-4-28-230 db]$ du -ch uniref30_2302_db_seq* | grep total$
182G    total
(.venv) [ec2-user@ip-10-4-28-230 db]$ du -ch colabfold_envdb_202108_db* | grep total$
597G    total
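
Note that the first du call above only matches files whose names start with uniref30_2302_db_seq, so it does not cover the whole uniref30 database. To see which files dominate each database, a per-file breakdown may be more useful than a single total. A minimal sketch, assuming GNU du and sort; db_breakdown is a hypothetical helper name, not part of ColabFold or MMseqs2:

```shell
# Hypothetical helper: list every file sharing a database prefix,
# smallest first, so the grand total ends up on the last line.
# Assumes GNU coreutils (du --apparent-size, sort -h).
db_breakdown() {
    du -ch --apparent-size "$1"* | sort -h
}
```

It could be called from the db directory as db_breakdown uniref30_2302_db or db_breakdown colabfold_envdb_202108_db.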

Using colabfold_search, I'm specifying both the local DB path and the gpu-server option: colabfold_search <input.fasta> <db_path> <results> --gpu 1 --gpu-server 1

I used setup_databases.sh from the ColabFold repo, which downloads all DBs?

Aug 03 '25 17:08 eyal-converge

Only the contents of uniref30_2302_db (8.2 GB) and colabfold_envdb_202108 (36 GB) need to be in VRAM. These are the cluster consensus sequences that are searched against on GPU. The rest of the workflow (expand, alignment) still runs on CPU and should be sufficiently fast if the remaining database files are loaded from a fast local drive (a somewhat modern NVMe); they shouldn't need to be fully in RAM.
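
As a rough sanity check, the two figures above add up to 8.2 + 36 = 44.2 GB, which can be compared with what the GPU reports. A minimal sketch, assuming the sizes are decimal GB and that nvidia-smi reports memory in MiB; vram_check is a hypothetical helper, and whether a database larger than a single GPU's memory can still be served is not covered by this check:

```shell
# Hypothetical helper: compare required VRAM (decimal GB) with available VRAM (MiB).
# Usage: vram_check <needed_gb> <available_mib>
vram_check() {
    awk -v need_gb="$1" -v have_mib="$2" 'BEGIN {
        have_gb = have_mib * 1048576 / 1e9   # MiB -> decimal GB
        verdict = (have_gb >= need_gb) ? "fits" : "does not fit"
        printf "need %.1f GB, have %.1f GB: %s\n", need_gb, have_gb, verdict
    }'
}
```

The available figure could come from nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits, for example vram_check 44.2 "$(nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits | head -n 1)".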

Aug 03 '25 18:08 milot-mirdita

Thanks. The end goal is to create a reproducible container capable of handling dozens to hundreds of MSA calculations per run.

A quick clarifying question regarding colabfold_search usage with a GPU server: what are the minimum required files on disk?

Specifically, colabfold_envdb_202108_db.idx, colabfold_envdb_202108_db_seq, and .._aln take up almost 500GB, and similarly, uniref30_2302_x consumes approximately 400GB.

For context, setup_databases.sh says:

Set MMSEQS_NO_INDEX to skip the index creation step (not useful for colabfold_search in most cases)

(MMSEQS_NO_INDEX is not set by default.)

Aug 04 '25 09:08 eyal-converge