
Invalid database read for database data file

Open lyy-369 opened this issue 6 months ago • 5 comments

Invalid database read for database data file=/home/lyy/data/db_folder/uniref30_2302_db.idx, database index=/home/lyy/data/db_folder/uniref30_2302_db.idx.index
getData: local id (4294967295) >= db size (19)

Error: Prefilter died
Traceback (most recent call last):
  File "/home/lyy/data/db_folder/localcolabfold/colabfold-conda/bin/colabfold_search", line 8, in <module>
    sys.exit(main())
  File "/home/lyy/data/db_folder/localcolabfold/colabfold-conda/lib/python3.10/site-packages/colabfold/mmseqs/search.py", line 319, in main
    mmseqs_search_monomer(
  File "/home/lyy/data/db_folder/localcolabfold/colabfold-conda/lib/python3.10/site-packages/colabfold/mmseqs/search.py", line 91, in mmseqs_search_monomer
    run_mmseqs(mmseqs, ["search", base.joinpath("qdb"), dbbase.joinpath(uniref_db), base.joinpath("res"), base.joinpath("tmp"), "--threads", str(threads)] + search_param)
  File "/home/lyy/data/db_folder/localcolabfold/colabfold-conda/lib/python3.10/site-packages/colabfold/mmseqs/search.py", line 25, in run_mmseqs
    subprocess.check_call([mmseqs] + params)
  File "/home/lyy/data/db_folder/localcolabfold/colabfold-conda/lib/python3.10/subprocess.py", line 369, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '[PosixPath('mmseqs'), 'search', PosixPath('msas/qdb'), PosixPath('/home/lyy/data/db_folder/uniref30_2302_db'), PosixPath('msas/res'), PosixPath('msas/tmp'), '--threads', '64', '--num-iterations', '3', '--db-load-mode', '0', '-a', '-e', '0.1', '--max-seqs', '10000', '--k-score', "'seq:96,prof:80'"]' returned non-zero exit status 1.

lyy-369 avatar Jul 02 '25 10:07 lyy-369
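
A side note on the numbers in the error above, as a minimal sketch: the reported local id 4294967295 is exactly the largest unsigned 32-bit value, which is also what -1 becomes when stored in an unsigned 32-bit field. Reading it as a "lookup failed" marker rather than a real entry number is an assumption about MMseqs2 internals that this thread does not confirm; the snippet only checks the arithmetic.

```python
# Values copied from the error message in this issue.
reported_id = 4294967295
db_size = 19

# Largest value an unsigned 32-bit integer can hold.
UINT32_MAX = 2**32 - 1

# -1 wrapped into the unsigned 32-bit range gives the same number.
wrapped_minus_one = (-1) % 2**32

is_uint32_max = reported_id == UINT32_MAX
out_of_range = reported_id >= db_size  # the condition the error reports

print(is_uint32_max, wrapped_minus_one == reported_id, out_of_range)
```

If that reading is right, the id was never a valid position in the 19-entry index, which fits the index having been built without the data the CPU prefilter looks up.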

I think you are using a GPU database with MMseqs2-CPU.

If you remove --index-subset 2 from this script: https://github.com/sokrypton/ColabFold/blob/747aa90a5bac8b12c58292142dc445354bd3c36a/setup_databases.sh#L94

The database will work for both CPU and GPU.

milot-mirdita avatar Jul 02 '25 10:07 milot-mirdita

I think you are using a GPU database with MMseqs2-CPU.

If you remove --index-subset 2 from this script:

ColabFold/setup_databases.sh, line 94 in 747aa90:

GPU_INDEX_PAR=" --split 1 --index-subset 2"

The database will work for both CPU and GPU.

Thanks, I will try.
rm -f uniref30_2302_db.idx*
mmseqs createindex uniref30_2302_db idx_tmp --remove-tmp-files 1 --split 1

lyy-369 avatar Jul 03 '25 01:07 lyy-369
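
For anyone scripting the same rebuild, here is a rough Python sketch of the two commands above. It only reuses the flags posted in this thread. The function names and the `run` parameter are hypothetical helpers for illustration and are not part of ColabFold or MMseqs2.

```python
import subprocess
from pathlib import Path


def stale_index_files(db: Path) -> list[Path]:
    """Return files matching <db>.idx*, like the shell glob uniref30_2302_db.idx*."""
    return sorted(db.parent.glob(db.name + ".idx*"))


def createindex_command(db: Path, tmp_dir: str = "idx_tmp") -> list[str]:
    """Build the createindex call posted above (note: no --index-subset 2)."""
    return [
        "mmseqs", "createindex", str(db), tmp_dir,
        "--remove-tmp-files", "1",
        "--split", "1",
    ]


def rebuild_index(db: Path, run=subprocess.check_call) -> list[str]:
    """Delete the old index files, then run createindex through `run`."""
    for path in stale_index_files(db):
        path.unlink()
    cmd = createindex_command(db)
    run(cmd)  # raises CalledProcessError if mmseqs exits non-zero
    return cmd
```

Calling `rebuild_index(Path("uniref30_2302_db"))` from inside the database folder should be equivalent to the two shell commands, assuming `mmseqs` is on the PATH. Passing a different `run` callable makes it possible to inspect the command without executing it.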


Can you clarify this a bit more (i.e., the hardware-agnostic database)? Is this something you have tried and would advise others to do? Thanks.

eborobert avatar Jul 03 '25 17:07 eborobert

Yes, we developed the GPU databases so that they also work with MMseqs2-CPU.

Because the sequences are ordered differently, MMseqs2-CPU can return slightly different results when the database does not fully fit into system memory and has to be processed in chunks. Otherwise it's the same as setting up the CPU databases without GPU=1.

The --index-subset 2 parameter omits all data structures specific to the MMseqs2-CPU prefilter. This makes the file on disk considerably smaller, but the database then only works with the GPU or CPU ungappedprefilter, not with the CPU k-mer prefilter. If you don't specify this parameter, these data structures are included, so the database works with both algorithms (prefilter and ungappedprefilter).

milot-mirdita avatar Jul 04 '25 04:07 milot-mirdita
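
To summarize the trade-off described above in code form, here is a small sketch. It mirrors the GPU_INDEX_PAR value from setup_databases.sh and the compatibility described in this thread. The function names and the prefilter labels are made up for illustration and do not correspond to any MMseqs2 or ColabFold API.

```python
def index_params(gpu_only: bool) -> list[str]:
    """Index flags: always --split 1, plus --index-subset 2 for a GPU-only index."""
    params = ["--split", "1"]
    if gpu_only:
        # Smaller file on disk, but drops the CPU k-mer prefilter structures.
        params += ["--index-subset", "2"]
    return params


def usable_prefilters(params: list[str]) -> set[str]:
    """Which prefilters the resulting index supports, per the explanation above."""
    has_subset_2 = any(
        params[i:i + 2] == ["--index-subset", "2"]
        for i in range(len(params) - 1)
    )
    supported = {"GPU ungappedprefilter", "CPU ungappedprefilter"}
    if not has_subset_2:
        supported.add("CPU k-mer prefilter")
    return supported


print(usable_prefilters(index_params(gpu_only=True)))
print(usable_prefilters(index_params(gpu_only=False)))
```

In other words, leaving out --index-subset 2 costs disk space but keeps the CPU k-mer prefilter usable, which is the search that failed in the original report.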

Thanks for the clarification.

eborobert avatar Jul 04 '25 10:07 eborobert