a lighter database?

Open GaioTransposon opened this issue 2 years ago • 1 comments

Hi there and thank you for the tool,

Is there an option to download only part of the database? The full database (https://zenodo.org/record/5961398/files/db.tar.gz) is nearly 30 GB and takes about 12 hours to download. I am using bakta_db download --output . with bakta installed via conda.

What if one wants to use only one of the DBs (e.g. UniProtKB/Swiss-Prot: 2021_04)?

Kind Regards Dany

GaioTransposon avatar Feb 20 '22 13:02 GaioTransposon

Hi Dany, thanks for reaching out. Yes, the DB size is an issue for some users. Because we decided on a taxonomically untargeted approach and database, it has become fairly large.

The two largest parts of the DB are the PSC Diamond db (UniRef90 cluster representative sequences) and the SQLite db storing the ~200 million IPS sequence hashes (UniRef100) along with all pre-compiled annotations. Therefore, excluding all but one of the annotation DBs wouldn't significantly reduce the overall DB size.

One option to reduce the database size (that I have already thought about) is to compile sub-databases for certain phyla. Of course, that would require a fair amount of development, implementation and testing, so it would be a mid-term effort. If this is of interest to more users, we'd happily address it.

Another option would be to host the database on more servers distributed around the globe, which might provide more bandwidth and better download times. Might that help in your case? Do you know of any free hosting services that would be eligible?

Best regards, Oliver

oschwengers avatar Feb 21 '22 08:02 oschwengers

Another idea (inspired by @tseemann) is to use a ranked set of broader protein clusters. This could be done by leaving the IPS and PSC out of the normal database and using only a size-filtered subset of the PSCC.

A quick check on UniProt/UniRef50 revealed 2,660,356 UniRef50 proteins. I'd estimate that this would reduce the size of the entire database to roughly 3-4 GB.

oschwengers avatar Feb 16 '23 09:02 oschwengers

Hi @GaioTransposon, fyi: you might be interested in v1.7.0 which introduces a light database version as described in https://github.com/oschwengers/bakta/pull/196

This lightweight version is only 1.2 GB zipped and 3 GB unzipped.

oschwengers avatar Feb 24 '23 15:02 oschwengers
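
To put the sizes quoted in this thread into perspective, here is a rough back-of-the-envelope sketch in Python. It assumes the download rate implied by the original report (about 30 GB in 12 hours) stays the same, and it reads the stated 1.2 for the zipped light database as gigabytes. Both are assumptions; the real time depends on the connection and on the Zenodo servers. The function name is made up for illustration.

```python
def download_hours(size_gb: float, rate_gb_per_hour: float) -> float:
    """Estimated download time in hours for a file of the given size."""
    return size_gb / rate_gb_per_hour


# Figures taken from the thread; the constant rate is an assumption.
FULL_DB_GB = 30.0     # full database, "nearly 30GB"
FULL_DB_HOURS = 12.0  # reported download time for the full database
LIGHT_DB_GB = 1.2     # light database, zipped

# Effective rate implied by the original report.
rate = FULL_DB_GB / FULL_DB_HOURS

light_minutes = download_hours(LIGHT_DB_GB, rate) * 60
print(f"implied rate: {rate:.1f} GB/h")
print(f"light DB at that rate: about {light_minutes:.0f} minutes")
```

Under these assumptions the light database would download in about half an hour instead of half a day.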

EDIT: it was a faulty conda installation (I think scales was missing). It's working now :), and the latest biocontainer build now works as well :)

~~I just tried this with 1.7.0 but I get the following error (both with the bioconda install via conda and with the corresponding Singularity biocontainer)~~

$ bakta_db download --type light
Bakta software version: 1.7.0
Required database schema version: 5

fetch DB versions...
	... compatible DB versions: 1
download database: v5.0, type=light, 2023-02-20, DOI: 10.5281/zenodo.7669534, URL: https://zenodo.org/record/7669534/files/db-light.tar.gz...
Traceback (most recent call last):
  File "/home/jfellows/.conda/envs/bakta/lib/python3.10/site-packages/alive_progress/core/configuration.py", line 91, in validator
    result = CONFIG_VARS[key](value)
KeyError: 'scale'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/jfellows/.conda/envs/bakta/bin/bakta_db", line 10, in <module>
    sys.exit(main())
  File "/home/jfellows/.conda/envs/bakta/lib/python3.10/site-packages/bakta/db.py", line 203, in main
    download(db_url, tarball_path)
  File "/home/jfellows/.conda/envs/bakta/lib/python3.10/site-packages/bakta/db.py", line 119, in download
    with alive_bar(total=total_length, scale='SI') as bar:
  File "/home/jfellows/.conda/envs/bakta/lib/python3.10/contextlib.py", line 135, in __enter__
    return next(self.gen)
  File "/home/jfellows/.conda/envs/bakta/lib/python3.10/site-packages/alive_progress/core/progress.py", line 95, in alive_bar
    config = config_handler(**options)
  File "/home/jfellows/.conda/envs/bakta/lib/python3.10/site-packages/alive_progress/core/configuration.py", line 82, in create_context
    local_config.update(_parse(theme, options))
  File "/home/jfellows/.conda/envs/bakta/lib/python3.10/site-packages/alive_progress/core/configuration.py", line 106, in _parse
    return {k: validator(k, v) for k, v in options.items()}
  File "/home/jfellows/.conda/envs/bakta/lib/python3.10/site-packages/alive_progress/core/configuration.py", line 106, in <dictcomp>
    return {k: validator(k, v) for k, v in options.items()}
  File "/home/jfellows/.conda/envs/bakta/lib/python3.10/site-packages/alive_progress/core/configuration.py", line 96, in validator
    raise ValueError('invalid config name: {}'.format(key))
ValueError: invalid config name: scale

~~Did I miss something in my command, for example?~~

~~Conda environment creation: conda create -n bakta -c bioconda bakta~~

jfy133 avatar Mar 02 '23 11:03 jfy133

Yes, the third-party dependencies needed an update. It should work now.

oschwengers avatar Mar 02 '23 13:03 oschwengers
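
For anyone who hits the same traceback: the error is raised because the installed alive_progress rejects the scale option that bakta passes to alive_bar. The sketch below is one way to check the installed version before retrying. It assumes that scale first became available in alive-progress 3.0, which is my recollection of that library's release notes and is not stated in this thread, so please verify it. The helper names are made up for illustration.

```python
import re
from importlib.metadata import PackageNotFoundError, version

# Assumption: the `scale` option of alive_bar first appeared in alive-progress 3.0.
MIN_VERSION_WITH_SCALE = (3, 0)


def parse_major_minor(version_string: str) -> tuple:
    """Return (major, minor) from a version string such as '2.4.1'."""
    match = re.match(r"(\d+)(?:\.(\d+))?", version_string)
    if match is None:
        raise ValueError(f"unrecognised version string: {version_string!r}")
    return int(match.group(1)), int(match.group(2) or 0)


def supports_scale(version_string: str) -> bool:
    """True if this alive-progress version should accept scale='SI'."""
    return parse_major_minor(version_string) >= MIN_VERSION_WITH_SCALE


def installed_alive_progress_version():
    """Installed alive-progress version, or None if it is not installed."""
    try:
        return version("alive-progress")
    except PackageNotFoundError:
        return None


found = installed_alive_progress_version()
if found is None:
    print("alive-progress is not installed in this environment")
elif supports_scale(found):
    print(f"alive-progress {found} should accept the scale option")
else:
    print(f"alive-progress {found} looks too old; try updating the environment")
```

Run it with the Python interpreter of the environment in which bakta is installed, otherwise it inspects the wrong set of packages.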