[BUG] COG setup on offline HPC

Open MjelleLab opened this issue 3 months ago • 1 comments

I have downloaded all the COG files from NCBI, but I am not able to set up the database since I am on an offline HPC. Any idea how this can be done?

anvi'o version 8

MjelleLab avatar Sep 23 '25 19:09 MjelleLab

Hey @MjelleLab, we have some instructions here in the subsection "for manual downloads of the COG data (for COG 2020)":

https://anvio.org/help/main/programs/anvi-setup-ncbi-cogs/

But while going through those I realized that they are not very helpful for your case, so I changed the program to include a --dry-run parameter that tells you exactly what you need to know. Unfortunately, this solution is currently only on the anvio-dev branch, so it is not available to you :/

But essentially, this is what I get when I run the command on my end for COG20 (which is the latest version of COGs you have access to in v8, I believe):

anvi-setup-ncbi-cogs --cog-version COG20 --dry-run
DRY RUN MODE WILL NOW TELL YOU THINGS AND QUIT
===============================================
The following information will tell you what anvi'o would have done to acquire
the raw files from the NCBI to set them up on your computer. Using this
information you should be able to download these files, and move them manually
to the locations shown below. Please remember that the data shown below will
depend on teh COG version of your choosing (or, if you haven't specified any,
the default COG version). Before you start following the instructions below,
make sure the following directory exists:

    - /Users/meren/github/anvio/anvio/data/misc/COG/COG20/RAW_DATA_FROM_NCBI

Then execute the following steps like a robot 🤖

Download this file ...........................: ftp://ftp.ncbi.nlm.nih.gov/pub/COG/COG2020/data/cog-20.cog.csv
 -> and move it here .........................: /Users/meren/github/anvio/anvio/data/misc/COG/COG20/RAW_DATA_FROM_NCBI/cog-20.cog.csv

Download this file ...........................: ftp://ftp.ncbi.nlm.nih.gov/pub/COG/COG2020/data/cog-20.def.tab
 -> and move it here .........................: /Users/meren/github/anvio/anvio/data/misc/COG/COG20/RAW_DATA_FROM_NCBI/cog-20.def.tab

Download this file ...........................: ftp://ftp.ncbi.nlm.nih.gov/pub/COG/COG2020/data/fun-20.tab
 -> and move it here .........................: /Users/meren/github/anvio/anvio/data/misc/COG/COG20/RAW_DATA_FROM_NCBI/fun-20.tab

Download this file ...........................: ftp://ftp.ncbi.nlm.nih.gov/pub/COG/COG2020/data/checksums.md5.txt
 -> and move it here .........................: /Users/meren/github/anvio/anvio/data/misc/COG/COG20/RAW_DATA_FROM_NCBI/checksum.md5.txt

Download this file ...........................: ftp://ftp.ncbi.nlm.nih.gov/pub/COG/COG2020/data/cog-20.fa.gz
 -> and move it here .........................: /Users/meren/github/anvio/anvio/data/misc/COG/COG20/RAW_DATA_FROM_NCBI/cog-20.fa.gz

Once you have all these files in those locations, `anvi-setup-ncbi-cogs` should
not attempt to download anything, but process the existing files at these
specific locations to setup NCBI COGs on your system in peace.

But this is difficult to follow, because the directory you will have to move these files to is different on your system, and I'm not sure where it is in your installation. But here is a trick. I will specify a placeholder directory path that you can easily replace with whatever the right directory is for your offline HPC:

anvi-setup-ncbi-cogs --cog-version COG20 --dry-run --cog-data-dir /XXXX
DRY RUN MODE WILL NOW TELL YOU THINGS AND QUIT
===============================================
The following information will tell you what anvi'o would have done to acquire
the raw files from the NCBI to set them up on your computer. Using this
information you should be able to download these files, and move them manually
to the locations shown below. Please remember that the data shown below will
depend on teh COG version of your choosing (or, if you haven't specified any,
the default COG version). Before you start following the instructions below,
make sure the following directory exists:

    - /XXXX/COG20/RAW_DATA_FROM_NCBI

Then execute the following steps like a robot 🤖

Download this file ...........................: ftp://ftp.ncbi.nlm.nih.gov/pub/COG/COG2020/data/cog-20.cog.csv
 -> and move it here .........................: /XXXX/COG20/RAW_DATA_FROM_NCBI/cog-20.cog.csv

Download this file ...........................: ftp://ftp.ncbi.nlm.nih.gov/pub/COG/COG2020/data/cog-20.def.tab
 -> and move it here .........................: /XXXX/COG20/RAW_DATA_FROM_NCBI/cog-20.def.tab

Download this file ...........................: ftp://ftp.ncbi.nlm.nih.gov/pub/COG/COG2020/data/fun-20.tab
 -> and move it here .........................: /XXXX/COG20/RAW_DATA_FROM_NCBI/fun-20.tab

Download this file ...........................: ftp://ftp.ncbi.nlm.nih.gov/pub/COG/COG2020/data/checksums.md5.txt
 -> and move it here .........................: /XXXX/COG20/RAW_DATA_FROM_NCBI/checksum.md5.txt

Download this file ...........................: ftp://ftp.ncbi.nlm.nih.gov/pub/COG/COG2020/data/cog-20.fa.gz
 -> and move it here .........................: /XXXX/COG20/RAW_DATA_FROM_NCBI/cog-20.fa.gz

Once you have all these files in those locations, `anvi-setup-ncbi-cogs` should
not attempt to download anything, but process the existing files at these
specific locations to setup NCBI COGs on your system in peace.

In this case you need to figure out where you want your files to go on the HPC, and replace /XXXX in the output above with that directory path.

But how to do that? If you want your COG data to live in a specific place, it is easy (but less convenient). If you want your COG data to live in its default place, it is slightly less easy (but much more convenient in the long run, since the former will require you to use --cog-data-dir to specify the location of the COG data every time you use anvi-run-ncbi-cogs).
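If it helps, here is a minimal sketch that prints the same download/destination pairs as the dry-run output above for any directory you choose. The function name `cog20_targets` is made up for this example and is not part of anvi'o; the URLs and file names are copied from the output above.

```shell
# Sketch only: print "source URL <TAB> destination path" for each COG20 file.
# `cog20_targets` is a hypothetical helper, not an anvi'o program.
cog20_targets() (
    cog_data_dir="$1"   # the directory that replaces /XXXX
    raw_dir="$cog_data_dir/COG20/RAW_DATA_FROM_NCBI"
    base_url="ftp://ftp.ncbi.nlm.nih.gov/pub/COG/COG2020/data"

    # NCBI file name on the left, destination file name on the right.
    # Only the checksum file changes its name (checksums -> checksum).
    for pair in \
        "cog-20.cog.csv:cog-20.cog.csv" \
        "cog-20.def.tab:cog-20.def.tab" \
        "fun-20.tab:fun-20.tab" \
        "checksums.md5.txt:checksum.md5.txt" \
        "cog-20.fa.gz:cog-20.fa.gz"
    do
        src="${pair%%:*}"
        dst="${pair##*:}"
        printf '%s\t%s\n' "$base_url/$src" "$raw_dir/$dst"
    done
)

# Example (the path is a placeholder):
#   cog20_targets /path/on/your/hpc/cog-data
```

Running it with your own directory should give you a list you can follow on a machine with internet access before transferring the files to the HPC.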

To figure out the right HPC directory, first run this command in your conda environment for anvi'o on the HPC:

ls $CONDA_PREFIX/lib/python3.10/site-packages/anvio/data/misc/

If you don't get a "no such file or directory" error, then you are golden. First run this command:

mkdir -p $CONDA_PREFIX/lib/python3.10/site-packages/anvio/data/misc/COG/COG20/RAW_DATA_FROM_NCBI/
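A note on the python3.10 part of the path: it assumes the conda environment was built with Python 3.10. If yours differs, a small sketch like the one below could locate the directory without hardcoding the version. The helper name `find_anvio_misc` is hypothetical, and the sketch assumes anvi'o was installed into site-packages under the conda prefix.

```shell
# Sketch only: find anvio/data/misc under a conda prefix for any python3.x.
# `find_anvio_misc` is a hypothetical helper, not an anvi'o program.
find_anvio_misc() (
    prefix="$1"
    for d in "$prefix"/lib/python3.*/site-packages/anvio/data/misc; do
        # If the glob matches nothing, $d is the literal pattern and -d fails.
        if [ -d "$d" ]; then
            printf '%s\n' "$d"
            exit 0
        fi
    done
    exit 1
)

# Example, inside the activated anvi'o environment:
#   misc_dir="$(find_anvio_misc "$CONDA_PREFIX")" \
#       && mkdir -p "$misc_dir/COG/COG20/RAW_DATA_FROM_NCBI"
```

If the function prints nothing and returns a non-zero status, anvi'o is probably installed somewhere else (for example from a git clone), and the default location will be different.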

And then follow these instructions (where /XXXX has already been replaced with the default directory path):

Download this file ...........................: ftp://ftp.ncbi.nlm.nih.gov/pub/COG/COG2020/data/cog-20.cog.csv
 -> and move it here .........................: $CONDA_PREFIX/lib/python3.10/site-packages/anvio/data/misc/COG/COG20/RAW_DATA_FROM_NCBI/cog-20.cog.csv

Download this file ...........................: ftp://ftp.ncbi.nlm.nih.gov/pub/COG/COG2020/data/cog-20.def.tab
 -> and move it here .........................: $CONDA_PREFIX/lib/python3.10/site-packages/anvio/data/misc/COG/COG20/RAW_DATA_FROM_NCBI/cog-20.def.tab

Download this file ...........................: ftp://ftp.ncbi.nlm.nih.gov/pub/COG/COG2020/data/fun-20.tab
 -> and move it here .........................: $CONDA_PREFIX/lib/python3.10/site-packages/anvio/data/misc/COG/COG20/RAW_DATA_FROM_NCBI/fun-20.tab

Download this file ...........................: ftp://ftp.ncbi.nlm.nih.gov/pub/COG/COG2020/data/checksums.md5.txt
 -> and move it here .........................: $CONDA_PREFIX/lib/python3.10/site-packages/anvio/data/misc/COG/COG20/RAW_DATA_FROM_NCBI/checksum.md5.txt

Download this file ...........................: ftp://ftp.ncbi.nlm.nih.gov/pub/COG/COG2020/data/cog-20.fa.gz
 -> and move it here .........................: $CONDA_PREFIX/lib/python3.10/site-packages/anvio/data/misc/COG/COG20/RAW_DATA_FROM_NCBI/cog-20.fa.gz
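Since you already have the files downloaded, the moves above could be scripted roughly as follows. This is only a sketch: `place_cog20_files` is a made-up helper name, and it assumes all five files sit in one staging directory under their original NCBI names.

```shell
# Sketch only: copy already-downloaded COG20 files into RAW_DATA_FROM_NCBI.
# `place_cog20_files` is a hypothetical helper, not an anvi'o program.
place_cog20_files() (
    staging="$1"    # directory holding the files as downloaded from NCBI
    raw_dir="$2"    # .../COG/COG20/RAW_DATA_FROM_NCBI
    mkdir -p "$raw_dir" || exit 1

    # These four keep their names.
    for f in cog-20.cog.csv cog-20.def.tab fun-20.tab cog-20.fa.gz; do
        cp "$staging/$f" "$raw_dir/$f" || exit 1
    done

    # Per the instructions above, checksums.md5.txt lands as checksum.md5.txt.
    cp "$staging/checksums.md5.txt" "$raw_dir/checksum.md5.txt" || exit 1
)

# Example (the staging path is a placeholder):
#   place_cog20_files ~/cog-downloads \
#       "$CONDA_PREFIX/lib/python3.10/site-packages/anvio/data/misc/COG/COG20/RAW_DATA_FROM_NCBI"
```

The one detail that is easy to miss by hand is the checksum file, which is saved without the "s" in its name.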

After doing this, you should be able to run the following command on your HPC, and everything should work:

anvi-setup-ncbi-cogs --cog-version COG20

Please let us know if it works (or doesn't).

meren avatar Sep 24 '25 08:09 meren