Issue with new anvi-setup-ncbi-cogs

Open Sirbius opened this issue 4 years ago • 20 comments

anvi-setup-ncbi-cogs gets stuck

Hi guys, it's me again. I'm trying to use the new awesome COG20 but ran into a problem with anvi-setup-ncbi-cogs. Since I had problems with the automatic download, I downloaded the files from the NCBI COG database with wget and pointed to the folder with --cog-data-dir, but it gets stuck at the BLAST search db step and gives back the prompt without saying anything. What else can I try? Thanks

:: anvi'o v7 ::  /share/Groups/Pathology >>> anvi-setup-ncbi-cogs --cog-version COG20 --cog-data-dir ./COG-DATA-DIR -T 16 --just-do-it
COG version ..................................: COG20
COG data source ..............................: The command line parameter.
COG base directory ...........................: /share/Groups/Pathology/COG-DATA-DIR

warning
===============================================
This program will first check whether you have all the raw files, and then will
attempt to regenerate everything that is necessary from them.

Diamond log ..................................: /share/Groups/Pathology/COG-DATA-DIR/COG20/DB_DIAMOND/log.txt
Diamond search db ............................: /share/Groups/Pathology/COG-DATA-DIR/COG20/DB_DIAMOND/COG.dmnd
BLAST log ....................................: /share/Groups/Pathology/COG-DATA-DIR/COG20/DB_BLAST/log.txt
BLAST search db ..............................: /tmp/tmp0iyq1zm5   

anvi'o version

:: anvi'o v7 ::  /share/Groups/Pathology >>> anvi-self-test --version

Anvi'o .......................................: hope (v7)
Profile database .............................: 35
Contigs database .............................: 20
Pan database .................................: 14
Genome data storage ..........................: 7
Auxiliary data storage .......................: 2
Structure database ...........................: 2
Metabolic modules database ...................: 2
tRNA-seq database ............................: 1

System info

I downloaded the anvi'o docker container on our server running CentOS 7.

Sirbius avatar Jan 22 '21 11:01 Sirbius

Hey @Sirbius, let's see if we can figure out your problem :) It looks like the setup script was able to find the raw files you downloaded, which is great.

Can you possibly run this command again with the --debug flag and let me know what you see in the output? Also, after this program gives back the prompt, if you look in /share/Groups/Pathology/COG-DATA-DIR/COG20/DB_BLAST/COG/, do you see several BLAST database files like COG.fa.00.phr, COG.fa.00.pin, etc.? And what does it say in /share/Groups/Pathology/COG-DATA-DIR/COG20/DB_BLAST/log.txt?

A side note for anyone else from the public who wants to use this hack to sidestep the automatic download: the COG setup script expects the raw NCBI data to be in a specific folder called 'RAW_DATA_FROM_NCBI' inside the version folder (here COG20) of your --cog-data-dir folder. That means that after downloading the NCBI files with wget, you should move them into this directory structure, kind of like this:

cd COG-DATA-DIR
mkdir -p COG20/RAW_DATA_FROM_NCBI
mv cog-20.cog.csv COG20/RAW_DATA_FROM_NCBI
mv cog-20.def.tab COG20/RAW_DATA_FROM_NCBI
mv fun-20.tab COG20/RAW_DATA_FROM_NCBI
mv cog-20.fa.gz COG20/RAW_DATA_FROM_NCBI

Otherwise anvi'o will not be able to find them. (The four files in the mv commands are the ones that anvi'o expects for COG20.)
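
If it helps, here is a minimal sketch for checking that the raw files are in place before running the setup. It is not part of anvi'o: the function name `missing_raw_files` is made up for illustration, and the only facts it relies on are the four COG20 file names listed above.

```python
from pathlib import Path

# The four raw files named in this thread for COG20.
EXPECTED_COG20_FILES = (
    "cog-20.cog.csv",
    "cog-20.def.tab",
    "fun-20.tab",
    "cog-20.fa.gz",
)


def missing_raw_files(raw_dir, expected=EXPECTED_COG20_FILES):
    """Return the expected file names that are absent or empty in raw_dir.

    Hypothetical helper for illustration only; anvi'o does its own checks.
    """
    raw_dir = Path(raw_dir)
    missing = []
    for name in expected:
        path = raw_dir / name
        # A zero-byte file is treated as missing, since it cannot be valid data.
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(name)
    return missing
```

For example, `missing_raw_files("COG-DATA-DIR/COG20/RAW_DATA_FROM_NCBI")` (the layout seen in the setup logs in this thread) should return an empty list when all four files are present. It says nothing about whether the files are complete, only that they exist.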

ivagljiva avatar Jan 22 '21 17:01 ivagljiva

(Also @Sirbius - if you don't mind saying, what sort of issues are you having with the automatic download? If it is not a server-specific problem, we may be able to solve it)

ivagljiva avatar Jan 22 '21 18:01 ivagljiva

Hi @ivagljiva, thank you for your reply. I should say that only the fa.gz file gives me trouble with the automatic download; there is no problem with the others, and they are automatically downloaded into the proper folder, inside RAW_DATA_FROM_NCBI. This is the output with the --debug option, but I guess it's not so informative:

:: anvi'o v7 ::  /home/silviat/Pathology >>> anvi-setup-ncbi-cogs --cog-version COG20 --cog-data-dir ./COG-DATA-DIR -T 16 --debug
COG version ..................................: COG20
COG data source ..............................: The command line parameter.
COG base directory ...........................: /home/silviat/Pathology/COG-DATA-DIR

WARNING
===============================================
This program will first check whether you have all the raw files, and then will
attempt to regenerate everything that is necessary from them.

Press ENTER to continue, or press CTRL + C to cancel...

Diamond log ..................................: /home/silviat/Pathology/COG-DATA-DIR/COG20/DB_DIAMOND/log.txt                
                                                                                                                             
[DEBUG] `run_command` is running .............: diamond makedb --in /tmp/tmprubo0vms -d
                                                /home/silviat/Pathology/COG-DATA-DIR/COG20/DB_DIAMOND/COG -p 16

Diamond search db ............................: /home/silviat/Pathology/COG-DATA-DIR/COG20/DB_DIAMOND/COG.dmnd               
BLAST log ....................................: /home/silviat/Pathology/COG-DATA-DIR/COG20/DB_BLAST/log.txt
                                                                                                                             
[DEBUG] `run_command` is running .............: makeblastdb -in /tmp/tmprubo0vms -dbtype prot -out
                                                /home/silviat/Pathology/COG-DATA-DIR/COG20/DB_BLAST/COG/COG.fa

BLAST search db ..............................: /tmp/tmprubo0vms 

This is the content of COG-DATA-DIR/COG20:

:: anvi'o v7 :: /home/silviat/Pathology/COG-DATA-DIR/COG20 >>> ls
CATEGORIES.txt	COG.txt  DB_BLAST  DB_DIAMOND  MISSING_COG_IDs.cPickle	PID-TO-CID.cPickle  RAW_DATA_FROM_NCBI

And this is the content of RAW_DATA_FROM_NCBI.
cog-20.cog.csv cog-20.def.tab cog-20.fa.gz fun-20.tab

And there is no DB_BLAST folder!

When I run the automatic download I get this error:

anvi-setup-ncbi-cogs --cog-version COG20  -T 16 --debug
COG version ..................................: COG20
COG data source ..............................: The anvi'o default.
COG base directory ...........................: /opt/conda/envs/anvioenv/lib/python3.6/site-packages/anvio/data/misc/COG

WARNING
===============================================
This program will first check whether you have all the raw files, and then will
attempt to regenerate everything that is necessary from them.

Press ENTER to continue, or press CTRL + C to cancel...

Downloaded successfully ......................: /opt/conda/envs/anvioenv/lib/python3.6/site-packages/anvio/data/misc/COG/COG20/RAW_DATA_FROM_NCBI/cog-20.cog.csv
Downloaded successfully ......................: /opt/conda/envs/anvioenv/lib/python3.6/site-packages/anvio/data/misc/COG/COG20/RAW_DATA_FROM_NCBI/cog-20.def.tab
Downloaded successfully ......................: /opt/conda/envs/anvioenv/lib/python3.6/site-packages/anvio/data/misc/COG/COG20/RAW_DATA_FROM_NCBI/fun-20.tab
Downloaded successfully ......................: /opt/conda/envs/anvioenv/lib/python3.6/site-packages/anvio/data/misc/COG/COG20/RAW_DATA_FROM_NCBI/cog-20.fa.gz
                                                                                
Traceback for debugging
================================================================================
  File "/opt/conda/envs/anvioenv/bin/anvi-setup-ncbi-cogs", line 47, in <module>
    setup.create()
  File "/opt/conda/envs/anvioenv/lib/python3.6/site-packages/anvio/cogs.py", line 617, in create
    self.setup_raw_data()
  File "/opt/conda/envs/anvioenv/lib/python3.6/site-packages/anvio/cogs.py", line 831, in setup_raw_data
    self.files[file_name]['func'](file_path, J(self.COG_data_dir, self.files[file_name]['formatted_file_name']))
  File "/opt/conda/envs/anvioenv/lib/python3.6/site-packages/anvio/cogs.py", line 757, in format_protein_db
    raise ConfigError(f"Something went wrong while decompressing the downloaded file :/ It is likely that "
================================================================================


Config Error: Something went wrong while decompressing the downloaded file :/ It is likely    
              that the download failed and only part of the file was downloaded. If you would 
              like to try again, please run the setup command with the flag `--reset`. Here is
              what the downstream library said: 'Error -3 while decompressing data: invalid   
              code lengths set'.                                                              

And in fact, the fa.gz file is not of the expected size (616MB):

ls -lh /opt/conda/envs/anvioenv/lib/python3.6/site-packages/anvio/data/misc/COG/COG20/RAW_DATA_FROM_NCBI/
total 336M
-rw-r--r-- 1 root root 334M Jan 22 18:48 cog-20.cog.csv
-rw-r--r-- 1 root root 364K Jan 22 18:48 cog-20.def.tab
-rw-r--r-- 1 root root 924K Jan 22 18:49 cog-20.fa.gz
-rw-r--r-- 1 root root 1.2K Jan 22 18:48 fun-20.tab
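
Besides comparing sizes, a partial cog-20.fa.gz can be spotted by simply trying to decompress it end to end. A minimal sketch, assuming only Python's standard `gzip` and `zlib` modules; the function name `gzip_is_intact` is made up for illustration and is not something anvi'o provides:

```python
import gzip
import zlib


def gzip_is_intact(path, chunk_size=1024 * 1024):
    """Return True if the whole gzip stream at `path` decompresses cleanly.

    The file is read in chunks and the output is discarded, so memory use
    stays small even for large archives.
    """
    try:
        with gzip.open(path, "rb") as handle:
            while handle.read(chunk_size):
                pass
    except (OSError, EOFError, zlib.error):
        # OSError covers a bad gzip header or CRC mismatch, EOFError a
        # truncated stream, zlib.error corrupted compressed data.
        return False
    return True
```

A file that stops short, like the 924K one above, should make this return False, which is the same class of failure the 'Error -3 while decompressing data' message points to.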

That's why I also tried downloading cog-20.fa.gz directly into the automatic folder /opt/conda/envs/anvioenv/lib/python3.6/site-packages/anvio/data/misc/COG/COG20/RAW_DATA_FROM_NCBI/ and ran the setup again, which got stuck at the same point as above.

I would like to add that yesterday at some point I got a different error, which unfortunately I did not save but I can retrieve part of it from my browser history, when I tried to understand it.

  File "/opt/conda/envs/anvioenv/lib/python3.6/site-packages/anvio/cogs.py", line 704, in format_cog_names
    COG, category, function, nn, pathway, pubmed_id, PDB_id = line.strip('\n').split('\t')
ValueError: too many values to unpack (expected 7)

Here is what the downstream library said: 'Error -3 while decompressing data: invalid code lengths set'.

I thought that maybe the new COG20 file format was different from the 2014 version, with more columns than expected, but after checking the files and also the cogs.py script, everything looked fine to me. Another incredible thing is that the COG-DATA-DIR/COG20/ folder I downloaded yesterday was named COG-DATA-DIR/COG14/ this morning!!! I got some ghosts in the server room I guess :P

Sirbius avatar Jan 22 '21 19:01 Sirbius

I think the best solution here is to delete everything under COG via

rm -rf /opt/conda/envs/anvioenv/lib/python3.6/site-packages/anvio/data/misc/COG/*

(while making extra sure there is no space between * and /)

And try again. The original file seems to be broken.

meren avatar Jan 22 '21 19:01 meren

Hi everyone, I have the very same problem. I've been trying to run anvi-setup-ncbi-cogs for 2 days now with more or less the same outputs as Sirbius... I also tried to download the COG database myself, with the same results.

lvelosuarez avatar Jan 22 '21 19:01 lvelosuarez

Sorry to bother... I reran it several times and in the end it worked... I don't know why it did not work yesterday and worked today, to be honest.

lvelosuarez avatar Jan 22 '21 19:01 lvelosuarez

Sorry to hear, @lvelosuarez, but I'm glad it worked eventually :/ Because we insist on using upstream data rather than storing it in our distribution, server connectivity issues between you and the upstream sometimes result in incomplete downloads, and anvi'o doesn't realize it's been a very long time and that it should simply try again from scratch.

Random developer idea: it would be excellent to see if we can add a timeout parameter to our downloader.
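
To make the idea concrete, here is a rough sketch of a download with a socket timeout and a few retries. This is not the anvi'o downloader: `download_with_retries` and its parameters are hypothetical names, and the only real API used is `urllib.request.urlopen` with its `timeout` argument from the standard library.

```python
import os
import shutil
import time
import urllib.request


def download_with_retries(url, dest, timeout=60, attempts=3, wait=5,
                          opener=urllib.request.urlopen):
    """Download `url` to `dest`, retrying when the connection fails or stalls.

    `timeout` is the socket timeout in seconds, so a stalled connection
    raises an error instead of hanging forever. `opener` exists only so the
    network call can be swapped out in tests.
    """
    part = f"{dest}.part"
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            with opener(url, timeout=timeout) as response, open(part, "wb") as out:
                shutil.copyfileobj(response, out)
            # Only a transfer that finished without errors gets the final name.
            os.replace(part, dest)
            return dest
        except OSError as error:
            # URLError and socket timeouts are both subclasses of OSError.
            last_error = error
            if os.path.exists(part):
                os.remove(part)
            if attempt < attempts:
                time.sleep(wait)
    raise RuntimeError(f"giving up on {url} after {attempts} attempts") from last_error
```

Note that a timeout only helps with stalled connections. A connection that closes early without an error can still leave a short file behind, so some integrity check on the result is still needed.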

meren avatar Jan 22 '21 19:01 meren

Ok, I've been trying to delete and re-run anvi-setup-ncbi-cogs and I always get this error:

anvi-setup-ncbi-cogs --cog-version COG20  -T 16 --debug --just-do-it
COG version ..................................: COG20
COG data source ..............................: The anvi'o default.
COG base directory ...........................: /opt/conda/envs/anvioenv/lib/python3.6/site-packages/anvio/data/misc/COG

WARNING
===============================================
This program will first check whether you have all the raw files, and then will
attempt to regenerate everything that is necessary from them.

Downloaded successfully ......................: /opt/conda/envs/anvioenv/lib/python3.6/site-packages/anvio/data/misc/COG/COG20/RAW_DATA_FROM_NCBI/cog-20.cog.csv
Downloaded successfully ......................: /opt/conda/envs/anvioenv/lib/python3.6/site-packages/anvio/data/misc/COG/COG20/RAW_DATA_FROM_NCBI/cog-20.def.tab
Downloaded successfully ......................: /opt/conda/envs/anvioenv/lib/python3.6/site-packages/anvio/data/misc/COG/COG20/RAW_DATA_FROM_NCBI/fun-20.tab
Downloaded successfully ......................: /opt/conda/envs/anvioenv/lib/python3.6/site-packages/anvio/data/misc/COG/COG20/RAW_DATA_FROM_NCBI/cog-20.fa.gz
[23 Jan 21 11:37:31 Formatting protein ids to COG ids file] 95.55%    ETA: None
Traceback (most recent call last):
  File "/opt/conda/envs/anvioenv/bin/anvi-setup-ncbi-cogs", line 47, in <module>
    setup.create()
  File "/opt/conda/envs/anvioenv/lib/python3.6/site-packages/anvio/cogs.py", line 617, in create
    self.setup_raw_data()
  File "/opt/conda/envs/anvioenv/lib/python3.6/site-packages/anvio/cogs.py", line 831, in setup_raw_data
    self.files[file_name]['func'](file_path, J(self.COG_data_dir, self.files[file_name]['formatted_file_name']))
  File "/opt/conda/envs/anvioenv/lib/python3.6/site-packages/anvio/cogs.py", line 659, in format_p_id_to_cog_id_cPickle
    p_id = fields[2].replace('.', '_')
IndexError: list index out of range

I also think that sometimes it just randomly works. I'll try again from the office with a better network connection, otherwise I'll just stick to COG14 :(

Sirbius avatar Jan 23 '21 11:01 Sirbius

Hi guys, just FYI. After today's random run, I found DB_BLAST/ and DB_DIAMOND/ inside COG20/ and could read the log.txt (same as the --debug output, I guess).

# DATE: 24 Jan 21 10:54:34
# CMD LINE: diamond makedb --in /tmp/tmpr9z384cn -d /home/silviat/Pathology/COG-DATA-DIR/COG20/DB_DIAMOND/COG -p 16
diamond v2.0.6.144 (C) Max Planck Society for the Advancement of Science
Documentation, support and updates available at http://www.diamondsearch.org

#CPU threads: 16
Scoring parameters: (Matrix=BLOSUM62 Lambda=0.267 K=0.041 Penalties=11/1)
Database input file: /tmp/tmpr9z384cn
Opening the database file...  [0.011s]
Loading sequences...  [4.949s]
Masking sequences...  [4.569s]
Writing sequences...  [11.918s]
Hashing sequences...  [0.306s]
Loading sequences...  [0.732s]
Masking sequences...  [0.658s]
Writing sequences...  [1.323s]
Hashing sequences...  [0.049s]
Loading sequences...  [0.001s]
Writing trailer...  [0.671s]
Closing the input file...  [0.001s]
Closing the database file...  [0.419s]
Database hash = 84f947b4825b1bf8eee04e8d019f368b
Processed 3213025 sequences, 1150770183 letters.
Total time = 25.632s
[24 Jan 21 10:55:00] diamond makedb cmd ...........................: diamond makedb --in
                                                /tmp/tmpr9z384cn -d
                                                /home/silviat/Pathology/COG-DATA-DIR/COG20/DB_DIAMOND/COG
                                                -p 16
[24 Jan 21 10:55:00] Diamond search db ............................: /home/silviat/Pathology/COG-DATA-DIR/COG20/DB_DIAMOND/COG.dmnd 

Since the script stops when creating the database, I thought I could just build everything on my own by downloading the files and then creating the db with the commands below, which worked perfectly. Could you tell me what else the script is supposed to run?

diamond makedb --in cog-20.fa -d /home/silviat/Pathology/COG-DATA-DIR/COG20/DB_DIAMOND/COG -p 32
makeblastdb -in cog-20.fa -dbtype prot -out /home/silviat/Pathology/COG-DATA-DIR/COG20/DB_BLAST/COG/COG.fa

But when I run anvi-run-ncbi-cogs --cog-version COG20 it says I have only COG14! And guess what, I found the ghost changing the folder name!
```
:: anvi'o v7 ::  /home/silviat/Pathology/COG-DATA-DIR >>> ls -R
.:
COG20

./COG20:
COG14

./COG20/COG14:
CATEGORIES.txt	DB_BLAST    MISSING_COG_IDs.cPickle  RAW_DATA_FROM_NCBI
COG.txt		DB_DIAMOND  PID-TO-CID.cPickle

./COG20/COG14/DB_BLAST:
COG  log.txt

./COG20/COG14/DB_BLAST/COG:
COG.fa.00.phr  COG.fa.00.psq  COG.fa.01.pin  COG.fa.pal  COG.fa.pot  COG.fa.pto
COG.fa.00.pin  COG.fa.01.phr  COG.fa.01.psq  COG.fa.pdb  COG.fa.ptf

./COG20/COG14/DB_DIAMOND:
COG.dmnd  log.txt

./COG20/COG14/RAW_DATA_FROM_NCBI:
cog-20.cog.csv  cog-20.def.tab  cog-20.fa  fun-20.tab
```

Sirbius avatar Jan 24 '21 17:01 Sirbius

> guess what, I found the ghost changing the folder name!

So how is this happening again? This should never happen:

./COG20:
COG14

Both COG14 and COG20 should be underneath the directory COG-DATA-DIR/. Perhaps there is a problem with user-specified directories :/ I will look into this now.

> Could you tell me what else the script is supposed to run?

The script runs a lot of other things to ensure integrity between files. It is not possible to do it manually :(

meren avatar Jan 24 '21 17:01 meren

Nope. It's not about that either. I was able to set up both versions of COG without any problem in separate directories underneath a user-defined path:

>>> anvi-setup-ncbi-cogs --cog-data-dir COGS-DATA-DIR -T 4 --just-do-it
COG version ..................................: COG20
COG data source ..............................: The command line parameter.
COG base directory ...........................: /Users/meren/github/anvio/COGS-DATA-DIR

WARNING
===============================================
This program will first check whether you have all the raw files, and then will
attempt to regenerate everything that is necessary from them.

Downloaded successfully ......................: /Users/meren/github/anvio/COGS-DATA-DIR/COG20/RAW_DATA_FROM_NCBI/cog-20.cog.csv
Downloaded successfully ......................: /Users/meren/github/anvio/COGS-DATA-DIR/COG20/RAW_DATA_FROM_NCBI/cog-20.def.tab
Downloaded successfully ......................: /Users/meren/github/anvio/COGS-DATA-DIR/COG20/RAW_DATA_FROM_NCBI/fun-20.tab
Downloaded successfully ......................: /Users/meren/github/anvio/COGS-DATA-DIR/COG20/RAW_DATA_FROM_NCBI/cog-20.fa.gz
Diamond log ..................................: /Users/meren/github/anvio/COGS-DATA-DIR/COG20/DB_DIAMOND/log.txt
Diamond search db ............................: /Users/meren/github/anvio/COGS-DATA-DIR/COG20/DB_DIAMOND/COG.dmnd
BLAST log ....................................: /Users/meren/github/anvio/COGS-DATA-DIR/COG20/DB_BLAST/log.txt
BLAST search db ..............................: /var/folders/gw/5mdblzs94gsb1ss44llgl3_h0000gn/T/tmpsermdulq


>>> anvi-setup-ncbi-cogs --cog-data-dir COGS-DATA-DIR -T 4 --just-do-it --cog-version COG14
COG version ..................................: COG14
COG data source ..............................: The command line parameter.
COG base directory ...........................: /Users/meren/github/anvio/COGS-DATA-DIR

WARNING
===============================================
This program will first check whether you have all the raw files, and then will
attempt to regenerate everything that is necessary from them.

Downloaded successfully ......................: /Users/meren/github/anvio/COGS-DATA-DIR/COG14/RAW_DATA_FROM_NCBI/cog2003-2014.csv
Downloaded successfully ......................: /Users/meren/github/anvio/COGS-DATA-DIR/COG14/RAW_DATA_FROM_NCBI/cognames2003-2014.tab
Downloaded successfully ......................: /Users/meren/github/anvio/COGS-DATA-DIR/COG14/RAW_DATA_FROM_NCBI/fun2003-2014.tab
Downloaded successfully ......................: /Users/meren/github/anvio/COGS-DATA-DIR/COG14/RAW_DATA_FROM_NCBI/prot2003-2014.fa.gz
Diamond log ..................................: /Users/meren/github/anvio/COGS-DATA-DIR/COG14/DB_DIAMOND/log.txt
Diamond search db ............................: /Users/meren/github/anvio/COGS-DATA-DIR/COG14/DB_DIAMOND/COG.dmnd
BLAST log ....................................: /Users/meren/github/anvio/COGS-DATA-DIR/COG14/DB_BLAST/log.txt
BLAST search db ..............................: /var/folders/gw/5mdblzs94gsb1ss44llgl3_h0000gn/T/tmppsge8nbs


>>> ls COGS-DATA-DIR/
COG14  COG20

 >>> ls COGS-DATA-DIR/COG14/
CATEGORIES.txt  COG.txt  DB_BLAST  DB_DIAMOND  MISSING_COG_IDs.cPickle  PID-TO-CID.cPickle  RAW_DATA_FROM_NCBI

>>> ls COGS-DATA-DIR/COG20/
CATEGORIES.txt  COG.txt  DB_BLAST  DB_DIAMOND  MISSING_COG_IDs.cPickle  PID-TO-CID.cPickle  RAW_DATA_FROM_NCBI

meren avatar Jan 24 '21 18:01 meren

This file, anvio/data/misc/COG/COG20/RAW_DATA_FROM_NCBI/cog-20.cog.csv, has this line:

CTC_RS10785,GCF_000007625.1,WP_035109085.1,876,303-876,574,COG0749,COG0749,1,570.0,1.0e-200,593,31-593 SE133-174 AT984_RS20530,GCF_001477625.1,WP_082680220.1,384,1-121,121,COG0745,COG0745,3,112.0,3.14e-29,229,2-116

That makes the code break. The file on ftp://ftp.ncbi.nih.gov/pub/COG/COG2020/data does not have this ID. I guess as a solution the user can do 2 things:

  1. wget directly from ftp
  2. update the code anvio/cogs.py to not break if the array index is not present.
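
Along the lines of the second option, one could also scan the file for malformed records before running the setup. Below is a minimal sketch, not anvi'o code: `find_malformed_lines` is a hypothetical name, and the default of 13 comma-separated fields is an assumption based on the intact cog-20.cog.csv records quoted in this thread.

```python
def find_malformed_lines(path, expected_fields=13, delimiter=","):
    """Return (line_number, field_count) pairs for records with an odd field count.

    expected_fields=13 is an assumption taken from the intact records quoted
    in this thread, not from any NCBI format specification.
    """
    problems = []
    # errors="replace" keeps the scan going even if corruption produced
    # bytes that are not valid text.
    with open(path, "r", errors="replace") as handle:
        for number, line in enumerate(handle, start=1):
            line = line.rstrip("\n")
            if not line:
                continue
            count = len(line.split(delimiter))
            if count != expected_fields:
                problems.append((number, count))
    return problems
```

An empty result does not prove the file is complete, but any hit is a strong sign the download was corrupted and should be repeated.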

malihaaziz avatar Mar 08 '21 21:03 malihaaziz

Hi @maziz2,

This looks like an issue specific to your download. When I look at my file, this is what I see:

grep -A 2 CTC_RS10785,GCF_000007625.1,WP_035109085.1,876,303-876,574,COG0749,COG0749,1,570.0,1.0e-200,593,31-593 anvio/data/misc/COG/COG20/RAW_DATA_FROM_NCBI/cog-20.cog.csv

CTC_RS10785,GCF_000007625.1,WP_035109085.1,876,303-876,574,COG0749,COG0749,1,570.0,1.0e-200,593,31-593
SE1367,GCF_000007645.1,NP_764922.1,903,323-903,581,COG0749,COG0749,1,632.0,1.0e-200,593,2-593
CV_RS03810,GCF_000007705.1,WP_011134334.1,928,311-928,618,COG0749,COG0749,1,773.0,1.0e-200,593,2-593

Probably the file was corrupted during download and should be fixed if you re-run the program with the --reset flag.

Please let us know if you try that and succeed.

Best,

meren avatar Mar 09 '21 00:03 meren

Hi Dr. Eren

I'm not sure where my last post went. Anyway, yes, this is a corrupted download issue, which is why users succeed after multiple tries. wget from the NCBI FTP didn't work for me; it kept downloading corrupted files. I tried the rsync method, which worked beautifully:

rsync --copy-links --times --verbose rsync://ftp.ncbi.nlm.nih.gov/etc etc. RAW_DATA_FROM_NCBI/

malihaaziz avatar Mar 09 '21 14:03 malihaaziz

That's very interesting, @maziz2. Thank you very much for the heads up.

meren avatar Mar 09 '21 15:03 meren

@maziz2, @meren As you mentioned, I have downloaded the file (it shows an error all the time) using the following command:

rsync --copy-links --times --verbose rsync://ftp.ncbi.nlm.nih.gov/pub/COG/COG2020/data/cog-20.fa.gz /home/ga214/miniconda3/envs/anvio-7/lib/python3.6/site-packages/anvio/data/misc/COG/COG20/RAW_DATA_FROM_NCBI

But I do not know how to format the downloaded files. Please help me with this.

dineshkumarsrk avatar Apr 02 '21 05:04 dineshkumarsrk

You don't need to format anything; anvi'o does it all. The next step is to run the setup: anvi-setup-ncbi-cogs --cog-version COG20 -T 8 --debug

malihaaziz avatar Apr 02 '21 16:04 malihaaziz

@maziz2 Thank you for your time and help. I followed your instructions but ended up with the following error:

(anvio-7) ga214@ga:~$ anvi-setup-ncbi-cogs --cog-version COG20 -T 14 --debug
COG version ..................................: COG20
COG data source ..............................: The anvi'o default.
COG base directory ...........................: /home/ga214/miniconda3/envs/anvio-7/lib/python3.6/site-packages/anvio/data/misc/COG

WARNING
===============================================
This program will first check whether you have all the raw files, and then will
attempt to regenerate everything that is necessary from them.

Press ENTER to continue, or press CTRL + C to cancel...

Traceback (most recent call last):
  File "/home/ga214/miniconda3/envs/anvio-7/bin/anvi-setup-ncbi-cogs", line 47, in <module>
    setup.create()
  File "/home/ga214/miniconda3/envs/anvio-7/lib/python3.6/site-packages/anvio/cogs.py", line 617, in create
    self.setup_raw_data()
  File "/home/ga214/miniconda3/envs/anvio-7/lib/python3.6/site-packages/anvio/cogs.py", line 831, in setup_raw_data
    self.files[file_name]['func'](file_path, J(self.COG_data_dir, self.files[file_name]['formatted_file_name']))
  File "/home/ga214/miniconda3/envs/anvio-7/lib/python3.6/site-packages/anvio/cogs.py", line 659, in format_p_id_to_cog_id_cPickle
    p_id = fields[2].replace('.', '_')
IndexError: list index out of range

I also tried this and got another error,

(anvio-7) ga214@ga:~$ anvi-setup-ncbi-cogs --cog-version /home/ga214/miniconda3/envs/anvio-7/lib/python3.6/site-packages/anvio/data/misc/COG/COG20 --debug

Traceback for debugging
================================================================================
  File "/home/ga214/miniconda3/envs/anvio-7/bin/anvi-setup-ncbi-cogs", line 46, in <module>
    setup = COGsSetup(args)
  File "/home/ga214/miniconda3/envs/anvio-7/lib/python3.6/site-packages/anvio/cogs.py", line 513, in __init__
    raise ConfigError(f"The COG versions known to anvi'o do not include '{self.COG_version}' :/ This is "
================================================================================


Config Error: The COG versions known to anvi'o do not include                              
              '/home/ga214/miniconda3/envs/anvio-7/lib/python3.6/site-                     
              packages/anvio/data/misc/COG/COG20' :/ This is what we know of: COG14, COG20.
              This is one of those things that should have never happened. We salute you.  

Could you please help me in this regard.

dineshkumarsrk avatar Apr 03 '21 05:04 dineshkumarsrk

@dineshkumarsrk and others: if you are willing to help with this error you can switch to the active branch (explained here), run,

anvi-setup-ncbi-cogs --reset

And follow the instructions in the error message.

meren avatar May 13 '21 21:05 meren

@dineshkumarsrk I requested NCBI to generate checksums for all the files in their COG folder: https://ftp.ncbi.nlm.nih.gov/pub/COG/COG2020/data/checksums.md5.txt Please generate a checksum for the cog-20.fa.gz you downloaded and see if it matches what's in the file.

@meren It would be great if cogs.py downloaded and matched the checksums before processing cog-20.fa.gz. I would have updated the codebase myself, but I'm taking a machine learning course that is killing me (insert exploding head with tears), so I won't be able to test it thoroughly.
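
For anyone who wants to do the comparison by hand in the meantime, here is a minimal sketch. The function names are made up, and the layout of checksums.md5.txt is an assumption: the parser expects the usual md5sum style of one `<hex digest>  <file name>` pair per line, which I have not verified against the NCBI file.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def parse_md5_listing(text):
    """Parse md5sum-style lines into {file name: digest}.

    Assumes '<hex digest>  <file name>' per line; lines that do not split
    into two parts are skipped.
    """
    table = {}
    for line in text.splitlines():
        parts = line.split(None, 1)
        if len(parts) == 2:
            digest, name = parts
            # md5sum marks binary mode with a leading '*' on the file name.
            table[name.strip().lstrip("*")] = digest.lower()
    return table


def matches_listing(path, listing_text, name):
    """True if the MD5 of `path` equals the digest listed for `name`."""
    expected = parse_md5_listing(listing_text).get(name)
    return expected is not None and md5_of_file(path) == expected
```

Usage would be along the lines of `matches_listing("cog-20.fa.gz", open("checksums.md5.txt").read(), "cog-20.fa.gz")`. If the listing stores names with a path prefix, the `name` argument has to match that exactly.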

malihaaziz avatar May 24 '21 03:05 malihaaziz

Hello,

I have been trying to set up a COG20 database using docker (the latest version of anvi'o). The problem is that the "Formatting protein ids to COG ids" step was terminated at about 80% of the process, as shown below. I am wondering what I should do to fix this problem.

Thank you very much,

Siripong

The error at the "Formatting protein ids to COG ids" step:

:: anvi'o v7.1_main_0522 :: /Users/siripongtongjai/ST_Bioinformatics/ST_ANVIO_Work/TEST_20221212_PanGenomics >>> anvi-setup-ncbi-cogs --cog-data-dir /Users/siripongtongjai/ST_Bioinformatics/ST_ANVIO_Work/TEST_20221212_PanGenomics/cogs-data/ --num-threads 12 --just-do-it
COG version ..................................: COG20
COG data source ..............................: The command line parameter.
COG base directory ...........................: /Users/siripongtongjai/ST_Bioinformatics/ST_ANVIO_Work/TEST_20221212_PanGenomics/cogs-data

WARNING

This program will first check whether you have all the raw files, and then will attempt to regenerate everything that is necessary from them.

[12 Dec 22 22:51:12 Formatting protein ids to COG ids file] 80.24% ETA: 37s
Killed
:: anvi'o v7.1_main_0522 :: /Users/siripongtongjai/ST_Bioinformatics/ST_ANVIO_Work/TEST_20221212_PanGenomics >>>

sttongjai avatar Dec 12 '22 23:12 sttongjai

Hi @sttongjai,

This looks like a memory issue. It is possible that your docker containers are initiated with the default memory settings and you may need to increase max memory assigned to docker from the docker interface. Google should have good instructions for that :)

meren avatar Dec 13 '22 07:12 meren

Hi @meren,

Thank you very much for your advice. After increasing the memory to 20GB, things seem to be improving. However, I got a config error ('Error -3 while decompressing data: invalid stored block lengths') after PID-TO-CID.cPickle, CATEGORIES.txt and COG.txt were made. MISSING_COG_IDs.cPickle is still missing.

Config Error: Something went wrong while decompressing the downloaded file :/ It is likely that the download failed and only part of the file was downloaded. If you would like to try again, please run the setup command with the flag --reset. Here is what the downstream library said: 'Error -3 while decompressing data: invalid stored block lengths'.

I am not sure what was the cause of this issue. Any suggestions?

Thank you very much for a speedy reply.

Siripong

sttongjai avatar Dec 13 '22 19:12 sttongjai

I get the same error when trying to set up the COG database. Did someone manage to fix it?

peygadin avatar Mar 14 '23 18:03 peygadin

I believe this was addressed with PRs #2110 and #2112. Anyone using anvi'o v8 or later has access to this fix. For those using an earlier version of anvi'o, the resolution to most issues with anvi-setup-ncbi-cogs is to simply re-run it until it works, as described in #1738.

ivagljiva avatar Sep 29 '23 09:09 ivagljiva