run_dbcan icon indicating copy to clipboard operation
run_dbcan copied to clipboard

CGC-Finder only works with fna input

Open szzxpxf opened this issue 4 years ago • 20 comments

Hi there, thanks again for writing this useful tool!

I can run the CGC Finder without a problem with just fna files as input. However, when I tried providing my own protein sequences and the associated gff file, the CGC Finder output was blank. I tried using bed file as the [-c AuxillaryFile] and it still didn't work. I wonder if there is anything I am missing here.

Thank you!

szzxpxf avatar Jun 24 '20 18:06 szzxpxf

Same problem here. I tried with the E.coli example : run_dbcan.py --cluster E.coli.gff E.coli.faa protein --db_dir ~/Databases/db \ run_dbcan.py --cluster E.coli.bed E.coli.faa protein --db_dir ~/Databases/db \ run_dbcan.py --cluster E.coli.gff E.coli.fna meta --db_dir ~/Databases/db \ run_dbcan.py --cluster E.coli.bed E.coli.fna meta --db_dir ~/Databases/db

and the cgc.out file always comes empty

lalalagartija avatar Aug 12 '20 15:08 lalalagartija

Hi, Along those above lines, I was wondering whether it is possible to obtain cgc.out based on a previous run_dbcan.py output consisting of only:

diamond.out
hmmer.out
overview.txt
prodigal.gff
uniInput

That is, would there be an easy way to run only the following without re-running DIAMOND and HMMER?:

run_dbcan.py uniInput protein --out_dir output --cluster prodigal.gff --db_dir ~/databases/dbCAN/db

FranckLejzerowicz avatar Sep 10 '20 21:09 FranckLejzerowicz

I will check the stuffs during weekends. Thank you all for your feedback

linnabrown avatar Sep 10 '20 21:09 linnabrown

As for @lalalagartija, I end up with empty cgc.out. Here's what I did, with the latest version of run_dbcan.py:

  • First run:
run_dbcan.py contigs.fasta prok \
    --out_dir outdir \
    -t diamond hmmer \
    --db_dir ~/databases/dbCAN/db \
    --hmm_cpu 12 \
    --dia_cpu 12

Which produced the following files:

diamond.out
hmmer.out
overview.txt
prodigal.gff
uniInput
  • Then run:
# copy the file away as re-running overwrites (I just wanted to check for this behavior)
mv outdir/diamond.out outdir/diamond_1.out
mv outdir/hmmer.out outdir/hmmer_1.out
mv outdir/overview.txt outdir/overview_1.txt
mv outdir/uniInput outdir/uniInput_1

run_dbcan.py uniInput_1 protein  \
    --out_dir outdir \
    --db_dir ~/databases/dbCAN/db \
    --hmm_cpu 16 \
    --dia_cpu 16 \
    --cluster outdir/prodigal.gff

Which produced:

  17M Sep 11 22:42 cgc.gff
    0 Sep 11 22:42 cgc.out
 500K Sep 11 22:42 diamond_1.out
 500K Sep 11 22:42 diamond.out
 587K Sep 11 22:42 hmmer_1.out
 587K Sep 11 22:42 hmmer.out
 1.3M Sep 11 22:42 Hotpep.out
 421K Sep 11 22:42 overview_1.txt
 505K Sep 11 22:42 overview.txt
  68M Sep 11 22:42 prodigal.gff
 2.5M Sep 11 22:42 stp.out
 626K Sep 11 22:42 tf-1.out
 501K Sep 11 22:42 tf-2.out
 2.1M Sep 11 22:42 tp.out
  75M Sep 11 22:42 uniInput
  75M Sep 11 22:42 uniInput_1

See the empty cgc.out file.

Maybe it is wrong to re-run on the .gff file produced in the first run?

Thanks for the help and the great tool!

FranckLejzerowicz avatar Sep 12 '20 19:09 FranckLejzerowicz

Hi @linnabrown. Just wondering if you think you will have time to look at this. Otherwise, I would dig in the code and try to adapt a solution for my needs :) Thank you and hope things are going well for you.

FranckLejzerowicz avatar Oct 23 '20 20:10 FranckLejzerowicz

So strange. I used this command but I can have result in cgc.out

run_dbcan.py EscheriaColiK12MG1655.fna prok -c cluster --out_dir output_EscheriaColiK12MG1655

linnabrown avatar Oct 27 '20 03:10 linnabrown

And I don't have this issue like yours when protein data is input.

linnabrown avatar Oct 27 '20 04:10 linnabrown

Please try this command and let me know

python CGCFinder.py output_EscheriaColiK12MG1655/cgc.gff -o cgc.out -s tp -d 2

linnabrown avatar Oct 27 '20 04:10 linnabrown

I also had the same problem here, and the cgc.out file came empty. Then I checked my protein sequences and made sure that each protein has the same name in the corresponding gff file. Then I ran the following command:

run_dbcan.py ./proteins.faa protein -c ./my.gff --out_dir ./output0

Finally I got my results. Hope this feedback is helpful.

iaunicorn avatar Nov 23 '20 16:11 iaunicorn

@iaunicorn Thank you for the hint!

Indeed, in my .gff file the sequences IDs are labeled as such

NODE_1_length_998062_cov_18.990727	Prodigal_v2.6.3	CDS	30	830	52.0	+	0	;ID=1_1

whereas these IDs in the fasta file, the proteins are labeled with the ID# appended:

>NODE_1_length_998062_cov_18.990727_1 # 30 # 830 # 1 # ID=1_1;partial=00;start_type=ATG;rbs_motif=GGAG/GAGG;rbs_spacer=5-10bp;gc_cont=0.366

Did you edit your proteins.faa input so that the fasta header match exactly the .gff entries? i.e. in my case, leaving only >NODE_1_length_998062_cov_18.990727_1 (and without # 30 # 830 # 1 # ID=1_1;partial=00;start_type=ATG;rbs_motif=GGAG/GAGG;rbs_spacer=5-10bp;gc_cont=0.366 )? Prefer asking in advance as my input is very large... Thanks!

@linnabrown I did not get the chance to run the test commands you shared but I will to verify that my issue could be the proteins naming. Thanks again!

FranckLejzerowicz avatar Nov 23 '20 18:11 FranckLejzerowicz

Hi,

Thanks for using our tool. Our gff file should be gff3 format.

Best, Le

From: Lejzerowicz [email protected] Date: Monday, November 23, 2020 at 1:03 PM To: linnabrown/run_dbcan [email protected] Cc: Huang, Le [email protected], Mention [email protected] Subject: Re: [linnabrown/run_dbcan] CGC-Finder only works with fna input (#50)

@iaunicornhttps://github.com/iaunicorn Thank you for the hint!

Indeed, in my .gff file the sequences IDs are labeled as such

NODE_1_length_998062_cov_18.990727 Prodigal_v2.6.3 CDS 30 830 52.0 + 0 ;ID=1_1

whereas these IDs in the fasta file, the proteins are labeled with the ID# appended:

NODE_1_length_998062_cov_18.990727_1 # 30 # 830 # 1 # ID=1_1;partial=00;start_type=ATG;rbs_motif=GGAG/GAGG;rbs_spacer=5-10bp;gc_cont=0.366

Did you edit your proteins.faa input so that the fasta header match exactly the .gff entries? i.e. in my case, leaving only >NODE_1_length_998062_cov_18.990727_1 (and without # 30 # 830 # 1 # ID=1_1;partial=00;start_type=ATG;rbs_motif=GGAG/GAGG;rbs_spacer=5-10bp;gc_cont=0.366 )? Prefer asking in advance as my input is very large... Thanks!

@linnabrownhttps://github.com/linnabrown I did not get the chance to run the test commands you shared but I will to verify that my issue could be the proteins naming. Thanks again!

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHubhttps://github.com/linnabrown/run_dbcan/issues/50#issuecomment-732330139, or unsubscribehttps://github.com/notifications/unsubscribe-auth/ACMHALVIHR5TAX6JFC7AECDSRKPX5ANCNFSM4OHAQIQA.

linnabrown avatar Nov 23 '20 18:11 linnabrown

OK, it seems I have been using Prodigal V2.6.3, while GFF3 might only be an output of Prodigal.v3.0.0... Weird because I created a new conda env with it, following the install procedure conda create -n run_dbcan python=3.8 diamond hmmer prodigal -c conda-forge -c bioconda.

I can look it up, but do you know whether it is possible to convert gff to gff3 format? If yes, do you know of some scripts doing this?

Thanks.

FranckLejzerowicz avatar Nov 23 '20 18:11 FranckLejzerowicz

http://www.sequenceontology.org/cgi-bin/converter.cgi

From: Lejzerowicz [email protected] Date: Monday, November 23, 2020 at 1:17 PM To: linnabrown/run_dbcan [email protected] Cc: Huang, Le [email protected], Mention [email protected] Subject: Re: [linnabrown/run_dbcan] CGC-Finder only works with fna input (#50)

OK, it seems I have been using Prodigal V2.6.3, while GFF3 might only be an output of Prodigal.v3.0.0... Weird because I created a new conda env with it, following the install procedure conda create -n run_dbcan python=3.8 diamond hmmer prodigal -c conda-forge -c bioconda.

I can look it up, but do you know whether it is possible to convert gff to gff3 format? If yes, do you know of some scripts doing this?

Thanks.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHubhttps://github.com/linnabrown/run_dbcan/issues/50#issuecomment-732337997, or unsubscribehttps://github.com/notifications/unsubscribe-auth/ACMHALSFQZWBFKWRU7E6L2DSRKRKNANCNFSM4OHAQIQA.

linnabrown avatar Nov 23 '20 18:11 linnabrown

Answering my own question. To convert GFF to GFF3;

https://bmi.inf.ethz.ch/supplements/gff-tools

Best, F

FranckLejzerowicz avatar Nov 23 '20 18:11 FranckLejzerowicz

@FranckLejzerowicz Yes, I edited my proteins.faa input.

iaunicorn avatar Nov 23 '20 18:11 iaunicorn

Edit

See issue #78


Hello @linnabrown. Sorry to resurrect this old issue. I'm having the same problems with run_dbcan v.2.0.11, installed via conda following your instructions. Here are my commands and outputs:

installation

source /home/miniconda3/bin/activate
conda create -n dbcan2 python=3.8 diamond hmmer prodigal -c conda-forge -c bioconda
conda activate dbcan2
pip install run-dbcan==2.0.11

Databases were installed as outlined in the README.

.faa & .gff test

export OMP_NUM_THREADS=8
faa=/home/dbcan2/EscheriaColiK12MG1655.faa
gff=/home/dbcan2/EscheriaColiK12MG1655.gff
OUT=/workdir/dbcan2
mkdir -p $OUT
cd $OUT

source /home/miniconda3/bin/activate
conda activate dbcan2

run_dbcan.py $faa protein \
       --out_dir test_faa \
       -c $gff \
       -t all \
       --dia_cpu $OMP_NUM_THREADS \
       --hmm_cpu $OMP_NUM_THREADS \
       --hotpep_cpu $OMP_NUM_THREADS \
       --tf_cpu $OMP_NUM_THREADS \
       --db_dir /home/dbcan2/db

output

cgc.gff  cgc.out  overview.txt  stp.out  tf-1.out  tf-2.out  tp.out  uniInput

cgc.out is empty, and overview.txt only includes column names.

.fna test

export OMP_NUM_THREADS=8
fna=/home/dbcan2/EscheriaColiK12MG1655.fna
gff=/home/dbcan2/EscheriaColiK12MG1655.gff
OUT=/workdir/dbcan2
mkdir -p $OUT
cd $OUT

source /home/miniconda3/bin/activate
conda activate dbcan2

run_dbcan.py $fna prok \
       --out_dir test_fna \
       -c $gff \
       -t all \
       --dia_cpu $OMP_NUM_THREADS \
       --hmm_cpu $OMP_NUM_THREADS \
       --hotpep_cpu $OMP_NUM_THREADS \
       --tf_cpu $OMP_NUM_THREADS \
       --db_dir /home/dbcan2/db

output

cgc.gff  cgc.out  overview.txt  prodigal.gff  stp.out  tf-1.out  tf-2.out  tp.out  uniInput

As with the protein method, cgc.out and overview.txt are empty. Running the prok method without the -c $gff option gives fewer files and the same empty overview.txt.

The log files for both runs report zero errors. I've also attempted to run your program with the meta parameter using three different inputs: (1) metagenomic contigs, (2) metagenomic genes (.fna) annotated using prokka, and (3) metagenomic proteins (.faa and .gff3) annotated using prokka. As with the E. coli test files, the output included empty files and no clear errors.


Concerning your comment above...

So strange. I used this command but I can have result in cgc.out

run_dbcan.py EscheriaColiK12MG1655.fna prok -c cluster --out_dir output_EscheriaColiK12MG1655

What is the purpose of your use of -c with an .fna input? Your README seems to suggest -c should only be invoked with the proteome option:

[-c AuxillaryFile]- optional, include to enable CGCFinder. If using a proteome input, the AuxillaryFile must be a GFF or BED format file containing gene positioning information. Otherwise, the AuxillaryFile may be left blank.

acvill avatar Sep 30 '21 17:09 acvill

I also had the same problem here, and the cgc.out file came empty. Then I checked my protein sequences and made sure that each protein has the same name in the corresponding gff file. Then I ran the following command:

run_dbcan.py ./proteins.faa protein -c ./my.gff --out_dir ./output0

Finally I got my results. Hope this feedback is helpful.

@iaunicorn can you expand on what you mean by "each protein has the same name in the corresponding gff file"? Do you mean that each defline of the .faa file has a matching Name= attribute in field 9 of the .gff file? Is it the ID= attribute that needs to be changed, or protein_id=? Could you give an example of the correct format for a .faa+.gff pair?

acvill avatar Sep 30 '21 19:09 acvill

I also had the same problem here, and the cgc.out file came empty. Then I checked my protein sequences and made sure that each protein has the same name in the corresponding gff file. Then I ran the following command: run_dbcan.py ./proteins.faa protein -c ./my.gff --out_dir ./output0 Finally I got my results. Hope this feedback is helpful.

@iaunicorn can you expand on what you mean by "each protein has the same name in the corresponding gff file"? Do you mean that each defline of the .faa file has a matching Name= attribute in field 9 of the .gff file? Is it the ID= attribute that needs to be changed, or protein_id=? Could you give an example of the correct format for a .faa+.gff pair?

@acvill Yes, I mean each defline of the .faa file has a matching Name= attribute. Please see the following example: ###.faa file

TrA0997W MSLPVRTRDEGDDEGPPTSGLVDREEPSESLVGHEGRLEDGMLLRDFEERTTFYDAVAELQMTQTEAKLF ...... ###.gff file scaffold_1 JGI exon 3056341 3056677 . + . Name=TrA0997W;transcriptId=110090 scaffold_1 JGI CDS 3056341 3056677 . + 0 Name=TrA0997W;proteinId=109692;exonNumber=1;product_name="" ......

iaunicorn avatar Oct 01 '21 11:10 iaunicorn

I've found the issue with run_dbcan 2.0.11. Adding the optional -t all parameter causes the program to break. Exclusion of -t results in the correct output using the provided test files for both prok and protein inputs.

Works

run_dbcan.py EscheriaColiK12MG1655.faa protein -c EscheriaColiK12MG1655.gff --out_dir out_protein --db_dir /home/dbcan2/db
run_dbcan.py EscheriaColiK12MG1655.fna prok --out_dir out_prok --db_dir /home/dbcan2/db

Does not work

run_dbcan.py EscheriaColiK12MG1655.faa protein -c EscheriaColiK12MG1655.gff -t all --out_dir out_protein --db_dir /home/dbcan2/db
run_dbcan.py EscheriaColiK12MG1655.fna prok -t all --out_dir out_prok --db_dir /home/dbcan2/db

I will open a new issue referencing this thread.

acvill avatar Oct 01 '21 20:10 acvill