run_dbcan
run_dbcan copied to clipboard
CGC-Finder only works with fna input
Hi there, thanks again for writing this useful tool!
I can run the CGC Finder without a problem with just fna files as input. However, when I tried providing my own protein sequences and the associated gff file, the CGC Finder output was blank. I tried using bed file as the [-c AuxillaryFile] and it still didn't work. I wonder if there is anything I am missing here.
Thank you!
Same problem here. I tried with the E.coli example :
run_dbcan.py --cluster E.coli.gff E.coli.faa protein --db_dir ~/Databases/db \ run_dbcan.py --cluster E.coli.bed E.coli.faa protein --db_dir ~/Databases/db \ run_dbcan.py --cluster E.coli.gff E.coli.fna meta --db_dir ~/Databases/db \ run_dbcan.py --cluster E.coli.bed E.coli.fna meta --db_dir ~/Databases/db
and the cgc.out file always comes empty
Hi,
Along those above lines, I was wondering whether it is possible to obtain cgc.out
based on a previous run_dbcan.py
output consisting of only:
diamond.out
hmmer.out
overview.txt
prodigal.gff
uniInput
That is, would there be an easy way to run only the following without re-running DIAMOND and HMMER?:
run_dbcan.py uniInput protein --out_dir output --cluster prodigal.gff --db_dir ~/databases/dbCAN/db
I will check the stuffs during weekends. Thank you all for your feedback
As for @lalalagartija, I end up with empty cgc.out
. Here's what I did, with the latest version of run_dbcan.py
:
- First run:
run_dbcan.py contigs.fasta prok \
--out_dir outdir \
-t diamond hmmer \
--db_dir ~/databases/dbCAN/db \
--hmm_cpu 12 \
--dia_cpu 12
Which produced the following files:
diamond.out
hmmer.out
overview.txt
prodigal.gff
uniInput
- Then run:
# copy the file away as re-running overwrites (I just wanted to check for this behavior)
mv outdir/diamond.out outdir/diamond_1.out
mv outdir/hmmer.out outdir/hmmer_1.out
mv outdir/overview.txt outdir/overview_1.txt
mv outdir/uniInput outdir/uniInput_1
run_dbcan.py uniInput_1 protein \
--out_dir outdir \
--db_dir ~/databases/dbCAN/db \
--hmm_cpu 16 \
--dia_cpu 16 \
--cluster outdir/prodigal.gff
Which produced:
17M Sep 11 22:42 cgc.gff
0 Sep 11 22:42 cgc.out
500K Sep 11 22:42 diamond_1.out
500K Sep 11 22:42 diamond.out
587K Sep 11 22:42 hmmer_1.out
587K Sep 11 22:42 hmmer.out
1.3M Sep 11 22:42 Hotpep.out
421K Sep 11 22:42 overview_1.txt
505K Sep 11 22:42 overview.txt
68M Sep 11 22:42 prodigal.gff
2.5M Sep 11 22:42 stp.out
626K Sep 11 22:42 tf-1.out
501K Sep 11 22:42 tf-2.out
2.1M Sep 11 22:42 tp.out
75M Sep 11 22:42 uniInput
75M Sep 11 22:42 uniInput_1
See the empty cgc.out
file.
Maybe it is wrong to re-run on the .gff file produced in the first run?
Thanks for the help and the great tool!
Hi @linnabrown. Just wondering if you think you will have time to look at this. Otherwise, I would dig in the code and try to adapt a solution for my needs :) Thank you and hope things are going well for you.
So strange. I used this command but I can have result in cgc.out
run_dbcan.py EscheriaColiK12MG1655.fna prok -c cluster --out_dir output_EscheriaColiK12MG1655
And I don't have this issue like yours when protein data is input.
Please try this command and let me know
python CGCFinder.py output_EscheriaColiK12MG1655/cgc.gff -o cgc.out -s tp -d 2
I also had the same problem here, and the cgc.out file came empty. Then I checked my protein sequences and made sure that each protein has the same name in the corresponding gff file. Then I ran the following command:
run_dbcan.py ./proteins.faa protein -c ./my.gff --out_dir ./output0
Finally I got my results. Hope this feedback is helpful.
@iaunicorn Thank you for the hint!
Indeed, in my .gff file the sequences IDs are labeled as such
NODE_1_length_998062_cov_18.990727 Prodigal_v2.6.3 CDS 30 830 52.0 + 0 ;ID=1_1
whereas these IDs in the fasta file, the proteins are labeled with the ID# appended:
>NODE_1_length_998062_cov_18.990727_1 # 30 # 830 # 1 # ID=1_1;partial=00;start_type=ATG;rbs_motif=GGAG/GAGG;rbs_spacer=5-10bp;gc_cont=0.366
Did you edit your proteins.faa
input so that the fasta header match exactly the .gff entries?
i.e. in my case, leaving only >NODE_1_length_998062_cov_18.990727_1
(and without # 30 # 830 # 1 # ID=1_1;partial=00;start_type=ATG;rbs_motif=GGAG/GAGG;rbs_spacer=5-10bp;gc_cont=0.366
)?
Prefer asking in advance as my input is very large... Thanks!
@linnabrown I did not get the chance to run the test commands you shared but I will to verify that my issue could be the proteins naming. Thanks again!
Hi,
Thanks for using our tool. Our gff file should be gff3 format.
Best, Le
From: Lejzerowicz [email protected] Date: Monday, November 23, 2020 at 1:03 PM To: linnabrown/run_dbcan [email protected] Cc: Huang, Le [email protected], Mention [email protected] Subject: Re: [linnabrown/run_dbcan] CGC-Finder only works with fna input (#50)
@iaunicornhttps://github.com/iaunicorn Thank you for the hint!
Indeed, in my .gff file the sequences IDs are labeled as such
NODE_1_length_998062_cov_18.990727 Prodigal_v2.6.3 CDS 30 830 52.0 + 0 ;ID=1_1
whereas these IDs in the fasta file, the proteins are labeled with the ID# appended:
NODE_1_length_998062_cov_18.990727_1 # 30 # 830 # 1 # ID=1_1;partial=00;start_type=ATG;rbs_motif=GGAG/GAGG;rbs_spacer=5-10bp;gc_cont=0.366
Did you edit your proteins.faa input so that the fasta header match exactly the .gff entries? i.e. in my case, leaving only >NODE_1_length_998062_cov_18.990727_1 (and without # 30 # 830 # 1 # ID=1_1;partial=00;start_type=ATG;rbs_motif=GGAG/GAGG;rbs_spacer=5-10bp;gc_cont=0.366 )? Prefer asking in advance as my input is very large... Thanks!
@linnabrownhttps://github.com/linnabrown I did not get the chance to run the test commands you shared but I will to verify that my issue could be the proteins naming. Thanks again!
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHubhttps://github.com/linnabrown/run_dbcan/issues/50#issuecomment-732330139, or unsubscribehttps://github.com/notifications/unsubscribe-auth/ACMHALVIHR5TAX6JFC7AECDSRKPX5ANCNFSM4OHAQIQA.
OK, it seems I have been using Prodigal V2.6.3
, while GFF3 might only be an output of Prodigal.v3.0.0
...
Weird because I created a new conda env with it, following the install procedure conda create -n run_dbcan python=3.8 diamond hmmer prodigal -c conda-forge -c bioconda
.
I can look it up, but do you know whether it is possible to convert gff to gff3 format? If yes, do you know of some scripts doing this?
Thanks.
http://www.sequenceontology.org/cgi-bin/converter.cgi
From: Lejzerowicz [email protected] Date: Monday, November 23, 2020 at 1:17 PM To: linnabrown/run_dbcan [email protected] Cc: Huang, Le [email protected], Mention [email protected] Subject: Re: [linnabrown/run_dbcan] CGC-Finder only works with fna input (#50)
OK, it seems I have been using Prodigal V2.6.3, while GFF3 might only be an output of Prodigal.v3.0.0... Weird because I created a new conda env with it, following the install procedure conda create -n run_dbcan python=3.8 diamond hmmer prodigal -c conda-forge -c bioconda.
I can look it up, but do you know whether it is possible to convert gff to gff3 format? If yes, do you know of some scripts doing this?
Thanks.
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHubhttps://github.com/linnabrown/run_dbcan/issues/50#issuecomment-732337997, or unsubscribehttps://github.com/notifications/unsubscribe-auth/ACMHALSFQZWBFKWRU7E6L2DSRKRKNANCNFSM4OHAQIQA.
Answering my own question. To convert GFF to GFF3;
https://bmi.inf.ethz.ch/supplements/gff-tools
Best, F
@FranckLejzerowicz Yes, I edited my proteins.faa input.
Edit
See issue #78
Hello @linnabrown. Sorry to resurrect this old issue. I'm having the same problems with run_dbcan v.2.0.11, installed via conda following your instructions. Here are my commands and outputs:
installation
source /home/miniconda3/bin/activate
conda create -n dbcan2 python=3.8 diamond hmmer prodigal -c conda-forge -c bioconda
conda activate dbcan2
pip install run-dbcan==2.0.11
Databases were installed as outlined in the README.
.faa & .gff test
export OMP_NUM_THREADS=8
faa=/home/dbcan2/EscheriaColiK12MG1655.faa
gff=/home/dbcan2/EscheriaColiK12MG1655.gff
OUT=/workdir/dbcan2
mkdir -p $OUT
cd $OUT
source /home/miniconda3/bin/activate
conda activate dbcan2
run_dbcan.py $faa protein \
--out_dir test_faa \
-c $gff \
-t all \
--dia_cpu $OMP_NUM_THREADS \
--hmm_cpu $OMP_NUM_THREADS \
--hotpep_cpu $OMP_NUM_THREADS \
--tf_cpu $OMP_NUM_THREADS \
--db_dir /home/dbcan2/db
output
cgc.gff cgc.out overview.txt stp.out tf-1.out tf-2.out tp.out uniInput
cgc.out
is empty, and overview.txt
only includes column names.
.fna test
export OMP_NUM_THREADS=8
fna=/home/dbcan2/EscheriaColiK12MG1655.fna
gff=/home/dbcan2/EscheriaColiK12MG1655.gff
OUT=/workdir/dbcan2
mkdir -p $OUT
cd $OUT
source /home/miniconda3/bin/activate
conda activate dbcan2
run_dbcan.py $fna prok \
--out_dir test_fna \
-c $gff \
-t all \
--dia_cpu $OMP_NUM_THREADS \
--hmm_cpu $OMP_NUM_THREADS \
--hotpep_cpu $OMP_NUM_THREADS \
--tf_cpu $OMP_NUM_THREADS \
--db_dir /home/dbcan2/db
output
cgc.gff cgc.out overview.txt prodigal.gff stp.out tf-1.out tf-2.out tp.out uniInput
As with the protein
method, cgc.out
and overview.txt
are empty. Running the prok
method without the -c $gff
option gives fewer files and the same empty overview.txt
.
The log files for both runs report zero errors. I've also attempted to run your program with the meta
parameter using three different inputs: (1) metagenomic contigs, (2) metagenomic genes (.fna) annotated using prokka, and (3) metagenomic proteins (.faa and .gff3) annotated using prokka. As with the E. coli test files, the output included empty files and no clear errors.
Concerning your comment above...
So strange. I used this command but I can have result in cgc.out
run_dbcan.py EscheriaColiK12MG1655.fna prok -c cluster --out_dir output_EscheriaColiK12MG1655
What is the purpose of your use of -c
with an .fna input? Your README seems to suggest -c
should only be invoked with the proteome
option:
[-c AuxillaryFile]- optional, include to enable CGCFinder. If using a proteome input, the AuxillaryFile must be a GFF or BED format file containing gene positioning information. Otherwise, the AuxillaryFile may be left blank.
I also had the same problem here, and the cgc.out file came empty. Then I checked my protein sequences and made sure that each protein has the same name in the corresponding gff file. Then I ran the following command:
run_dbcan.py ./proteins.faa protein -c ./my.gff --out_dir ./output0
Finally I got my results. Hope this feedback is helpful.
@iaunicorn can you expand on what you mean by "each protein has the same name in the corresponding gff file"? Do you mean that each defline of the .faa file has a matching Name=
attribute in field 9 of the .gff file? Is it the ID=
attribute that needs to be changed, or protein_id=
? Could you give an example of the correct format for a .faa+.gff pair?
I also had the same problem here, and the cgc.out file came empty. Then I checked my protein sequences and made sure that each protein has the same name in the corresponding gff file. Then I ran the following command: run_dbcan.py ./proteins.faa protein -c ./my.gff --out_dir ./output0 Finally I got my results. Hope this feedback is helpful.
@iaunicorn can you expand on what you mean by "each protein has the same name in the corresponding gff file"? Do you mean that each defline of the .faa file has a matching
Name=
attribute in field 9 of the .gff file? Is it theID=
attribute that needs to be changed, orprotein_id=
? Could you give an example of the correct format for a .faa+.gff pair?
@acvill Yes, I mean each defline of the .faa file has a matching Name=
attribute. Please see the following example:
###.faa file
TrA0997W MSLPVRTRDEGDDEGPPTSGLVDREEPSESLVGHEGRLEDGMLLRDFEERTTFYDAVAELQMTQTEAKLF ...... ###.gff file scaffold_1 JGI exon 3056341 3056677 . + . Name=TrA0997W;transcriptId=110090 scaffold_1 JGI CDS 3056341 3056677 . + 0 Name=TrA0997W;proteinId=109692;exonNumber=1;product_name="" ......
I've found the issue with run_dbcan 2.0.11. Adding the optional -t all
parameter causes the program to break. Exclusion of -t
results in the correct output using the provided test files for both prok
and protein
inputs.
Works
run_dbcan.py EscheriaColiK12MG1655.faa protein -c EscheriaColiK12MG1655.gff --out_dir out_protein --db_dir /home/dbcan2/db
run_dbcan.py EscheriaColiK12MG1655.fna prok --out_dir out_prok --db_dir /home/dbcan2/db
Does not work
run_dbcan.py EscheriaColiK12MG1655.faa protein -c EscheriaColiK12MG1655.gff -t all --out_dir out_protein --db_dir /home/dbcan2/db
run_dbcan.py EscheriaColiK12MG1655.fna prok -t all --out_dir out_prok --db_dir /home/dbcan2/db
I will open a new issue referencing this thread.