run_dbcan Diamond+CAZY analysis for mRNA reads

Diamond+CAZY analysis for mRNA reads

Open sumitra20 opened this issue 1 year ago • 6 comments

Dear Developers,

I am trying to do CAZymes analysis using my mRNA reads. I downloaded a copy of the database (CAZyDB.08062022.fa), indexed it to Diamond format, and then performed a diamond blast. I've also downloaded the mapping file (CAZyDB.08062022.fam.subfam.ec.txt) but now I'm really confused about how to proceed with the annotation and getting the annotation summary out from the generated diamond output. Can I perform it using the MEGAN6 tool? I'd really appreciate it if you could give me some ideas on how to proceed. I'm very new to bioinformatic analysis and I'm struggling to understand the analysis pipelines.

#download CaZyme databse
wget http://bcb.unl.edu/dbCAN2/download/Databases/V11/CAZyDB.08062022.fa

#index it to diamond format
/home/combio7/diamond-2.0.12/bin/diamond makedb --in ./CAZyDB.08062022.fa -d CAZy

#diamond_blast

/home/combio7/diamond-2.0.12/bin/diamond blastx -d CAZy.dmnd -q /media/combio7/8TB_Backup/sumitra_130122/cycle_5_30822/5_sortme_cycle5/R33P5_R_sortme_non_rRNA.fq -o ./R33P5_R_cazyme.out -b 20.0 -f 6 -e 1e-3

#donwloading mapping file (not sure which to use)
wget https://bcb.unl.edu/dbCAN2/download/Databases/dbCAN-old@UGA/CAZyDB-ec-info.txt.07-20-2017

wget https://bcb.unl.edu/dbCAN2/download/Databases/V11/CAZyDB.08062022.fam.subfam.ec.txt

Thank you

Sep 06 '22 14:09 sumitra20

You can use the 2022 ec file. As to your question on what's next, that depends on your research aims. Very often people will take the diamond output to calculate an abundance (e.g., FPKM) value for each cazyme family. The cazyme family can be connected to EC and substrates, so that they can answer questions, e.g., what families are more highly expressed than others in what conditions/samples, or what substrate degradation is more active, etc. I've never used MEGAN, but I know it can compute the abundance info. Other choices include HUMANN3, but I don't know if they have CAZyme database integrated in their pipeline. You can look into https://github.com/AnantharamanLab/METABOLIC, which has dbCAN integrated.

Yanbin

Sep 07 '22 18:09 yinlabniu

Hi Yanbin,

Thank you so much for the suggestions given. I'll look into METABOLIC as well.

Sep 10 '22 07:09 sumitra20

Hi @yinlabniu

hi @yinlabniu ,

Sorry, i might have another question. As i mentioned before i ran a blastx for my nucleotide sequence against the CAZY database and below is the snapshot of how the output looks like

However, when i try analyzing my .fasta (nucleotide) file using run_dbcan tool i dont seem to get any put for diamond, i only get output for eCAMI. I have tried running with both -meta and -prok. Any idea why?

run_dbcan /media/combio7/8TB_Backup/sumitra_130122/CAZyme/megan6_output/bacteria_archea_R11P4_BS_phylum.fasta meta --out_dir output2_megan -t diamond eCAMI --dia_cpu 8

run_dbcan /media/combio7/8TB_Backup/sumitra_130122/CAZyme/megan6_output/bacteria_archea_R11P4_BS_phylum.fasta prok --out_dir output23_megan -t diamond eCAMI --dia_cpu 8

output:

Thank you

Sep 27 '22 06:09 sumitra20

Hi @sumitra20 . If there is no output diamond column, it means that your input sequences do not have matched annotations from databases searched against by DIAMOND software.

This command is quite right. You can try this command in the example and it prompts that our program is only running diamond and eCAMI software currently.

run_dbcan EscheriaColiK12MG1655.fna prok --out_dir output_EscheriaColiK12MG1655 -t diamond eCAMI


***************************1. DIAMOND start*************************************************


diamond v2.0.15.153 (C) Max Planck Society for the Advancement of Science
Documentation, support and updates available at http://www.diamondsearch.org
Please cite: http://dx.doi.org/10.1038/s41592-021-01101-x Nature Methods (2021)

#CPU threads: 4
Scoring parameters: (Matrix=BLOSUM62 Lambda=0.267 K=0.041 Penalties=11/1)
Temporary directory: output_EscheriaColiK12MG1655
#Target sequences to report alignments for: 1
Opening the database...  [0.078s]
Database: db/CAZy.dmnd (type: Diamond database, sequences: 2161786, letters: 1024499689)
Block size = 2000000000
Opening the input file...  [0s]
Opening the output file...  [0s]
Loading query sequences...  [0.004s]
Masking queries...  [0.012s]
Building query seed set...  [0.067s]
Algorithm: Query-indexed
Building query histograms...  [0.001s]
Allocating buffers...  [0s]
Loading reference sequences...  [1.649s]
Initializing temporary storage...  [0s]
Building reference histograms...  [3.259s]
Allocating buffers...  [0s]
Processing query block 1, reference block 1/1, shape 1/2.
Building reference seed array...  [1.902s]
Building query seed array...  [0.002s]
Computing hash join...  [0.363s]
Searching alignments...  [2.014s]
Processing query block 1, reference block 1/1, shape 2/2.
Building reference seed array...  [1.811s]
Building query seed array...  [0.002s]
Computing hash join...  [0.349s]
Searching alignments...  [2.235s]
Deallocating buffers...  [0.055s]
Clearing query masking...  [0s]
Computing alignments...  [4.846s]
Deallocating reference...  [0.025s]
Loading reference sequences...  [0s]
Deallocating buffers...  [0s]
Deallocating queries...  [0s]
Loading query sequences...  [0s]
Closing the input file...  [0s]
Closing the output file...  [0s]
Cleaning up...  [0s]
Total time = 18.736s
Reported 197 pairwise alignments, 197 HSPs.
197 queries aligned.


***************************1. DIAMOND end***************************************************




***************************3. eCAMI start***************************************************


Using CAZyme db in eCAMI
total time:346.363926s


***************************3. eCAMI end***************************************************


Preparing overview table from hmmer, eCAMI and diamond output...
overview table complete. Saved as output_EscheriaColiK12MG1655/overview.txt

Sep 27 '22 20:09 linnabrown

Hey @yinlabniu ,

Yes, running the test file EscheriaColiK12MG1655.fna works. But i just find it strange as to why i can get diamond annotation output when i run

diamond blastx -q /R33P5_R_sortme_non_rRNA.fq -o ./R33P5_R_cazyme.out -b 20.0 -f 6 -e 1e-3 But no any matched annotation when i use the exact same input file with run_dbcan. Is it bcs of the difference in databse used?

Sep 28 '22 00:09 sumitra20

It could be the evalue. You used 1e-3, but within run_dbcan we have set the default evalue 1e-102. Since you are working with short reads, the default is certainly too stringent.

Yanbin

From: sumitra20 @.> Sent: Tuesday, September 27, 2022 7:34 PM To: linnabrown/run_dbcan @.> Cc: Yanbin Yin @.>; Mention @.> Subject: Re: [linnabrown/run_dbcan] Diamond+CAZY analysis for mRNA reads (Issue #103)

Non-NU Email

Hey @yinlabniuhttps://urldefense.com/v3/__https://github.com/yinlabniu__;!!PvXuogZ4sRB2p-tU!EYm1OEj6mvztKleFSVl-Xy_o2d3Lui7lmksd9OZXvKKp9RuBY-1GoviJegc6a0CRG3NKyTvEdkPRGGIkYkHKnA$ ,

Yes, running the test file EscheriaColiK12MG1655.fna works. But i just find it strange as to why i can get diamond annotation output when i run

diamond blastx -q /R33P5_R_sortme_non_rRNA.fq -o ./R33P5_R_cazyme.out -b 20.0 -f 6 -e 1e-3 But no any matched annotation when i use the exact same input file with run_dbcan. Is it bcs of the difference in databse used?

— Reply to this email directly, view it on GitHubhttps://urldefense.com/v3/__https://github.com/linnabrown/run_dbcan/issues/103*issuecomment-1260240789__;Iw!!PvXuogZ4sRB2p-tU!EYm1OEj6mvztKleFSVl-Xy_o2d3Lui7lmksd9OZXvKKp9RuBY-1GoviJegc6a0CRG3NKyTvEdkPRGGK2KxJI0A$, or unsubscribehttps://urldefense.com/v3/__https://github.com/notifications/unsubscribe-auth/AEXNKZTK4OTPI2DGJ7Z22C3WAOHCVANCNFSM6AAAAAAQF4HWM4__;!!PvXuogZ4sRB2p-tU!EYm1OEj6mvztKleFSVl-Xy_o2d3Lui7lmksd9OZXvKKp9RuBY-1GoviJegc6a0CRG3NKyTvEdkPRGGIHeHhcqw$. You are receiving this because you were mentioned.Message ID: @.***>

Sep 28 '22 02:09 yinlabniu

run_dbcan run_dbcan copied to clipboard

Diamond+CAZY analysis for mRNA reads

run_dbcan
run_dbcan copied to clipboard