
[BUG] Most annotations from anvi-run-cazymes have undefined ('-') accession numbers

Open ivagljiva opened this issue 2 years ago • 5 comments

Short description of the problem

When we annotate a contigs database with anvi-run-cazymes, most of the resulting hits in the gene functions table have an undefined accession value ('-'):

image

This issue does not come from our code but from the structure of the CAZyme HMM profiles: the undefined accessions are already present in the hmmscan output, as we can see when we run the program with --debug and check the temp output files:

$ head /var/folders/1n/2s6d_kq53pv9js812zwcljq80000gn/T/tmpiuhuh68p/hmm.table.fixed
438	-	CBM32.hmm	-	2.2e-34	114.8	1.7	1.6e-21	73.2	0.6	2.4	2	0	0	2	2	2	2	-
847	-	CBM32.hmm	-	1.7e-33	111.9	0.0	5e-21	71.6	0.0	2.6	2	0	0	2	2	2	2	-
871	-	CBM32.hmm	-	4.6e-27	91.1	0.3	1.5e-26	89.4	0.3	1.9	1	0	0	1	1	1	1	-
346	-	CBM32.hmm	-	3.4e-26	88.3	7.5	1.8e-25	86.0	2.8	2.6	2	0	0	2	2	2	1	-
892	-	CBM32.hmm	-	2.1e-25	85.7	0.0	4.5e-25	84.7	0.0	1.5	1	0	0	1	1	1	1	-
1069	-	CBM32.hmm	-	1.3e-14	50.9	5.2	2.4e-12	43.6	0.1	3.9	2	1	1	3	3	3	1	-
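To get a quick sense of how widespread this is in a given run, the table above can be tallied with a few lines of Python. This is only a sketch: the function name is made up, and the column positions (profile name in the third field, profile accession in the fourth, tab-delimited) are an assumption read off the excerpt above, not taken from the anvi'o code.

```python
from collections import Counter


def count_undefined_accessions(lines, name_col=2, acc_col=3):
    """Count hits per profile whose accession field is '-'.

    Column indices are assumptions based on the hmm.table.fixed
    excerpt in this issue (0-based, tab-delimited).
    """
    counts = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # Skip blank or truncated lines rather than failing on them.
        if len(fields) <= max(name_col, acc_col):
            continue
        if fields[acc_col] == "-":
            counts[fields[name_col]] += 1
    return counts


if __name__ == "__main__":
    # Two rows copied (and shortened) from the excerpt above.
    rows = [
        "438\t-\tCBM32.hmm\t-\t2.2e-34\t114.8",
        "847\t-\tCBM32.hmm\t-\t1.7e-33\t111.9",
    ]
    for profile, n in count_undefined_accessions(rows).items():
        print(profile, n)
```

In practice you would pass an open file handle for the temp table instead of the hard-coded rows.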

A related issue is that the function definition column contains the enzyme class names rather than the actual annotations (ie, CBM32.hmm when it would be more useful to see Carbohydrate-Binding Module Family 32, or GH73.hmm when it would be more useful to see Glycoside Hydrolase Family 73). With the '.hmm' extension after the class ID number, these also look like filenames rather than annotations.

The lack of unique accessions is a problem for anyone who wants to run anvi-estimate-metabolism with user-defined pathways using the CAZy database as a functional annotation source, because that requires a unique accession number to match each enzyme annotation to its pathway. It may also affect other downstream programs that rely on accession numbers like anvi-display-functions.

Expected behavior and suggested solution

In my opinion, the expected behavior here would be to use the ID number of the CAZy class (ie, GH73) as the accession number for these annotations to make them 1) usable/searchable in downstream applications and 2) consistent across all CAZyme annotations.

In practice, it is more complicated because some of the profiles seem to have accessions already - for instance, the profile for GT2_Glycos_transf_2 has the accession PF00535.25 (which seems to be its corresponding Pfam accession number), and the closely related GT2_Glyco_tranf_2_2 profile has the corresponding Pfam accession PF10111.8, etc. But since those accessions are coming from different databases (ie, Pfam, not CAZy), I think we should change every single CAZyme annotation to use the CAZy ID number as an accession, and if there is already an accession in place from Pfam or wherever, we can append that alternative accession to the end of the function definition string.

Second, having the HMM profile filename as the function definition is completely useless. We should replace it with the actual annotation that gives people a better idea of what the protein is doing rather than forcing them to go look up the CAZy class online.

Here is an example of what I suggest the CAZyme annotations should look like, in the case that there is not an existing accession number:

accession    function
CBM32        Carbohydrate-Binding Module Family 32
GH73         Glycoside Hydrolase Family 73

And here is what they would look like in the case where there is an existing accession number:

accession              function
GT2_Glycos_transf_2    GlycosylTransferase Family 2 (PF00535.25)
GT2_Glyco_tranf_2_2    GlycosylTransferase Family 2 (PF10111.8)
GT2_Glyco_tranf_2_3    GlycosylTransferase Family 2 (PF13641.5)

Since the CAZyme HMM profiles seem to be not set up very nicely, I guess the best way to implement these changes would be:

  1. find some way to map the CAZy class ID to its full definition, probably by creating a file during the runtime of anvi-setup-cazymes that could be later read into a dictionary during anvi-run-cazymes for creating the definition string
  2. do some post-processing of the HMMER results in cazyme.py to a) parse the current 'definition' to remove the '.hmm' extension and set that as the 'accession' instead, b) match that accession to the human-readable name of the CAZy class to make the new 'definition' string and c) append any existing accession to the end of the 'definition' string
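A rough sketch of what step 2 could look like is below. It is not the code in cazyme.py: `reannotate_cazyme_hit` and `CAZY_CLASS_NAMES` are hypothetical names, and instead of reading a mapping file written by anvi-setup-cazymes (step 1), this version simply derives the class name from the standard CAZy class prefix (GH, GT, PL, CE, AA, CBM). Subfamily or Pfam-style suffixes in the profile name stay in the accession but are ignored when building the definition.

```python
import re

# Standard CAZy class prefixes and their full names.
CAZY_CLASS_NAMES = {
    "GH": "Glycoside Hydrolase",
    "GT": "GlycosylTransferase",
    "PL": "Polysaccharide Lyase",
    "CE": "Carbohydrate Esterase",
    "AA": "Auxiliary Activity",
    "CBM": "Carbohydrate-Binding Module",
}

# Class prefix followed by the family number, e.g. 'CBM32' or 'GT2_...'.
FAMILY_RE = re.compile(r"^(CBM|GH|GT|PL|CE|AA)(\d+)")


def reannotate_cazyme_hit(profile_name, profile_accession="-"):
    """Return (accession, definition) for one HMMER hit.

    profile_name:      name reported by HMMER, e.g. 'CBM32.hmm'
    profile_accession: accession reported by HMMER, '-' if undefined
    """
    # a) strip the '.hmm' extension and use the rest as the accession
    if profile_name.endswith(".hmm"):
        accession = profile_name[: -len(".hmm")]
    else:
        accession = profile_name

    # b) build a human-readable definition from the class prefix
    match = FAMILY_RE.match(accession)
    if match:
        prefix, family = match.groups()
        definition = f"{CAZY_CLASS_NAMES[prefix]} Family {family}"
    else:
        # Unrecognized name: fall back to the accession itself.
        definition = accession

    # c) append any pre-existing (e.g. Pfam) accession
    if profile_accession and profile_accession != "-":
        definition += f" ({profile_accession})"

    return accession, definition
```

For example, `reannotate_cazyme_hit("GT2_Glycos_transf_2.hmm", "PF00535.25")` gives the accession `GT2_Glycos_transf_2` with the definition `GlycosylTransferase Family 2 (PF00535.25)`, matching the second table above.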

@mschecht , I would like to hear what you think about this. I am happy to work on implementing the solution if you don't currently have the bandwidth. :)

I was hoping to use CAZymes in a user-defined pathway in my upcoming tutorial, which at this point is sadly impossible, but regardless, it would be nice to make this annotation source usable with the metabolism framework by our next minor release.

anvi'o version

Anvi'o .......................................: marie (v8-dev)
Python .......................................: 3.10.13

Current version of CAZy database is V11.

System info

macOSX Sonoma 14.0

ivagljiva, Oct 15 '23 13:10

One more thing that I noticed is that there is no sanity check for re-running anvi-run-cazymes on a database that has already been annotated with CAZy, which means that existing annotations are automatically overwritten. This is a separate issue, but could be addressed alongside those mentioned above.

ivagljiva, Oct 15 '23 14:10

Hi @ivagljiva ,

The CAZy hmm profiles use the NAME field for both name and for the CAZy accession number. Tbh, there isn't much more that you can get out of the CAZy families given their breadth.

They also reserve the ACC and DESC fields for cross-references, which only 9 profiles (those that were incorporated into Pfam) have in the current V12.
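If anyone wants to check this on their own copy of the profiles, the header fields can be listed with a short script. This is a minimal sketch that relies only on the HMMER3 profile layout (each profile has a `NAME` line and optional `ACC` and `DESC` lines); the function name is made up and the example header below is a trimmed, illustrative excerpt rather than a real file.

```python
def read_profile_headers(lines):
    """Collect NAME, ACC and DESC for every profile in an HMMER3 file.

    Missing ACC or DESC fields are reported as None.
    """
    profiles = []
    for line in lines:
        # Header lines look like 'NAME  CBM32': a tag, spaces, a value.
        tag, _, value = line.strip().partition(" ")
        value = value.strip()
        if tag == "NAME":
            profiles.append({"NAME": value, "ACC": None, "DESC": None})
        elif tag in ("ACC", "DESC") and profiles:
            profiles[-1][tag] = value
    return profiles


if __name__ == "__main__":
    # Illustrative, heavily trimmed headers for two profiles.
    example = [
        "HMMER3/f [3.1b2 | February 2015]",
        "NAME  CBM32",
        "//",
        "HMMER3/f [3.1b2 | February 2015]",
        "NAME  GT2_Glycos_transf_2",
        "ACC   PF00535.25",
        "//",
    ]
    for profile in read_profile_headers(example):
        print(profile["NAME"], profile["ACC"] or "-")
```

Counting the entries where `ACC` is not None would show how many profiles carry a cross-reference.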

xvazquezc, Oct 15 '23 23:10

Thanks for the insight @xvazquezc . I think my recommendations for post-processing of the hits still make sense given that information. If we are annotating our gene calls with CAZy, we want the CAZy accession numbers to be in the accession field regardless of whether the ACC field in the profile is used for cross-referencing other databases.

ivagljiva, Oct 16 '23 07:10

Thanks for incorporating anvi-run-cazymes into your anvi'o metabolism framework and finding ways to optimize the program!

> In my opinion, the expected behavior here would be to use the ID number of the CAZy class (ie, GH73) as the accession number for these annotations to make them 1) usable/searchable in downstream applications and 2) consistent across all CAZyme annotations.

I completely agree with your suggestions and thanks for taking the time to document a programmatic solution - I'll take a swing at this! I've already tagged this issue in #2099 because dbCAN3 might have solved this accession issue.

mschecht, Oct 16 '23 13:10

you are the best @mschecht, thank you! :)

ivagljiva, Oct 16 '23 13:10