[BUG] Most annotations from anvi-run-cazymes have undefined ('-') accession numbers
Short description of the problem
When we annotate a contigs database with anvi-run-cazymes, most of the resulting hits in the gene functions table have '-' as their accession value.
The issue does not come from our code but from the structure of the CAZyme HMM profiles: the same undefined accessions appear in the hmmscan output when we run the program with --debug and check the temp output files:
$ head /var/folders/1n/2s6d_kq53pv9js812zwcljq80000gn/T/tmpiuhuh68p/hmm.table.fixed
438 - CBM32.hmm - 2.2e-34 114.8 1.7 1.6e-21 73.2 0.6 2.4 2 0 0 2 2 2 2 -
847 - CBM32.hmm - 1.7e-33 111.9 0.0 5e-21 71.6 0.0 2.6 2 0 0 2 2 2 2 -
871 - CBM32.hmm - 4.6e-27 91.1 0.3 1.5e-26 89.4 0.3 1.9 1 0 0 1 1 1 1 -
346 - CBM32.hmm - 3.4e-26 88.3 7.5 1.8e-25 86.0 2.8 2.6 2 0 0 2 2 2 1 -
892 - CBM32.hmm - 2.1e-25 85.7 0.0 4.5e-25 84.7 0.0 1.5 1 0 0 1 1 1 1 -
1069 - CBM32.hmm - 1.3e-14 50.9 5.2 2.4e-12 43.6 0.1 3.9 2 1 1 3 3 3 1 -
A related issue is that the function definition column contains the enzyme class names rather than the actual annotations (i.e., CBM32.hmm when it would be more useful to see Carbohydrate-Binding Module Family 32, or GH73.hmm when it would be more useful to see Glycoside Hydrolase Family 73). With the '.hmm' extension after the class ID number, these also look like filenames rather than annotations.
The lack of unique accessions is a problem for anyone who wants to run anvi-estimate-metabolism with user-defined pathways using the CAZy database as a functional annotation source, because that requires a unique accession number to match each enzyme annotation to its pathway. It may also affect other downstream programs that rely on accession numbers, such as anvi-display-functions.
Expected behavior and suggested solution
In my opinion, the expected behavior here would be to use the ID number of the CAZy class (i.e., GH73) as the accession number for these annotations to make them 1) usable/searchable in downstream applications and 2) consistent across all CAZyme annotations.
In practice, it is more complicated because some of the profiles seem to have accessions already. For instance, the profile for GT2_Glycos_transf_2 has the accession PF00535.25 (which seems to be its corresponding Pfam accession number), and the closely related GT2_Glyco_tranf_2_2 profile has the corresponding Pfam accession PF10111.8, etc. But since those accessions are coming from different databases (i.e., Pfam, not CAZy), I think we should change every single CAZyme annotation to use the CAZy ID number as an accession, and if there is already an accession in place from Pfam or elsewhere, we can append that alternative accession to the end of the function definition string.
Second, having the HMM profile filename as the function definition is completely useless. We should replace it with the actual annotation that gives people a better idea of what the protein is doing rather than forcing them to go look up the CAZy class online.
Here is an example of what I suggest the CAZyme annotations should look like, in the case that there is not an existing accession number:
| accession | function |
|---|---|
| CBM32 | Carbohydrate-Binding Module Family 32 |
| GH73 | Glycoside Hydrolase Family 73 |
And here is what they would look like in the case where there is an existing accession number:
| accession | function |
|---|---|
| GT2_Glycos_transf_2 | GlycosylTransferase Family 2 (PF00535.25) |
| GT2_Glyco_tranf_2_2 | GlycosylTransferase Family 2 (PF10111.8) |
| GT2_Glyco_tranf_2_3 | GlycosylTransferase Family 2 (PF13641.5) |
Since the CAZyme HMM profiles seem to be not set up very nicely, I guess the best way to implement these changes would be:
- find some way to map the CAZy class ID to its full definition, probably by creating a file during the runtime of anvi-setup-cazymes that could later be read into a dictionary during anvi-run-cazymes for creating the definition string
- do some post-processing of the HMMER results in cazyme.py to a) parse the current 'definition' to remove the '.hmm' extension and set that as the 'accession' instead, b) match that accession to the human-readable name of the CAZy class to make the new 'definition' string and c) append any existing accession to the end of the 'definition' string
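To make steps a) through c) concrete, here is a rough sketch of the post-processing, not actual anvi'o code. The function name `reannotate` and the hard-coded `CAZY_CLASS_NAMES` dictionary are hypothetical stand-ins. In the real implementation the mapping would presumably come from the file written during setup, and family names might need finer handling than this (for example, subfamilies).

```python
import re

# Hypothetical stand-in for the class-name mapping that would be read from a
# file created during setup. Keys are CAZy class prefixes.
CAZY_CLASS_NAMES = {
    "GH": "Glycoside Hydrolase",
    "GT": "GlycosylTransferase",
    "PL": "Polysaccharide Lyase",
    "CE": "Carbohydrate Esterase",
    "AA": "Auxiliary Activity",
    "CBM": "Carbohydrate-Binding Module",
}


def reannotate(profile_name, existing_accession="-"):
    """Return (accession, function) for a single hit (illustrative only)."""
    # a) strip the '.hmm' extension and use the remainder as the accession
    if profile_name.endswith(".hmm"):
        accession = profile_name[: -len(".hmm")]
    else:
        accession = profile_name

    # b) build a human-readable definition from the class prefix and the
    #    family number; fall back to the accession if the pattern is unknown
    match = re.match(r"([A-Z]+)(\d+)", accession)
    if match and match.group(1) in CAZY_CLASS_NAMES:
        function = f"{CAZY_CLASS_NAMES[match.group(1)]} Family {match.group(2)}"
    else:
        function = accession

    # c) append any pre-existing accession (e.g., from Pfam) in parentheses
    if existing_accession and existing_accession != "-":
        function += f" ({existing_accession})"

    return accession, function


print(reannotate("CBM32.hmm"))
print(reannotate("GT2_Glycos_transf_2.hmm", "PF00535.25"))
```

With this logic, the two calls reproduce the CBM32 row and the first GT2 row of the tables above.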
@mschecht , I would like to hear what you think about this. I am happy to work on implementing the solution if you don't currently have the bandwidth. :)
I was hoping to use CAZymes in a user-defined pathway in my upcoming tutorial, which at this point is sadly impossible, but regardless, it would be nice to make this annotation source usable with the metabolism framework by our next minor release.
anvi'o version
Anvi'o .......................................: marie (v8-dev)
Python .......................................: 3.10.13
Current version of CAZy database is V11.
System info
macOSX Sonoma 14.0
One more thing that I noticed is that there is no sanity check for re-running anvi-run-cazymes on a database that has already been annotated with CAZy, which means that existing annotations are automatically overwritten. This is a separate issue, but could be addressed alongside those mentioned above.
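For illustration, such a sanity check could be as simple as counting existing rows for the annotation source before writing new ones. The sketch below runs against a throwaway in-memory SQLite table. The table name `gene_functions`, its columns, and the source name `CAZyme` are assumptions made for the example, and `has_annotations` is a hypothetical helper, not an existing anvi'o function.

```python
import sqlite3


def has_annotations(connection, source):
    """Return True if any gene function rows already exist for `source`."""
    row = connection.execute(
        "SELECT COUNT(*) FROM gene_functions WHERE source = ?", (source,)
    ).fetchone()
    return row[0] > 0


# Throwaway in-memory table with an assumed layout, for demonstration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE gene_functions "
    "(gene_callers_id INTEGER, source TEXT, accession TEXT, function TEXT, e_value REAL)"
)

before = has_annotations(conn, "CAZyme")  # nothing annotated yet

conn.execute(
    "INSERT INTO gene_functions VALUES (?, ?, ?, ?, ?)",
    (438, "CAZyme", "CBM32", "Carbohydrate-Binding Module Family 32", 2.2e-34),
)

after = has_annotations(conn, "CAZyme")  # a second run should now stop or warn
print(before, after)
```

A program could refuse to continue when the check returns True unless the user explicitly asks to overwrite.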
Hi @ivagljiva ,
The CAZy hmm profiles use the NAME field for both name and for the CAZy accession number. Tbh, there isn't much more that you can get out of the CAZy families given their breadth.
They also reserve the ACC and DESC fields for cross-references, which in the current V12 only applies to the 9 profiles that were incorporated into Pfam.
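A quick way to see which profiles carry a cross-reference is to scan the header lines of the HMM file. The sketch below relies only on the HMMER3 ASCII profile layout (one `NAME`, optional `ACC` and `DESC` lines per profile, records terminated by `//`). The function name `read_profile_headers` is hypothetical, and the sample text is made up for the example rather than copied from the real CAZyme file.

```python
def read_profile_headers(lines):
    """Collect the NAME, ACC and DESC fields of each profile in an HMMER3 file."""
    profiles = []
    current = {}
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("//"):
            # '//' closes one profile record
            if current:
                profiles.append(current)
            current = {}
            continue
        key, _, value = line.partition(" ")
        if key in ("NAME", "ACC", "DESC"):
            current[key] = value.strip()
    if current:
        profiles.append(current)
    return profiles


# Made-up header excerpt in HMMER3 layout, for demonstration only.
sample = """HMMER3/f [3.1b2 | February 2015]
NAME  CBM32.hmm
LENG  124
//
HMMER3/f [3.1b2 | February 2015]
NAME  GT2_Glycos_transf_2.hmm
ACC   PF00535.25
LENG  170
//
""".splitlines()

headers = read_profile_headers(sample)
with_acc = [p["NAME"] for p in headers if "ACC" in p]
print(with_acc)
```

Run on the real profile file (for example, by passing an open file handle as `lines`), the same filter would list the profiles that have an ACC value.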
Thanks for the insight @xvazquezc . I think my recommendations for post-processing of the hits still make sense given that information. If we are annotating our gene calls with CAZy, we want the CAZy accession numbers to be in the accession field regardless of whether the ACC field in the profile is used for cross-referencing other databases.
Thanks for incorporating anvi-run-cazymes into your anvi'o metabolism framework and finding ways to optimize the program!
> In my opinion, the expected behavior here would be to use the ID number of the CAZy class (i.e., GH73) as the accession number for these annotations to make them 1) usable/searchable in downstream applications and 2) consistent across all CAZyme annotations.

I completely agree with your suggestions, and thanks for taking the time to document a programmatic solution. I'll take a swing at this! I've already tagged this issue in #2099 because dbCAN3 might have solved this accession issue.
you are the best @mschecht, thank you! :)