datanator
BRENDA content collaboration
Hi,
I'm currently working on upgrading my parser for the BRENDA flat file download. I've implemented a few SQLAlchemy models that seemed fitting for the content. Is there any interest on your side in the content of BRENDA?
Rik van Rosmalen has also written a BRENDA parser https://gitlab.com/wurssb/brenda-parser
Currently it dumps all of BRENDA into either a SQLite DB or a JSON file.
One of the main issues right now is that BRENDA's download does not include a metabolite reference table or any cross-references. However, UniChem does cross-reference metabolites to BRENDA via InChi, and has all their data open. This could make integration possible.
@Midnighter, we're finally starting to work on BRENDA. We're trying to determine if BRENDA contains a record of the reaction associated with each K_cat and K_m (which SABIO-RK clearly displays). Neither the website nor the text file shows this information, but BRENDA's SBML output seems to contain it. I suspect that the SBML output contains inferred kinetic parameters, rather than directly measured kinetic constants. Do you know what information is encoded in the SBML output?
Any code we write will be shared via this repo.
We tried to use Rik's code. Unfortunately, it appears to be out of date with respect to the current format of the BRENDA text file.
We're trying to determine if BRENDA contains a record of the reaction associated with each K_cat and K_m
I don't fully understand what you want to achieve. Given a specific Kcat or Km value, you want to list all reactions (by EC-code) that have this value? This should be possible with a SQL query, however, there are many reactions in BRENDA that specify Kcat and Km as ranges rather than fixed values. The same EC-code can also have different Kcat and Km values in different organisms, of course.
I still haven't finished my BRENDA work as it was not high priority to me. I do have a branch that uses pyparsing to go over the flat file and it's quite promising. I can try to deliver a working version by the end of May.
SABIO-RK contains information about the exact reaction associated with each measured kinetic parameter. In addition, SABIO-RK often presents pairs of kinetic parameters that were measured together (e.g., paired k_cat, K_m).
In contrast, the BRENDA website, text file, and SOAP interface present coarser information. This is why we have preferred to work with SABIO-RK, even though SABIO-RK is also difficult to scrape. The BRENDA website only displays the EC number associated with each kinetic measurement, and the website doesn't present pairs of parameters.
It appears that BRENDA annotates reactions more coarsely than SABIO-RK. However, BRENDA's SBML output suggests that the underlying BRENDA database might have finer-grained reaction information than what is presented in the BRENDA website, text file, and SOAP interface. We haven't found any documentation about the SBML output. We're trying to understand what those files mean, and whether they are a way to pull more information out of BRENDA than what is provided in the text file.
I have not found a way to reliably scrape all SBML output files from BRENDA, as I think this previously required paid access. It would, of course, be preferable to the terrible text format.
With regard to the information that you are looking for: BRENDA gives entries for the K_cat value divided by the K_m value, for example,
KKM #2# 314 (#2# recombinant isozyme, pH 7.5, 30°C <45>) <45>
So one could look at the matching K_m value (by protein and citation), in this case
KM #2# 0.165 {GMP} (#2# recombinant isozyme, pH 7.5, 30°C <45>) <45>
FYI, this is for EC-code 2.7.4.8 and this specific entry is for
PR #2# Bacillus subtilis <45>
So that would give you what you are looking for?
Basically, we're trying to infer the link between the SP entries and the TN, KM, and KKM entries.
I don't think the BRENDA text files provide enough information to reconstruct this.
- Each PR entry can be associated with multiple SP entries
- Each PR entry can have multiple associated KM, TN, and KKM entries, far more than the number of substrates or products of a single reaction
- Each RF entry can be associated with many PR, KM, TN, and KKM entries
This is what motivated us to look at the other BRENDA outputs, to try to extract this mapping out of BRENDA.
I'll contact BRENDA to ask them about the SBML output. I can share what I learn.
It would be super nice to just get a database dump rather than having to jump through so many hoops.
I'm looking to understand if the text file lacks relationships between KM and TN entries that the underlying database captures, and if these relationships are captured, I'd like to obtain this information.
A database dump would be nice. Any format with this relational information would be an improvement.
I still think it's possible to tell these apart, however, if you look at the comment in each entry.
TN #2# 52 {GMP} (#2# recombinant isozyme, pH 7.5, 30°C <45>) <45>
There is only one entry in each section that has the same protein reference #2#, comment (...) and literature reference <45>.
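A minimal sketch of this matching idea, assuming each entry has already been joined onto a single line in the layout quoted above. The regular expression and the names `parse_entry` and `pair_unique` are illustrative, not part of any existing BRENDA parser, and the pattern has only been checked against the entries quoted in this thread.

```python
import re
from collections import defaultdict

# Assumed single-line layout:
#   CODE #proteins# value {substrate} (comment) <references>
ENTRY_RE = re.compile(
    r"^(?P<field>[A-Z]+)\s+"              # section code, e.g. KM, TN, KKM
    r"#(?P<proteins>[\d,]+)#\s+"          # protein ids
    r"(?P<value>[^\s{(]+)"                # value or range
    r"(?:\s+\{(?P<substrate>[^}]*)\})?"   # optional substrate
    r"(?:\s+\((?P<comment>.*)\))?"        # optional comment
    r"\s+<(?P<refs>[\d,]+)>$"             # literature references
)


def parse_entry(line):
    """Split one flat-file entry into its fields."""
    match = ENTRY_RE.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognised entry: {line!r}")
    return match.groupdict()


def match_key(entry):
    """Fields that two entries must share to be treated as measured together."""
    return (entry["proteins"], entry["substrate"], entry["comment"], entry["refs"])


def group(entries):
    groups = defaultdict(list)
    for entry in entries:
        groups[match_key(entry)].append(entry)
    return groups


def pair_unique(km_entries, tn_entries):
    """Pair KM and TN entries only when the key is unique in both sections."""
    km_groups, tn_groups = group(km_entries), group(tn_entries)
    pairs = []
    for key, kms in km_groups.items():
        tns = tn_groups.get(key, [])
        if len(kms) == 1 and len(tns) == 1:
            pairs.append((kms[0], tns[0]))
    return pairs


km = parse_entry("KM #2# 0.165 {GMP} (#2# recombinant isozyme, pH 7.5, 30°C <45>) <45>")
tn = parse_entry("TN #2# 52 {GMP} (#2# recombinant isozyme, pH 7.5, 30°C <45>) <45>")
pairs = pair_unique([km], [tn])
```

Keys that occur more than once in either section are skipped rather than guessed at, so the sketch only reports pairings that are unambiguous under this key.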
I'm not sure what you gain from the SP entry. The substrate is already provided in the KM and TN entries.
So if you start with KM or TN entries you should be able to identify all the information that you need?
I've only looked at a few examples, though, so I'm easily proven wrong. Also, it'd be painful to parse the information in this way so something structured is definitely preferable :+1:
It shouldn't be this hard.
Inferring the reaction associated with each KM, TN entry from the substrate information
The substrate of each KM or TN entry doesn't contain information about the entire reaction. The reaction can't be inferred from the substrate because the metabolite can participate in multiple reactions.
For example, you can't infer the reaction associated with this TN
TN #114# 1646 {NADH} (#114# cosubstrate acetaldehyde, pH 8.0, 60°C <215>)
<215>
because multiple SP entries involve NADH
SP #96# hexaldehyde + NADH + H+ = 1-hexanol + NAD+ (#96# 7% activity
compared to benzyl alcohol <156>) <156>
SP #96# hydrocinnamaldehyde + NADH + H+ = hydrocinnamyl alcohol + NAD+
(#96# 12% activity compared to benzyl alcohol <156>) {r} <156>
SP #96# nonyl aldehyde + NADH + H+ = 1-nonanol + NAD+ (#96# 25% activity
compared to benzyl alcohol <156>) <156>
SP #96# octyl aldehyde + NADH + H+ = 1-octanol + NAD+ (#96# 29% activity
compared to benzyl alcohol <156>) <156>
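To make the ambiguity concrete, here is a small sketch that counts how many of the SP reactions quoted above consume NADH. The equations are copied from those entries with the protein ids, comments, and references removed by hand; `split_reaction` and `reactions_consuming` are hypothetical helper names, and splitting on " + " and " = " is an assumption about the equation layout.

```python
def split_reaction(equation):
    """Split 'A + B = C + D' into (substrates, products).

    Splitting on ' + ' with surrounding spaces keeps names such as 'H+' intact.
    """
    left, right = equation.split(" = ")
    return left.split(" + "), right.split(" + ")


SP_EQUATIONS = [
    "hexaldehyde + NADH + H+ = 1-hexanol + NAD+",
    "hydrocinnamaldehyde + NADH + H+ = hydrocinnamyl alcohol + NAD+",
    "nonyl aldehyde + NADH + H+ = 1-nonanol + NAD+",
    "octyl aldehyde + NADH + H+ = 1-octanol + NAD+",
]


def reactions_consuming(metabolite, equations):
    """Return every equation that lists the metabolite as a substrate."""
    return [eq for eq in equations if metabolite in split_reaction(eq)[0]]


candidates = reactions_consuming("NADH", SP_EQUATIONS)
print(len(candidates))  # 4 candidate reactions for a single {NADH} entry
```

With four candidates for one substrate, the substrate name alone cannot select a reaction.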
Inferring pairs of KM, TN, KKM, SP from unique tuples of substrates, comments, and references
This is an interesting idea. This might work for inferring relationships between KM and TN entries. I don't think this will work for inferring relationships between KKM and other entries because they don't include substrates. The SP entries don't appear to have the same comments as KM and TN entries.
Example from 1.1.1.1:
KKM #115# 3.6 (#115# cosubstrate NADP+, pH 8.0, 60°C <215>) <215>
KKM #115# 67.2 (#115# cosubstrate NADP+, pH 8.0, 60°C <215>) <215>
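The collision in this example can be detected mechanically. The sketch below groups KKM entries by protein ids, comment, and references and reports any key that maps to more than one value. The pattern `KKM_RE` is an assumption based only on the two lines quoted above.

```python
import re
from collections import defaultdict

# Assumed layout for KKM entries: KKM #proteins# value (comment) <references>
KKM_RE = re.compile(r"^KKM\s+#([\d,]+)#\s+(\S+)\s+\((.*)\)\s+<([\d,]+)>$")

lines = [
    "KKM #115# 3.6 (#115# cosubstrate NADP+, pH 8.0, 60°C <215>) <215>",
    "KKM #115# 67.2 (#115# cosubstrate NADP+, pH 8.0, 60°C <215>) <215>",
]

values_by_key = defaultdict(list)
for line in lines:
    proteins, value, comment, refs = KKM_RE.match(line).groups()
    values_by_key[(proteins, comment, refs)].append(value)

# Keys with more than one value cannot be paired with a single KM or TN entry.
ambiguous = {key: values for key, values in values_by_key.items() if len(values) > 1}
```

Here both values share one key, so the tuple of protein ids, comment, and references does not identify a KKM entry uniquely.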
Okay, that's a clear counterexample. Let's see if you get a reply from BRENDA. I tried once some years back and never got an answer. I was probably not persistent enough.
Given the way the textual data is structured, I would definitely check a number of examples manually to see whether the associations presented by BRENDA are correct.
FYI, I think the SBML output would also be difficult to use. It times out easily. You'd have to figure out how to make the queries small enough not to time out. One possibility is to iterate over each EC number and each organism.
for ec_code in ec_codes:
    for organism in organisms:
        # get_sbml is a placeholder for one small SBML request per (EC, organism) pair
        get_sbml(ec_code, organism)
Also, the SBML output is missing some of the information shown in the HTML preview of the SBML:
- No enzyme info (UniProt id)
- No comments
- No references
The SBML does give insight into how to parse temperature and pH from the comments:
r'(^|,[ \n])(\d+(\.\d+)?)°C(,[ \n]|$)'
r'(^|,[ \n])pH[ \n](\d+(\.\d+)?)(,[ \n]|$)'
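A short sketch of how these two patterns could be applied in Python. The patterns are the ones above; stripping the leading protein ids and trailing reference marker before matching is my assumption, since the patterns expect the temperature or pH to be delimited by commas or the ends of the string. `extract_conditions` is a hypothetical name.

```python
import re

TEMPERATURE_RE = re.compile(r'(^|,[ \n])(\d+(\.\d+)?)°C(,[ \n]|$)')
PH_RE = re.compile(r'(^|,[ \n])pH[ \n](\d+(\.\d+)?)(,[ \n]|$)')


def extract_conditions(comment):
    """Pull temperature (°C) and pH out of one flat-file comment, if present."""
    # Assumption: drop a leading '#2# ' style prefix and a trailing ' <45>' marker.
    text = re.sub(r'^#[\d,]+#\s*', '', comment.strip())
    text = re.sub(r'\s*<[\d,]+>$', '', text)
    temperature = TEMPERATURE_RE.search(text)
    ph = PH_RE.search(text)
    return {
        "temperature_c": float(temperature.group(2)) if temperature else None,
        "ph": float(ph.group(2)) if ph else None,
    }


conditions = extract_conditions("#2# recombinant isozyme, pH 7.5, 30°C <45>")
```

Comments without a recognisable temperature or pH yield `None` for that field instead of a guess.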
I'm looking into your suggestion about matching tuples of protein ids, comments, and references. This might work for pairing k_cats with K_ms, but I don't think this works for inferring the reaction associated with each k_cat/K_m. It doesn't look like these relationships have been encoded into the text file. While you can find pairs of entries with overlapping protein ids, substrates, comments, and references, it appears to be difficult to unambiguously resolve relationships. I think trying to infer relationships is likely to infer false relationships that are not present in the underlying database. At least for our purposes, we're hesitant to add additional interpretation on top of the BRENDA data.
In spite of these problems, I think BRENDA is doing exactly what you've suggested to build the SBML output. However, I think this is difficult to replicate because we don't know the details of how the BRENDA database is encoded into the text file.
I got a response from the BRENDA team:
- Recently, they have begun to track the specific reaction associated with each KM and TN. However, I don't think we have a way to access this information, or to discern which entries have this metadata.
- For the oldest curated entries (entries curated > 15 years ago), there is no way to discern the reaction associated with KM and TN because these entries don't have sufficient metadata to attempt to infer the associated reaction. The BRENDA team is slowly filling in this missing metadata.
- For most entries, the organism, comments, and references can potentially be used to infer the specific reaction associated with each KM and TN. However, there's no way to avoid inferring false relationships.
- We don't have any timestamps that we can use to discern when an entry was curated.
For Datanator, we're hesitant to infer false relationships. We want Datanator to be as free of interpretation as possible so that our downstream projects have as much control over the representation of experimental data as possible.
Thanks for the input. Any word on accessing all SBML or other structured data set?
The BRENDA team didn't respond to my question about the SBML output. I suspect the reactions in the SBML output are inferred from common enzymes, comments, and references. I think the temperature and pH are also inferred by similar string pattern matching of the comments.
There's no other more structured output available. In any case, this wouldn't have the missing relationships because they have never been recorded.
If you're looking for a more structured dataset, I recommend SABIO-RK.