datanator BRENDA content collaboration

Hi,

I'm currently working on upgrading my parser for the BRENDA flat file download. I've implemented a few SQLAlchemy models that seemed fitting for the content. Is there any interest on your side in the content of BRENDA?

Feb 20 '18 10:02 Midnighter

Rik van Rosmalen has also written a BRENDA parser https://gitlab.com/wurssb/brenda-parser

Currently it dumps all of BRENDA into either a SQLite DB or a JSON file.

One of the main issues right now is that BRENDA's download does not include a metabolite reference table or any cross-references. However, UniChem does cross-reference metabolites to BRENDA via InChI, and has all their data open. This could make integration possible.

Nov 23 '18 20:11 jonrkarr

@Midnighter, we're finally starting to work on BRENDA. We're trying to determine if BRENDA contains a record of the reaction associated with each K_cat and K_m (which SABIO-RK clearly displays). Neither the website nor the text file shows this information, but the BRENDA SBML output seems to contain it. I suspect that the SBML output contains inferred kinetic parameters, rather than directly measured kinetic constants. Do you know what information is encoded in the SBML output?

Any code we write will be shared via this repo.

We tried to use Rik's code. Unfortunately, it appears to be out of date with respect to the current format of the BRENDA text file.

Apr 22 '20 23:04 jonrkarr

> We're trying to determine if BRENDA contains a record of the reaction associated with each K_cat and K_m

I don't fully understand what you want to achieve. Given a specific Kcat or Km value, you want to list all reactions (by EC-code) that have this value? This should be possible with a SQL query, however, there are many reactions in BRENDA that specify Kcat and Km as ranges rather than fixed values. The same EC-code can also have different Kcat and Km values in different organisms, of course.

I still haven't finished my BRENDA work as it was not high priority to me. I do have a branch that uses pyparsing to go over the flat file and it's quite promising. I can try to deliver a working version by the end of May.

Apr 23 '20 08:04 Midnighter

SABIO-RK contains information about the exact reaction associated with each measured kinetic parameter. In addition, SABIO-RK often presents pairs of kinetic parameters that were measured together (e.g., paired k_cat, K_m).

In contrast, the BRENDA website, text file, and SOAP interface present coarser information. This is why we have preferred to work with SABIO-RK, even though SABIO-RK is also difficult to scrape. The BRENDA website only displays the EC number associated with each kinetic measurement, and the website doesn't present pairs of parameters.

It appears that BRENDA annotates reactions more coarsely than SABIO-RK. However, BRENDA's SBML output suggests that the underlying BRENDA database might have finer-grained reaction information than what is presented in the BRENDA website, text file, and SOAP interface. We haven't found any documentation about the SBML output. We're trying to understand what those files mean, and whether they are a way to pull more information out of BRENDA than what is provided in the text file.

Apr 23 '20 13:04 jonrkarr

I have not found a way to reliably scrape all SBML output files from BRENDA; I think this previously required paid access. It would, of course, be preferable to the terrible text format.

With regard to the information that you are looking for: BRENDA gives entries for the K_cat value divided by the K_m value, for example,

KKM	#2# 314 (#2# recombinant isozyme, pH 7.5, 30°C <45>) <45>

So one could look at the matching K_m value (by protein and citation), in this case

KM	#2# 0.165 {GMP}  (#2# recombinant isozyme, pH 7.5, 30°C <45>) <45>

FYI, this is for EC-code 2.7.4.8 and this specific entry is for

PR	#2# Bacillus subtilis   <45>
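A minimal sketch of that matching in Python, assuming every entry fits on one line and follows the layout of the examples above (section code, `#protein ids#`, value, optional `{substrate}`, optional `(comment)`, `<references>`). The names `ENTRY`, `parse_entry`, and `match_key` are made up for illustration, and the pattern has only been checked against the lines quoted here, not the whole flat file.

```python
import re
from typing import NamedTuple, Optional

# Assumed layout, inferred only from the example lines in this thread:
#   CODE <whitespace> #protein ids# value {substrate} (comment) <references>
ENTRY = re.compile(
    r"^(?P<code>[A-Z]+)\s+"
    r"#(?P<proteins>[\d,]+)#\s+"
    r"(?P<value>\S+)"
    r"(?:\s+\{(?P<substrate>[^}]*)\})?"
    r"(?:\s+\((?P<comment>.*)\))?"
    r"\s+<(?P<refs>[\d,]+)>\s*$"
)


class Entry(NamedTuple):
    code: str
    proteins: tuple
    value: str
    substrate: Optional[str]
    comment: Optional[str]
    refs: tuple


def parse_entry(line):
    """Parse one single-line entry; return None if it does not fit the assumed layout."""
    m = ENTRY.match(line.strip())
    if m is None:
        return None
    return Entry(
        code=m["code"],
        proteins=tuple(m["proteins"].split(",")),
        value=m["value"],
        substrate=m["substrate"],
        comment=m["comment"],
        refs=tuple(m["refs"].split(",")),
    )


def match_key(entry):
    """Protein ids, comment, and references: the fields used to pair entries."""
    return (entry.proteins, entry.comment, entry.refs)


kkm = parse_entry("KKM\t#2# 314 (#2# recombinant isozyme, pH 7.5, 30°C <45>) <45>")
km = parse_entry("KM\t#2# 0.165 {GMP}  (#2# recombinant isozyme, pH 7.5, 30°C <45>) <45>")
same_measurement = match_key(kkm) == match_key(km)
```

For these two lines the keys are equal, so the KKM value would be paired with the KM value for GMP. Substrate names that themselves contain braces are not handled.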

So that would give you what you are looking for?

Apr 23 '20 14:04 Midnighter

Basically, we're trying to infer the link between the SP entries and the TN, KM, and KKM entries.

I don't think the BRENDA text files provide enough information to reconstruct this.

Each PR entry can be associated with multiple SP entries.
Each PR entry can have multiple associated KM, TN, and KKM entries, far more than the number of substrates or products of a single reaction.
Each RF entry can be associated with many PR, KM, TN, and KKM entries.

This is what motivated us to look at the other BRENDA outputs, to try to extract this mapping out of BRENDA.

Apr 23 '20 14:04 jonrkarr

I'll contact BRENDA to ask them about the SBML output. I can share what I learn.

Apr 23 '20 14:04 jonrkarr

It would be super nice to just get a database dump rather than having to jump through so many hoops.

Apr 23 '20 14:04 Midnighter

I'm looking to understand if the text file lacks relationships between KM and TN entries that the underlying database captures, and if these relationships are captured, I'd like to obtain this information.

A database dump would be nice. Any format with this relational information would be an improvement.

Apr 23 '20 14:04 jonrkarr

I still think it's possible to tell these apart, however, if you look at the comment in each entry.

TN	#2# 52 {GMP}  (#2# recombinant isozyme, pH 7.5, 30°C <45>) <45>

There is only one entry in each section that has the same protein reference #2#, comment (...) and literature reference <45>.

I'm not sure what you gain from the SP entry. The substrate is already provided in the KM and TN entries.

So if you start with KM or TN entries you should be able to identify all the information that you need?

I've only looked at a few examples, though, so I'm easily proven wrong. Also, it'd be painful to parse the information in this way so something structured is definitely preferable :+1:

Apr 23 '20 14:04 Midnighter

It shouldn't be this hard.

Inferring the reaction associated with each `KM`, `TN` entry from the substrate information

The substrate of each KM or TN entry doesn't contain information about the entire reaction. The reaction can't be inferred from the substrate because the metabolite can participate in multiple reactions.

For example, you can't infer the reaction associated with this TN

TN      #114# 1646 {NADH}  (#114# cosubstrate acetaldehyde, pH 8.0, 60°C <215>)
        <215>

because multiple SP entries involve NADH

SP      #96# hexaldehyde + NADH + H+ = 1-hexanol + NAD+ (#96# 7% activity
        compared to benzyl alcohol <156>) <156>
SP      #96# hydrocinnamaldehyde + NADH + H+ = hydrocinnamyl alcohol + NAD+
        (#96# 12% activity compared to benzyl alcohol <156>) {r} <156>
SP      #96# nonyl aldehyde + NADH + H+ = 1-nonanol + NAD+ (#96# 25% activity
        compared to benzyl alcohol <156>) <156>
SP      #96# octyl aldehyde + NADH + H+ = 1-octanol + NAD+ (#96# 29% activity
        compared to benzyl alcohol <156>) <156>
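To make the ambiguity concrete, here is a rough Python sketch that lists every SP reaction whose left-hand side contains a given substrate. It assumes the wrapped SP layout shown above; `REACTION` and `candidate_reactions` are hypothetical names, and the pattern has only been checked against these four entries.

```python
import re

# The four SP entries quoted above, including their wrapped continuation lines.
SP_ENTRIES = [
    "SP      #96# hexaldehyde + NADH + H+ = 1-hexanol + NAD+ (#96# 7% activity\n"
    "        compared to benzyl alcohol <156>) <156>",
    "SP      #96# hydrocinnamaldehyde + NADH + H+ = hydrocinnamyl alcohol + NAD+\n"
    "        (#96# 12% activity compared to benzyl alcohol <156>) {r} <156>",
    "SP      #96# nonyl aldehyde + NADH + H+ = 1-nonanol + NAD+ (#96# 25% activity\n"
    "        compared to benzyl alcohol <156>) <156>",
    "SP      #96# octyl aldehyde + NADH + H+ = 1-octanol + NAD+ (#96# 29% activity\n"
    "        compared to benzyl alcohol <156>) <156>",
]

# Assumed layout after collapsing whitespace:
#   SP #ids# lhs = rhs (#ids# comment <refs>) {r} <refs>
REACTION = re.compile(
    r"^SP #[\d,]+# (?P<lhs>.+?) = (?P<rhs>.+?)(?= \(#| \{| <|$)"
)


def candidate_reactions(substrate, sp_entries):
    """Return every reaction whose left-hand side lists `substrate`."""
    hits = []
    for raw in sp_entries:
        text = " ".join(raw.split())  # join wrapped lines into one
        m = REACTION.match(text)
        if m and substrate in m["lhs"].split(" + "):
            hits.append(f'{m["lhs"]} = {m["rhs"]}')
    return hits


candidates = candidate_reactions("NADH", SP_ENTRIES)
is_ambiguous = len(candidates) > 1
```

All four entries come back as candidates for NADH, so the substrate alone does not pick out one reaction.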

Inferring pairs of `KM`, `TN`, `KKM`, `SP` from unique tuples of substrates, comments, and references

This is an interesting idea. This might work for inferring relationships between KM and TN entries. I don't think this will work for inferring relationships between KKM and other entries because they don't include substrates. The SP entries don't appear to have the same comments as KM and TN entries.

Example from 1.1.1.1:

KKM	#115# 3.6 (#115# cosubstrate NADP+, pH 8.0, 60°C <215>) <215>
KKM	#115# 67.2 (#115# cosubstrate NADP+, pH 8.0, 60°C <215>) <215>
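A quick way to spot such collisions, sketched in Python: drop the numeric value from each entry and count how often the remaining tuple (section code, protein ids, everything after the value) occurs. The helper name `key_without_value` is hypothetical, and the split assumes single-line entries laid out like the two above.

```python
import re
from collections import Counter

LINES = [
    "KKM\t#115# 3.6 (#115# cosubstrate NADP+, pH 8.0, 60°C <215>) <215>",
    "KKM\t#115# 67.2 (#115# cosubstrate NADP+, pH 8.0, 60°C <215>) <215>",
]


def key_without_value(line):
    """Return (code, protein ids, rest of entry) with the numeric value removed."""
    code, body = line.split(None, 1)
    m = re.match(r"(#[\d,]+#)\s+\S+\s+(.*)$", body)
    if m is None:
        raise ValueError(f"unexpected entry layout: {line!r}")
    return (code, m.group(1), m.group(2))


counts = Counter(key_without_value(line) for line in LINES)
# Keys that occur more than once cannot be paired with other entries unambiguously.
ambiguous = [key for key, n in counts.items() if n > 1]
```

Both KKM lines reduce to the same key, so a lookup by protein id, comment, and reference returns two values with nothing to tell them apart.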

Apr 23 '20 14:04 jonrkarr

Okay, that's a clear counterexample. Let's see if you get a reply from BRENDA. I tried once some years back and never got an answer. I was probably not persistent enough.

Given the way the textual data is structured, I would definitely check a number of examples manually to see if the associations presented by BRENDA are correct...

Apr 23 '20 15:04 Midnighter

FYI, I think the SBML output would also be difficult to use. It times out easily. You'd have to figure out how to make the queries small enough not to time out. One possibility is to iterate over each EC number and each organism.

# pseudocode: get_sbml stands in for one small SBML query per (EC, organism) pair
for ec_code in ec_codes:
    for organism in organisms:
        get_sbml(ec_code, organism)

Apr 23 '20 15:04 jonrkarr

Also, the SBML output is missing some of the information shown in the HTML preview of the SBML:

No enzyme info (UniProt id)
No comments
No references

The SBML does give insight into how to parse temperature and pH from the comments:

r'(^|,[ \n])(\d+(\.\d+)?)°C(,[ \n]|$)'
r'(^|,[ \n])pH[ \n](\d+(\.\d+)?)(,[ \n]|$)'
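A short sketch of applying these two patterns in Python. It assumes the leading `#…#` protein ids and trailing `<…>` references are stripped from the comment first, since the patterns expect the temperature or pH to sit at a comma boundary or at the end of the string; `strip_ids` and `conditions` are hypothetical helper names.

```python
import re

TEMPERATURE = re.compile(r'(^|,[ \n])(\d+(\.\d+)?)°C(,[ \n]|$)')
PH = re.compile(r'(^|,[ \n])pH[ \n](\d+(\.\d+)?)(,[ \n]|$)')


def strip_ids(comment):
    """Remove a leading '#ids# ' and a trailing ' <refs>' from a text-file comment."""
    return re.sub(r'^#[\d,]+#\s*|\s*<[\d,]+>$', '', comment)


def conditions(comment):
    """Return (pH, temperature in °C); each is None if its pattern does not match."""
    text = strip_ids(comment)
    ph = PH.search(text)
    temp = TEMPERATURE.search(text)
    return (
        float(ph.group(2)) if ph else None,
        float(temp.group(2)) if temp else None,
    )


example = conditions("#2# recombinant isozyme, pH 7.5, 30°C <45>")
```

Comments that don't mention pH or temperature in this exact form (for example, "7% activity compared to benzyl alcohol") yield `None` for both values.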

Apr 23 '20 15:04 jonrkarr

I'm looking into your suggestion about matching tuples of protein ids, comments, and references. This might work for pairing k_cats with K_ms, but I don't think this works for inferring the reaction associated with each k_cat/K_m. It doesn't look like these relationships have been encoded into the text file. While you can find pairs of entries with overlapping protein ids, substrates, comments, and references, it appears to be difficult to unambiguously resolve relationships. I think trying to infer relationships is likely to infer false relationships that are not present in the underlying database. At least for our purposes, we're hesitant to add additional interpretation on top of the BRENDA data.

In spite of these problems, I think BRENDA is doing exactly what you've suggested to build the SBML output. However, I think this is difficult to replicate because we don't know the details of how BRENDA is encoded into the text file.

Apr 23 '20 18:04 jonrkarr

I got a response from the BRENDA team:

Recently, they have begun to track the specific reaction associated with each KM and TN. However, I don't think we have a way to access this information, or to discern which entries have this metadata.
For the oldest curated entries (entries curated > 15 years ago), there is no way to discern the reaction associated with KM and TN because these entries don't have sufficient metadata to attempt to infer the associated reaction. The BRENDA team is slowly filling in this missing metadata.
For most entries, the organism, comments, and references can potentially be used to infer the specific reaction associated with each KM and TN. However, there's no way to avoid inferring false relationships.
We don't have any timestamps that we can use to discern when an entry was curated.

For Datanator, we're hesitant to infer false relationships. We want Datanator to be as free of interpretation as possible so that our downstream projects have as much control over the representation of experimental data as possible.

Apr 27 '20 17:04 jonrkarr

Thanks for the input. Any word on accessing all of the SBML or another structured data set?

Apr 27 '20 17:04 Midnighter

The BRENDA team didn't respond to my question about the SBML output. I suspect the reactions in the SBML output are inferred from common enzymes, comments, and references. I think the temperature and pH are also inferred by similar string pattern matching of the comments.

There's no other more structured output available. In any case, this wouldn't have the missing relationships because they have never been recorded.

If you're looking for a more structured dataset, I recommend SABIO-RK.

Apr 27 '20 17:04 jonrkarr
