cobrapy
cobrapy copied to clipboard
Switch SBML parser defaults to not rewriting IDs
Problem description
Reading a SBML model with ASCII codes in the IDs may generate invalid cobrapy IDs during parsing (for instance names including spaces).
Code Sample
That one arose in our group. A colleague wanted to read SBML model with the following species:
<species metaid="M_trans__45__2__45__Methyl__45__5__45__isopropylhexa__45__2__44__5__45__dienoic__32__acid__91__c0__93__" id="M_trans__45__2__45__Methyl__45__5__45__isopropylhexa__45__2__44__5__45__dienoic__32__acid__91__c0__93__" name="trans_2_Methyl_5_isopropylhexa_2_5_dienoic_acid_c0" compartment="c0" hasOnlySubstanceUnits="false" boundaryCondition="false" constant="false" fbc:charge="-1"/>
Actual Output
Will raise an error "trans-2-Methyl-5-isopropylhexa-2,5-dienoic acid[c0]" contains whitespace character " .
Expected Output
Since the original id in the SBML is valid this should be read as is or the parser should at least not substitute forbidden characters like __32__ being translated to a space.
I have no clue who inserted the corresponding code in https://github.com/opencobra/cobrapy/blob/devel/cobra/io/sbml.py i.e.
def _escape_non_alphanum(nonASCII):
"""converts a non alphanumeric character to a string representation of
its ascii number """
return '__' + str(ord(nonASCII.group())) + '__'
def _number_to_chr(numberStr):
"""converts an ascii number to a character """
return chr(int(numberStr.group(1)))
but this exactly the problem one creates by randomly replacing things in ids. I am strongly against doing this!
Seems to have been pulled in with #901 and the reason was that cobratoolbox does it this way. I think for the direction of cobrapy ID -> SBML ID this makes sense since cobrapy has different specs for IDs. But I agree this should not be done for SBML -> cobrapy.
Yes, I am aware why this is happening, but it just breaks things for the community as a whole. To apply arbitrary id replacements just breaks workflows, because people create ids to identify things and map things on them. If people arbitrarily change those ids in tools it is just bad practise. What currently is happening:
- the bigg database decided to have arbitrary prefixes on ids
M_, _M`, but some other people decided they don't like them, so let's just remove them - the COBRA matlab people decided ASCII replacements is a good thing, so let's do this also
- so I don't like
aandeso we should probably also just delete this from identifiers
The SBML parser has a dictionary of replacements everybody can apply! if they think they are good. But we should not force arbitrary replacements on the community as a whole, because some tools decide this is what they want to do.
See here documentation of the function:
def read_sbml_model(filename, number=float, f_replace=F_REPLACE,
**kwargs):
"""Reads SBML model from given filename.
If the given filename ends with the suffix ''.gz'' (for example,
''myfile.xml.gz'),' the file is assumed to be compressed in gzip
format and will be automatically decompressed upon reading. Similarly,
if the given filename ends with ''.zip'' or ''.bz2',' the file is
assumed to be compressed in zip or bzip2 format (respectively). Files
whose names lack these suffixes will be read uncompressed. Note that
if the file is in zip format but the archive contains more than one
file, only the first file in the archive will be read and the rest
ignored.
To read a gzip/zip file, libSBML needs to be configured and linked
with the zlib library at compile time. It also needs to be linked
with the bzip2 library to read files in bzip2 format. (Both of these
are the default configurations for libSBML.)
This function supports SBML with FBC-v1 and FBC-v2. FBC-v1 models
are converted to FBC-v2 models before reading.
The parser tries to fall back to information in notes dictionaries
if information is not available in the FBC packages, e.g.,
CHARGE, FORMULA on species, or GENE_ASSOCIATION, SUBSYSTEM on reactions.
Parameters
----------
filename : path to SBML file, or SBML string, or SBML file handle
SBML which is read into cobra model
number: data type of stoichiometry: {float, int}
In which data type should the stoichiometry be parsed.
f_replace : dict of replacement functions for id replacement
Dictionary of replacement functions for gene, specie, and reaction.
By default the following id changes are performed on import:
clip G_ from genes, clip M_ from species, clip R_ from reactions
If no replacements should be performed, set f_replace={}, None
so if you want to replace a and e just specify this in f_replace when calling the SBML parser, same for the ASCII things.
This still would not solve the cobrapy to SBML problem. "atp[c]" is a valid cobrapy ID but not a valid SBML ID. We could throw an error in this case but this would probably lead to users not using SBML and switching to JSON which is also not our goal I think. What would be your suggestion for this case?
A large part of the community clearly stated SBML Level 3 and FBC is the way to exchange FBC models in the recent MEMOTE paper (https://www.nature.com/articles/s41587-020-0446-y), including @cdiener and key people in the COBRA toolbox. Having identifiers which are incompatible with SBML is a relict from 15 years ago when the COBRA toolbox first started. Now enforcing an out of date convention which clearly creates more problems then it solves (see CPLEX problems just for an instance or also this issue) is something which I just don't understand. People in the COBRA community have to realize there are a ton of FBC tools out there which are not COBRA and as a community we should work on the interoperability of all tools not on a very narrow subset of COBRA-cobrapy which breaks things for everybody else in the FBC ecosystem.
There should be a clear warning on invalid identifiers instead of trying to monkey-patch identifiers. My biggest problem with that is that: You cannot map data on such networks, because the visualization tools and many analysis tools work with SBML and SBML identifiers. So why not just encourage good practise and give warnings on invalid identifiers?
And just a side note. Valid SBML identifiers would also solve many other issues out of the box such as the CPLEX problems or autocompletion of variables (#828). Most tools expect id strings which would be valid variable names in the respective programming languages, i.e. not starting with _, no special characters, not starting with [0-9].
Just have to put the sentence from the memote 2020 paper here. Basically all of the cobrapy core developer are on the paper with the clear statement:
A dual approach could be used to improve GEM reuse and reproducibility. First, we advocate adoption of the latest version of the Systems Biology Markup Language (SBML) level 3 flux balance constraints (SBML3FBC) package as the primary description and exchange format.
Highly encouraging the use of SBML compliant ids should be the right thing to do here.
While I agree with your points on warning about invalid identifiers @matthiaskoenig and I certainly stand by the above support for SBML, we also have to face the reality that there are hundreds (thousands?) of models that follow the old conventions. I certainly don't expect every user of such a model to go and fix the IDs (especially since it is so unclear how and where they should do so in a sustainable manner) so we have to be careful at which point the warnings become detrimental.
I don't think I advocated for not using SBML in my previous statement. On the contrary, I want people to use it whenever possible. I don't think that attacks on the community or me are conducive to reaching that goal. I rather asked for your advice on how we should treat this.
FWIW my preference would be always load IDs without modification from SBMl models and provide helpers to convert them to the legacy format if absolutely necessary. For saving SBML there should be an error that points towards a function that can convert IDs to SBML compliant format if the user does not want to do so themselves. This could use the ASCII substitutions as in the COBRA toolbox or any other method.
From what I can gather from our communication channels and questionnaires the part of the community that builds models themselves uses tools (Kbase, CarveME) that will produce SBML compliant IDs anyways. Those issues arise more often with models being passed on to researchers from their PI/collaborator and they are often just looking for a way of making it compliant with SBML in the easiest way possible.
Definitely backwards compatibility and reading of old models is a big issue. Having too many warnings is definitely detrimental to user experience, but at least one warning making aware of the problem when reading such old models should be shown to the user. Not sure if we could have something like a strict flag or something similar which would enable more strict warnings which can be disabled if people want to work with old models.
On Fri, Jul 3, 2020 at 10:47 PM Christian Diener [email protected] wrote:
I don't think I advocated for not using SBML in my previous statement. On the contrary, I want people to use it whenever possible. I don't think that personal attacks on the community or me are conducive to reaching that goal. I rather asked for your advice on how we should treat this.
FWIW my preference would be always load IDs without modification from SBMl models and provide helpers to convert them to the legacy format if absolutely necessary. For saving SBML there should be an error that points towards a function that can convert IDs to SBML compliant format if the user does not want to do so themselves. This could use the ASCII substitutions as in the COBRA toolbox or any other method.
From what I can gather from our communication channels and questionnaires the part of the community that builds models themselves uses tools (Kbase, CarveME) that will produce SBML compliant IDs anyways. Those issues arise more often with models being passed on to researchers from their PI/collaborator and they are often just looking for a way of making it compliant with SBML in the easiest way possible.
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/opencobra/cobrapy/issues/955#issuecomment-653673848, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAG33OXA6FRMDCDQDXUAQGTRZY7XJANCNFSM4M3RNNTQ .
-- Matthias König, PhD. Junior Group Leader LiSyM - Systems Medicine of the Liver Humboldt Universität zu Berlin, Institute of Biology, Institute for Theoretical Biology https://livermetabolism.com [email protected] https://twitter.com/konigmatt https://github.com/matthiaskoenig Tel: +49 30 2093 98435
R has this functionality where it collects many warnings and one can call warnings() to see them at a later point. Maybe we could issue different warnings, like "There were 99 problems with metabolite identifiers. Please run validate_sbml_model if you want to see the details.". What do you think?
@Midnighter This sounds great. I was thinking in the same direction. This will also speed up the parsing of models with many warnings because we are not printing all the warnings, but just the overview warning. But it should be easy to get to the full set of warnings and issues with the model.
We are having the same problem, and in an extreme case it can crash the Python kernel. For example, a reaction with id A2____2___A is replaced with a non-alphanumeric ASCII character, which breaks the glpk solver and crashes the Python kernel.
The following code reproduces this error:
import cobra
model = cobra.io.load_model('BIOMD0000000586')
Crashing the Python kernel with:
glp_set_col_name: j = 27: column name contains invalid character(s)
Error detected in file ..\src\api\prob1.c at line 507
We have implemented this hack to temporally fix this, in case it is useful to anyone having this problem:
import cobra
from typing import Match
def _number_to_chr_safe(numberStr: Match) -> str:
ascii = chr(int(numberStr.group(1)))
return numberStr.group(1) if cobra.io.sbml.pattern_to_sbml.match(ascii) else ascii
cobra.io.sbml._number_to_chr = _number_to_chr_safe
Any advice for a better solution (besides implementing our own replacement functions) would be appreciated.
My advice is
- use valid SBML identifiers in your model
- don't apply any replacements!
Personally I hate these replacements so much and they should be deactivated by default. It can't be that a software project has to deal with issues related to incorrect usage of a standard. E.g. nobody expects that cobrapy deals with invalid xml, but everybody expects that the parser should fix incorrect ids ?!
It would be great if we could have a vote of getting rid of the default replacements.
@dnlobo Just load your model without replacements
from cobra.io import read_sbml_model
model = read_sbml_model('BIOMD0000000586', f_replace={})
@matthiaskoenig thank you for your answer. I agree with you that the current replacements are very problematic, especially when they can crash the whole Python kernel. Removing them altogether and leave it to the user to define is certainly the easiest way to fix these problems in cobrapy.
@matthiaskoenig, thank you for the workaround. My student lost around two weeks trying to get around this issue.
COBRAPY, you guys are awesome, please do not change model IDs without users explicitly asking for it.
Hello everyone, i try to create a model by using carveme -(with special universe named PhotoEukStein https://www.genoscope.cns.fr/PhotoEukStein/)
i obtain this one model_65.zip
i try to validate my model so i verified it : cobra.io.validate_sbml_model('./model_65.xml') but : glp_set_row_name: i = 1917: row name contains invalid character(s) Error detected in file api/prob1.c at line 450 Abandon (core dumped)
so as mention in #964 (https://github.com/opencobra/cobrapy/issues/964) it was the same problem of invalid characters so as @matthiaskoenig advice here i read the model with f_replace={} but now it's the long ids problem's.. So i take a look in #62 (https://github.com/matthiaskoenig/fbc_curation/issues/62) but i didn't achieve how to find and rename the affected id(s). i want to know how to do it to finally make my FBA. (it said here that normally this problem is fixed in optlang https://github.com/opencobra/cobrapy/issues/609 i'm a bit lost i have to admit ahah)
If everyone could help me i will be very gratefull !!
@Ed0u66 as a work around, if you install OSQP and HIGHSpy, you can use the hybrid optlang interface, which has no issues with "invalid" characters.