cobrapy Switch SBML parser defaults to not rewriting IDs

Problem description

Reading a SBML model with ASCII codes in the IDs may generate invalid cobrapy IDs during parsing (for instance names including spaces).

Code Sample

That one arose in our group. A colleague wanted to read SBML model with the following species:

<species metaid="M_trans__45__2__45__Methyl__45__5__45__isopropylhexa__45__2__44__5__45__dienoic__32__acid__91__c0__93__" id="M_trans__45__2__45__Methyl__45__5__45__isopropylhexa__45__2__44__5__45__dienoic__32__acid__91__c0__93__" name="trans_2_Methyl_5_isopropylhexa_2_5_dienoic_acid_c0" compartment="c0" hasOnlySubstanceUnits="false" boundaryCondition="false" constant="false" fbc:charge="-1"/>

Actual Output

Will raise an error "trans-2-Methyl-5-isopropylhexa-2,5-dienoic acid[c0]" contains whitespace character " .

Expected Output

Since the original id in the SBML is valid this should be read as is or the parser should at least not substitute forbidden characters like __32__ being translated to a space.

May 07 '20 18:05 cdiener

I have no clue who inserted the corresponding code in https://github.com/opencobra/cobrapy/blob/devel/cobra/io/sbml.py i.e.

def _escape_non_alphanum(nonASCII):
    """converts a non alphanumeric character to a string representation of
    its ascii number """
    return '__' + str(ord(nonASCII.group())) + '__'


def _number_to_chr(numberStr):
    """converts an ascii number to a character """
    return chr(int(numberStr.group(1)))

but this exactly the problem one creates by randomly replacing things in ids. I am strongly against doing this!

May 21 '20 13:05 matthiaskoenig

Seems to have been pulled in with #901 and the reason was that cobratoolbox does it this way. I think for the direction of cobrapy ID -> SBML ID this makes sense since cobrapy has different specs for IDs. But I agree this should not be done for SBML -> cobrapy.

May 21 '20 15:05 cdiener

Yes, I am aware why this is happening, but it just breaks things for the community as a whole. To apply arbitrary id replacements just breaks workflows, because people create ids to identify things and map things on them. If people arbitrarily change those ids in tools it is just bad practise. What currently is happening:

the bigg database decided to have arbitrary prefixes on ids M_, _M`, but some other people decided they don't like them, so let's just remove them
the COBRA matlab people decided ASCII replacements is a good thing, so let's do this also
so I don't like a and e so we should probably also just delete this from identifiers

The SBML parser has a dictionary of replacements everybody can apply! if they think they are good. But we should not force arbitrary replacements on the community as a whole, because some tools decide this is what they want to do.

May 23 '20 09:05 matthiaskoenig

See here documentation of the function:

def read_sbml_model(filename, number=float, f_replace=F_REPLACE,
                    **kwargs):
    """Reads SBML model from given filename.
    If the given filename ends with the suffix ''.gz'' (for example,
    ''myfile.xml.gz'),' the file is assumed to be compressed in gzip
    format and will be automatically decompressed upon reading. Similarly,
    if the given filename ends with ''.zip'' or ''.bz2',' the file is
    assumed to be compressed in zip or bzip2 format (respectively).  Files
    whose names lack these suffixes will be read uncompressed.  Note that
    if the file is in zip format but the archive contains more than one
    file, only the first file in the archive will be read and the rest
    ignored.
    To read a gzip/zip file, libSBML needs to be configured and linked
    with the zlib library at compile time.  It also needs to be linked
    with the bzip2 library to read files in bzip2 format.  (Both of these
    are the default configurations for libSBML.)
    This function supports SBML with FBC-v1 and FBC-v2. FBC-v1 models
    are converted to FBC-v2 models before reading.
    The parser tries to fall back to information in notes dictionaries
    if information is not available in the FBC packages, e.g.,
    CHARGE, FORMULA on species, or GENE_ASSOCIATION, SUBSYSTEM on reactions.
    Parameters
    ----------
    filename : path to SBML file, or SBML string, or SBML file handle
        SBML which is read into cobra model
    number: data type of stoichiometry: {float, int}
        In which data type should the stoichiometry be parsed.
    f_replace : dict of replacement functions for id replacement
        Dictionary of replacement functions for gene, specie, and reaction.
        By default the following id changes are performed on import:
        clip G_ from genes, clip M_ from species, clip R_ from reactions
        If no replacements should be performed, set f_replace={}, None

so if you want to replace a and e just specify this in f_replace when calling the SBML parser, same for the ASCII things.

May 23 '20 09:05 matthiaskoenig

This still would not solve the cobrapy to SBML problem. "atp[c]" is a valid cobrapy ID but not a valid SBML ID. We could throw an error in this case but this would probably lead to users not using SBML and switching to JSON which is also not our goal I think. What would be your suggestion for this case?

May 24 '20 01:05 cdiener

A large part of the community clearly stated SBML Level 3 and FBC is the way to exchange FBC models in the recent MEMOTE paper (https://www.nature.com/articles/s41587-020-0446-y), including @cdiener and key people in the COBRA toolbox. Having identifiers which are incompatible with SBML is a relict from 15 years ago when the COBRA toolbox first started. Now enforcing an out of date convention which clearly creates more problems then it solves (see CPLEX problems just for an instance or also this issue) is something which I just don't understand. People in the COBRA community have to realize there are a ton of FBC tools out there which are not COBRA and as a community we should work on the interoperability of all tools not on a very narrow subset of COBRA-cobrapy which breaks things for everybody else in the FBC ecosystem.

There should be a clear warning on invalid identifiers instead of trying to monkey-patch identifiers. My biggest problem with that is that: You cannot map data on such networks, because the visualization tools and many analysis tools work with SBML and SBML identifiers. So why not just encourage good practise and give warnings on invalid identifiers?

Jul 03 '20 05:07 matthiaskoenig

And just a side note. Valid SBML identifiers would also solve many other issues out of the box such as the CPLEX problems or autocompletion of variables (#828). Most tools expect id strings which would be valid variable names in the respective programming languages, i.e. not starting with _, no special characters, not starting with [0-9].

Jul 03 '20 06:07 matthiaskoenig

Just have to put the sentence from the memote 2020 paper here. Basically all of the cobrapy core developer are on the paper with the clear statement:

A dual approach could be used to improve GEM reuse and reproducibility. First, we advocate adoption of the latest version of the Systems Biology Markup Language (SBML) level 3 flux balance constraints (SBML3FBC) package as the primary description and exchange format.

Highly encouraging the use of SBML compliant ids should be the right thing to do here.

Jul 03 '20 06:07 matthiaskoenig

While I agree with your points on warning about invalid identifiers @matthiaskoenig and I certainly stand by the above support for SBML, we also have to face the reality that there are hundreds (thousands?) of models that follow the old conventions. I certainly don't expect every user of such a model to go and fix the IDs (especially since it is so unclear how and where they should do so in a sustainable manner) so we have to be careful at which point the warnings become detrimental.

Jul 03 '20 07:07 Midnighter

I don't think I advocated for not using SBML in my previous statement. On the contrary, I want people to use it whenever possible. I don't think that attacks on the community or me are conducive to reaching that goal. I rather asked for your advice on how we should treat this.

FWIW my preference would be always load IDs without modification from SBMl models and provide helpers to convert them to the legacy format if absolutely necessary. For saving SBML there should be an error that points towards a function that can convert IDs to SBML compliant format if the user does not want to do so themselves. This could use the ASCII substitutions as in the COBRA toolbox or any other method.

From what I can gather from our communication channels and questionnaires the part of the community that builds models themselves uses tools (Kbase, CarveME) that will produce SBML compliant IDs anyways. Those issues arise more often with models being passed on to researchers from their PI/collaborator and they are often just looking for a way of making it compliant with SBML in the easiest way possible.

Jul 03 '20 20:07 cdiener

Definitely backwards compatibility and reading of old models is a big issue. Having too many warnings is definitely detrimental to user experience, but at least one warning making aware of the problem when reading such old models should be shown to the user. Not sure if we could have something like a strict flag or something similar which would enable more strict warnings which can be disabled if people want to work with old models.

On Fri, Jul 3, 2020 at 10:47 PM Christian Diener [email protected] wrote:

I don't think I advocated for not using SBML in my previous statement. On the contrary, I want people to use it whenever possible. I don't think that personal attacks on the community or me are conducive to reaching that goal. I rather asked for your advice on how we should treat this.

FWIW my preference would be always load IDs without modification from SBMl models and provide helpers to convert them to the legacy format if absolutely necessary. For saving SBML there should be an error that points towards a function that can convert IDs to SBML compliant format if the user does not want to do so themselves. This could use the ASCII substitutions as in the COBRA toolbox or any other method.

From what I can gather from our communication channels and questionnaires the part of the community that builds models themselves uses tools (Kbase, CarveME) that will produce SBML compliant IDs anyways. Those issues arise more often with models being passed on to researchers from their PI/collaborator and they are often just looking for a way of making it compliant with SBML in the easiest way possible.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/opencobra/cobrapy/issues/955#issuecomment-653673848, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAG33OXA6FRMDCDQDXUAQGTRZY7XJANCNFSM4M3RNNTQ .

-- Matthias König, PhD. Junior Group Leader LiSyM - Systems Medicine of the Liver Humboldt Universität zu Berlin, Institute of Biology, Institute for Theoretical Biology https://livermetabolism.com [email protected] https://twitter.com/konigmatt https://github.com/matthiaskoenig Tel: +49 30 2093 98435

Jul 03 '20 21:07 matthiaskoenig

R has this functionality where it collects many warnings and one can call warnings() to see them at a later point. Maybe we could issue different warnings, like "There were 99 problems with metabolite identifiers. Please run validate_sbml_model if you want to see the details.". What do you think?

Jul 04 '20 10:07 Midnighter

@Midnighter This sounds great. I was thinking in the same direction. This will also speed up the parsing of models with many warnings because we are not printing all the warnings, but just the overview warning. But it should be easy to get to the full set of warnings and issues with the model.

Jul 08 '20 09:07 matthiaskoenig

We are having the same problem, and in an extreme case it can crash the Python kernel. For example, a reaction with id A2____2___A is replaced with a non-alphanumeric ASCII character, which breaks the glpk solver and crashes the Python kernel.

The following code reproduces this error:

import cobra
model = cobra.io.load_model('BIOMD0000000586')

Crashing the Python kernel with:

glp_set_col_name: j = 27: column name contains invalid character(s)
Error detected in file ..\src\api\prob1.c at line 507

We have implemented this hack to temporally fix this, in case it is useful to anyone having this problem:

import cobra
from typing import Match
def _number_to_chr_safe(numberStr: Match) -> str:
    ascii = chr(int(numberStr.group(1)))
    return numberStr.group(1) if cobra.io.sbml.pattern_to_sbml.match(ascii) else ascii
cobra.io.sbml._number_to_chr = _number_to_chr_safe

Any advice for a better solution (besides implementing our own replacement functions) would be appreciated.

Jul 08 '22 14:07 dnlobo

My advice is

use valid SBML identifiers in your model
don't apply any replacements!

Personally I hate these replacements so much and they should be deactivated by default. It can't be that a software project has to deal with issues related to incorrect usage of a standard. E.g. nobody expects that cobrapy deals with invalid xml, but everybody expects that the parser should fix incorrect ids ?!

It would be great if we could have a vote of getting rid of the default replacements.

Jul 13 '22 09:07 matthiaskoenig

@dnlobo Just load your model without replacements

from cobra.io import read_sbml_model
model = read_sbml_model('BIOMD0000000586', f_replace={})

Jul 13 '22 09:07 matthiaskoenig

@matthiaskoenig thank you for your answer. I agree with you that the current replacements are very problematic, especially when they can crash the whole Python kernel. Removing them altogether and leave it to the user to define is certainly the easiest way to fix these problems in cobrapy.

Jul 14 '22 19:07 dnlobo

@matthiaskoenig, thank you for the workaround. My student lost around two weeks trying to get around this issue.

COBRAPY, you guys are awesome, please do not change model IDs without users explicitly asking for it.

Aug 18 '23 06:08 sshameer

Hello everyone, i try to create a model by using carveme -(with special universe named PhotoEukStein https://www.genoscope.cns.fr/PhotoEukStein/)

i obtain this one model_65.zip

i try to validate my model so i verified it : cobra.io.validate_sbml_model('./model_65.xml') but : glp_set_row_name: i = 1917: row name contains invalid character(s) Error detected in file api/prob1.c at line 450 Abandon (core dumped)

so as mention in #964 (https://github.com/opencobra/cobrapy/issues/964) it was the same problem of invalid characters so as @matthiaskoenig advice here i read the model with f_replace={} but now it's the long ids problem's.. So i take a look in #62 (https://github.com/matthiaskoenig/fbc_curation/issues/62) but i didn't achieve how to find and rename the affected id(s). i want to know how to do it to finally make my FBA. (it said here that normally this problem is fixed in optlang https://github.com/opencobra/cobrapy/issues/609 i'm a bit lost i have to admit ahah)

If everyone could help me i will be very gratefull !!

Feb 16 '24 16:02 Ed0u66

@Ed0u66 as a work around, if you install OSQP and HIGHSpy, you can use the hybrid optlang interface, which has no issues with "invalid" characters.

Feb 17 '24 14:02 Midnighter

cobrapy cobrapy copied to clipboard

Switch SBML parser defaults to not rewriting IDs

Problem description

Code Sample

Actual Output

Expected Output

cobrapy
cobrapy copied to clipboard