RAVEN icon indicating copy to clipboard operation
RAVEN copied to clipboard

feat: importYaml and exportYaml functions

Open edkerk opened this issue 2 years ago • 9 comments

Description of the issue:

Adapt https://github.com/SysBioChalmers/Human-GEM/pull/392 for native support in RAVEN.

System information

  • Please report:
  1. RAVEN version (stabile release, devel branch?)
  2. Operating system (Windows/Mac/Linux; include version)

I hereby confirm that I have:

edkerk avatar Jun 15 '22 14:06 edkerk

would be nice to preserve the git history when migrating the files

haowang-bioinfo avatar Jun 17 '22 00:06 haowang-bioinfo

Not sure that is possible (or at least I don't know how to do it)? In addition, the function needs expansion to also read annotations and other fields that might not be in human-GEM. Should be done in a few days.

edkerk avatar Jun 17 '22 07:06 edkerk

would be nice to preserve the git history when migrating the files

@mihai-sysbio can you please help this

haowang-bioinfo avatar Jun 17 '22 08:06 haowang-bioinfo

Preserving the detailed history might be too hard and potentially unnecessary, so I'm thinking an option would be to copy over several intermediate versions and commit these while specifying the author. In the commit message of the first commit I would propose to include a link to the PR in Human-GEM. And after this is done, I would imagine one or several commits that integrate these functions into RAVEN. This should provide a useful link to the original development.

mihai-sysbio avatar Jun 17 '22 09:06 mihai-sysbio

The human-GEM function exportYaml differs from RAVEN's writeYaml (which I would like to rename to exportYaml for consistency) in the following ways:

  • exportYaml writes empty entries (e.g. - gene_reaction_rule: "" if a reaction has an empty grRule. writeYaml just does not write these entries. This seemed to have been introduced to facilitate importYaml (see line 234-245 here), but this is no longer required with the refactored importYaml function. I suggest not to write empty lines.
  • exportYaml writes multiple eccodes in one line separated by ; while writeYaml makes a list. This also seemed to have been introduced for the same reason as above. Again, I suggest to follow writeYaml, as making a list is easier for diff-ing.
  • exportYaml has the following fields in metaData, that have different names in writeYaml, in the model and in the SBML file:
exportYaml writeYaml cobrapy (both python model structure and yaml file) model. in MATLAB SBML Example content
short_name id id id id Human-GEM
full_name name description description name Generic genome-scale metabolic model of Homo sapiens
github sourceUrl Content is not stored annotation. sourceUrl Content is not stored https://github.com/SysBioChalmers/Human-GEM
description note Has model.notes, but does not seem to fill it when reading SBML. annotation. note <notes> Genome-scale metabolic models are valuable tools to study metabolism and provide a scaffold for the integrative analysis of omics data. This is the latest version of Human-GEM, which is a genome-scale metabolic model of a generic human cell. The objective of Human-GEM is to serve as a community model for enabling integrative and mechanistic studies of human metabolism.

From this table, I suggest to keep writeYaml behaviour. This change will once be noticed when committing the updated human-GEM.yml, but I assume the YAML file is not used for any other purposes which could break by changing field names? Alternatively, can also hard-code human-GEM-style as a flag.

edkerk avatar Jun 21 '22 23:06 edkerk

These adjustments of integrating exportYaml and writeYaml seem reasonable, please go ahead.

My suggestion is that it might good to merge the two into one as writeYaml, and rename importYaml to readYaml in RAVEN. This would avoid potential conflict (no changes has to be made to Human-GEM) and ensure a smooth transition once the pair of readYaml/writeYaml is matured later. Then the Human-GEM importYaml/exportYaml could be just deprecated after reformatting model structure to a more stable scheme. What do you think? @edkerk

haowang-bioinfo avatar Jun 23 '22 04:06 haowang-bioinfo

As Metabolic Atlas relies on the YAML file to import models, all of the proposed changes would lead to breaking the import of models.

Here are my thoughts on the proposed changes:

  • exportYaml writes empty entries (e.g. - gene_reaction_rule: "" if a reaction has an empty grRule

To me, this has nothing to do with how the file is parsed. Even though the file is longer with empty values, a missing line can be difficult to interpret: was it a problem (exception) in writing that field, or is it really a missing value? Here I propose to follow the same thinking as in SBML, i.e. if they skip lines so may we.

  • exportYaml writes multiple eccodes in one line separated by ; while writeYaml makes a list. This also seemed to have been introduced for the same reason as above.

exportYaml does the same for the references (PMIDs). I do see that it is easier diff-ing, but at the same time these are one-line changes of relatively short lines, so diff-ing is not that big of a problem to begin with. What matters more to me here is that missing values are easier to interpret as "", and that the resulting file will have fewer lines.

  • exportYaml has the following fields in metaData, that have different names in writeYaml

As a supporter of the metaData section (could also be metadata), I find some of the fields in writeYaml potentially misleading:

  • id: identifiers are meant to be unique, which is why in SBML they combine the name with the version. However, value of this field is meant to be the same across versions, so I would favour a variant of *name for this field.
  • name: if the other field is the name, this is an expanded version of the name. And I second that this is not the description, since normally that can be interpreted as a sequence of sentences, which is really not the purpose of this field.
  • sourceUrl: I think this is a needed field in the YAML format, but there is a different meaning for github, since that can be parsed in a different way, and can be represented with a GH logo as well. My suggestion would be to keep both fields, and to use the sourceUrl for a version of the GH link that points to the release instead.
  • note: Thinking back of the reaction notes, a note field is generally overlooked. Instead, the purpose of a description field is to have a larger chunk of text that gives much more information about the model than is presented in any of the prior fields, but that is still meant to be read (or shown on a website that integrates the model like Metabolic Atlas does).

In the specific case of Metabolic Atlas, we also have a different json file where we could store these fields. The reason I am going so much into detail is that any website that would want to integrate the YAML format like we do would have the same problem, and really this is the easiest format to parse and work with in my view.

mihai-sysbio avatar Jun 23 '22 05:06 mihai-sysbio

As Metabolic Atlas relies on the YAML file to import models, all of the proposed changes would lead to breaking the import of models.

Then it's great to get your input @mihai-sysbio, thanks!

Here are my thoughts on the proposed changes:

  • exportYaml writes empty entries (e.g. - gene_reaction_rule: "" if a reaction has an empty grRule

To me, this has nothing to do with how the file is parsed. Even though the file is longer with empty values, a missing line can be difficult to interpret: was it a problem (exception) in writing that field, or is it really a missing value? Here I propose to follow the same thinking as in SBML, i.e. if they skip lines so may we.

To make sure I understand correctly: follow writeYaml behaviour and do not write empty lines.

  • exportYaml writes multiple eccodes in one line separated by ; while writeYaml makes a list. This also seemed to have been introduced for the same reason as above.

exportYaml does the same for the references (PMIDs). I do see that it is easier diff-ing, but at the same time these are one-line changes of relatively short lines, so diff-ing is not that big of a problem to begin with. What matters more to me here is that missing values are easier to interpret as "", and that the resulting file will have fewer lines.

To make sure I understand correctly: you again advocate not to write empty lines (see above), but you're indifferent about keeping one-lines or splitting them into list. I also do not have a strong opinion about this one, but splitting into list would be more consistent behaviour when looking e.g. at subsystems and references.

  • exportYaml has the following fields in metaData, that have different names in writeYaml

As a supporter of the metaData section (could also be metadata), I find some of the fields in writeYaml potentially misleading:

The idea has been to remain close to the SBML file format, to avoid confusion.

  • id: identifiers are meant to be unique, which is why in SBML they combine the name with the version. However, value of this field is meant to be the same across versions, so I would favour a variant of *name for this field.

Here I need to correct that SBML does not combine name with version, that is something we started to do with yeast-GEM as SBML does not have a dedicated version field. I could not find anything in the SBML specification, or BioModels.net qualifiers on how to indicate a model version. Note that human-GEM.xml currently has no mention of version number.

Regardless, id is a pretty bad spot to store version number anyway, as period is an illegal character in SBML ids and results in something like: <model metaid="yeastGEM_v8.6.0" id="yeastGEM_v8__46__6__46__0" name="The Consensus Genome-Scale Metabolic Model of Yeast" fbc:strict="true">.

So is the core of this question then not whether id and name should instead be named short_name and full_name, but rather where the version information should be stored in the SBML file? It seems odd to me to redefine this field as short_name, while SBML and (afaik) all modelling software call it id.

  • name: if the other field is the name, this is an expanded version of the name. And I second that this is not the description, since normally that can be interpreted as a sequence of sentences, which is really not the purpose of this field.

See above, my argument for keeping id, I see no direct reason for full_name. Looking at consistency there are then still two options: description of name. Either of these are fine for me, but it of course depends on the issue two bullet-points down.

  • sourceUrl: I think this is a needed field in the YAML format, but there is a different meaning for github, since that can be parsed in a different way, and can be represented with a GH logo as well. My suggestion would be to keep both fields, and to use the sourceUrl for a version of the GH link that points to the release instead.

Sure, both can be supported, and the user can decide which to use.

Would also be nice to represent this in the SBML file, but similar as with the model version, there does not seem to be a defined way to do this.

  • note: Thinking back of the reaction notes, a note field is generally overlooked. Instead, the purpose of a description field is to have a larger chunk of text that gives much more information about the model than is presented in any of the prior fields, but that is still meant to be read (or shown on a website that integrates the model like Metabolic Atlas does).

See above about description.

In the specific case of Metabolic Atlas, we also have a different json file where we could store these fields. The reason I am going so much into detail is that any website that would want to integrate the YAML format like we do would have the same problem, and really this is the easiest format to parse and work with in my view.

Sure, I completely agree with making the YAML file as versatile as possible! Still, considering that XML is the standard file format for systems biology models, it would be good not to neglect it and probably not to unnecessarily deviate too much from it.

To summarize (checked where we agree):

  • [x] do not write empty lines
  • [ ] write ec-codes as single line or list (still undecided, either way fine for me)
  • [x] keep id (or whatever name the field will have) constant and store version number elsewhere
  • [ ] use id or short_name (still undecided, I advocate for id)
  • [ ] use name, full_name or description (still undecided, full_name only makes sense if short_name is also used, either name or description seems fine for me)
  • [x] keep both sourceUrl and github as alternative fields. User can e.g. decide to link to a specific release in the sourceUrl.
  • [ ] use note or description (still undecided, either are fine for me, but there should of course not be 2 description fields)
  • [ ] unclear where in SBML to store version, sourceUrl, github.

edkerk avatar Jun 23 '22 11:06 edkerk

These adjustments of integrating exportYaml and writeYaml seem reasonable, please go ahead.

My suggestion is that it might good to merge the two into one as writeYaml, and rename importYaml to readYaml in RAVEN. This would avoid potential conflict (no changes has to be made to Human-GEM) and ensure a smooth transition once the pair of readYaml/writeYaml is matured later. Then the Human-GEM importYaml/exportYaml could be just deprecated after reformatting model structure to a more stable scheme. What do you think? @edkerk

Just to note that importYaml on RAVEN already gives identical results as the one on human-GEM repo, it is just faster and loads additional information (e.g. annotations). The issues I raised here mainly have to do with exportYaml.

I would have preferred to use exportYaml and importYaml, to be consistent with RAVEN's importModel and exportModel. But actually, these latter RAVEN functions can be reorganized as wrappers for both SBML and YAML I/O. Dependent on the specified file extension, it imports/exports SBML or YAML files. The SBML I/O functionality would then be moved to readSBMLmodel and writeSBMLmodel (readSBML and writeSBML are already in COBRA), and the YAML functions are named readYAMLmodel and writeYAMLmodel (to at least have some consistency :smile: ).

edkerk avatar Jun 23 '22 11:06 edkerk