RAVEN
RAVEN copied to clipboard
feat: importYaml and exportYaml functions
Description of the issue:
Adapt https://github.com/SysBioChalmers/Human-GEM/pull/392 for native support in RAVEN.
System information
- Please report:
- RAVEN version (stabile release,
devel
branch?) - Operating system (Windows/Mac/Linux; include version)
I hereby confirm that I have:
- [x] Followed the guidelines to install RAVEN.
- [x] Checked that a similar issue does not already exist
- [ ] If suitable, needed, asked first in the Gitter chat room about the issue
would be nice to preserve the git history when migrating the files
Not sure that is possible (or at least I don't know how to do it)? In addition, the function needs expansion to also read annotations and other fields that might not be in human-GEM
. Should be done in a few days.
would be nice to preserve the git history when migrating the files
@mihai-sysbio can you please help this
Preserving the detailed history might be too hard and potentially unnecessary, so I'm thinking an option would be to copy over several intermediate versions and commit these while specifying the author. In the commit message of the first commit I would propose to include a link to the PR in Human-GEM. And after this is done, I would imagine one or several commits that integrate these functions into RAVEN. This should provide a useful link to the original development.
The human-GEM function exportYaml
differs from RAVEN's writeYaml
(which I would like to rename to exportYaml
for consistency) in the following ways:
-
exportYaml
writes empty entries (e.g.- gene_reaction_rule: ""
if a reaction has an empty grRule.writeYaml
just does not write these entries. This seemed to have been introduced to facilitateimportYaml
(see line 234-245 here), but this is no longer required with the refactoredimportYaml
function. I suggest not to write empty lines. -
exportYaml
writes multiple eccodes in one line separated by;
whilewriteYaml
makes a list. This also seemed to have been introduced for the same reason as above. Again, I suggest to followwriteYaml
, as making a list is easier for diff-ing. -
exportYaml
has the following fields inmetaData
, that have different names inwriteYaml
, in the model and in the SBML file:
exportYaml |
writeYaml |
cobrapy (both python model structure and yaml file) | model. in MATLAB |
SBML | Example content |
---|---|---|---|---|---|
short_name | id | id | id |
id | Human-GEM |
full_name | name | description | description |
name | Generic genome-scale metabolic model of Homo sapiens |
github | sourceUrl | Content is not stored | annotation. sourceUrl |
Content is not stored | https://github.com/SysBioChalmers/Human-GEM |
description | note | Has model.notes, but does not seem to fill it when reading SBML. | annotation. note |
<notes> |
Genome-scale metabolic models are valuable tools to study metabolism and provide a scaffold for the integrative analysis of omics data. This is the latest version of Human-GEM, which is a genome-scale metabolic model of a generic human cell. The objective of Human-GEM is to serve as a community model for enabling integrative and mechanistic studies of human metabolism. |
From this table, I suggest to keep writeYaml
behaviour. This change will once be noticed when committing the updated human-GEM.yml
, but I assume the YAML file is not used for any other purposes which could break by changing field names?
Alternatively, can also hard-code human-GEM
-style as a flag.
These adjustments of integrating exportYaml
and writeYaml
seem reasonable, please go ahead.
My suggestion is that it might good to merge the two into one as writeYaml
, and rename importYaml
to readYaml
in RAVEN. This would avoid potential conflict (no changes has to be made to Human-GEM) and ensure a smooth transition once the pair of readYaml
/writeYaml
is matured later. Then the Human-GEM importYaml
/exportYaml
could be just deprecated after reformatting model structure to a more stable scheme. What do you think? @edkerk
As Metabolic Atlas relies on the YAML file to import models, all of the proposed changes would lead to breaking the import of models.
Here are my thoughts on the proposed changes:
exportYaml
writes empty entries (e.g.- gene_reaction_rule: ""
if a reaction has an empty grRule
To me, this has nothing to do with how the file is parsed. Even though the file is longer with empty values, a missing line can be difficult to interpret: was it a problem (exception) in writing that field, or is it really a missing value? Here I propose to follow the same thinking as in SBML, i.e. if they skip lines so may we.
exportYaml
writes multiple eccodes in one line separated by;
whilewriteYaml
makes a list. This also seemed to have been introduced for the same reason as above.
exportYaml
does the same for the references (PMIDs). I do see that it is easier diff-ing, but at the same time these are one-line changes of relatively short lines, so diff-ing is not that big of a problem to begin with. What matters more to me here is that missing values are easier to interpret as ""
, and that the resulting file will have fewer lines.
exportYaml
has the following fields inmetaData
, that have different names inwriteYaml
As a supporter of the metaData
section (could also be metadata
), I find some of the fields in writeYaml
potentially misleading:
-
id
: identifiers are meant to be unique, which is why in SBML they combine thename
with theversion
. However, value of this field is meant to be the same across versions, so I would favour a variant of*name
for this field. -
name
: if the other field is thename
, this is an expanded version of thename
. And I second that this is not thedescription
, since normally that can be interpreted as a sequence of sentences, which is really not the purpose of this field. -
sourceUrl
: I think this is a needed field in the YAML format, but there is a different meaning forgithub
, since that can be parsed in a different way, and can be represented with a GH logo as well. My suggestion would be to keep both fields, and to use thesourceUrl
for a version of the GH link that points to the release instead. -
note
: Thinking back of the reaction notes, anote
field is generally overlooked. Instead, the purpose of adescription
field is to have a larger chunk of text that gives much more information about the model than is presented in any of the prior fields, but that is still meant to be read (or shown on a website that integrates the model like Metabolic Atlas does).
In the specific case of Metabolic Atlas, we also have a different json
file where we could store these fields. The reason I am going so much into detail is that any website that would want to integrate the YAML format like we do would have the same problem, and really this is the easiest format to parse and work with in my view.
As Metabolic Atlas relies on the YAML file to import models, all of the proposed changes would lead to breaking the import of models.
Then it's great to get your input @mihai-sysbio, thanks!
Here are my thoughts on the proposed changes:
exportYaml
writes empty entries (e.g.- gene_reaction_rule: ""
if a reaction has an empty grRuleTo me, this has nothing to do with how the file is parsed. Even though the file is longer with empty values, a missing line can be difficult to interpret: was it a problem (exception) in writing that field, or is it really a missing value? Here I propose to follow the same thinking as in SBML, i.e. if they skip lines so may we.
To make sure I understand correctly: follow writeYaml
behaviour and do not write empty lines.
exportYaml
writes multiple eccodes in one line separated by;
whilewriteYaml
makes a list. This also seemed to have been introduced for the same reason as above.
exportYaml
does the same for the references (PMIDs). I do see that it is easier diff-ing, but at the same time these are one-line changes of relatively short lines, so diff-ing is not that big of a problem to begin with. What matters more to me here is that missing values are easier to interpret as""
, and that the resulting file will have fewer lines.
To make sure I understand correctly: you again advocate not to write empty lines (see above), but you're indifferent about keeping one-lines or splitting them into list. I also do not have a strong opinion about this one, but splitting into list would be more consistent behaviour when looking e.g. at subsystems and references.
exportYaml
has the following fields inmetaData
, that have different names inwriteYaml
As a supporter of the
metaData
section (could also bemetadata
), I find some of the fields inwriteYaml
potentially misleading:
The idea has been to remain close to the SBML file format, to avoid confusion.
id
: identifiers are meant to be unique, which is why in SBML they combine thename
with theversion
. However, value of this field is meant to be the same across versions, so I would favour a variant of*name
for this field.
Here I need to correct that SBML does not combine name
with version
, that is something we started to do with yeast-GEM
as SBML does not have a dedicated version field. I could not find anything in the SBML specification, or BioModels.net qualifiers on how to indicate a model version. Note that human-GEM.xml
currently has no mention of version number.
Regardless, id
is a pretty bad spot to store version number anyway, as period is an illegal character in SBML ids and results in something like: <model metaid="yeastGEM_v8.6.0" id="yeastGEM_v8__46__6__46__0" name="The Consensus Genome-Scale Metabolic Model of Yeast" fbc:strict="true">
.
So is the core of this question then not whether id
and name
should instead be named short_name
and full_name
, but rather where the version information should be stored in the SBML file? It seems odd to me to redefine this field as short_name
, while SBML and (afaik) all modelling software call it id
.
name
: if the other field is thename
, this is an expanded version of thename
. And I second that this is not thedescription
, since normally that can be interpreted as a sequence of sentences, which is really not the purpose of this field.
See above, my argument for keeping id
, I see no direct reason for full_name
. Looking at consistency there are then still two options: description
of name
. Either of these are fine for me, but it of course depends on the issue two bullet-points down.
sourceUrl
: I think this is a needed field in the YAML format, but there is a different meaning forgithub
, since that can be parsed in a different way, and can be represented with a GH logo as well. My suggestion would be to keep both fields, and to use thesourceUrl
for a version of the GH link that points to the release instead.
Sure, both can be supported, and the user can decide which to use.
Would also be nice to represent this in the SBML file, but similar as with the model version, there does not seem to be a defined way to do this.
note
: Thinking back of the reaction notes, anote
field is generally overlooked. Instead, the purpose of adescription
field is to have a larger chunk of text that gives much more information about the model than is presented in any of the prior fields, but that is still meant to be read (or shown on a website that integrates the model like Metabolic Atlas does).
See above about description
.
In the specific case of Metabolic Atlas, we also have a different
json
file where we could store these fields. The reason I am going so much into detail is that any website that would want to integrate the YAML format like we do would have the same problem, and really this is the easiest format to parse and work with in my view.
Sure, I completely agree with making the YAML file as versatile as possible! Still, considering that XML is the standard file format for systems biology models, it would be good not to neglect it and probably not to unnecessarily deviate too much from it.
To summarize (checked where we agree):
- [x] do not write empty lines
- [ ] write ec-codes as single line or list (still undecided, either way fine for me)
- [x] keep
id
(or whatever name the field will have) constant and store version number elsewhere - [ ] use
id
orshort_name
(still undecided, I advocate forid
) - [ ] use
name
,full_name
ordescription
(still undecided,full_name
only makes sense ifshort_name
is also used, eithername
ordescription
seems fine for me) - [x] keep both
sourceUrl
andgithub
as alternative fields. User can e.g. decide to link to a specific release in thesourceUrl
. - [ ] use
note
ordescription
(still undecided, either are fine for me, but there should of course not be 2description
fields) - [ ] unclear where in SBML to store
version
,sourceUrl
,github
.
These adjustments of integrating
exportYaml
andwriteYaml
seem reasonable, please go ahead.My suggestion is that it might good to merge the two into one as
writeYaml
, and renameimportYaml
toreadYaml
in RAVEN. This would avoid potential conflict (no changes has to be made to Human-GEM) and ensure a smooth transition once the pair ofreadYaml
/writeYaml
is matured later. Then the Human-GEMimportYaml
/exportYaml
could be just deprecated after reformatting model structure to a more stable scheme. What do you think? @edkerk
Just to note that importYaml
on RAVEN already gives identical results as the one on human-GEM
repo, it is just faster and loads additional information (e.g. annotations). The issues I raised here mainly have to do with exportYaml
.
I would have preferred to use exportYaml
and importYaml
, to be consistent with RAVEN's importModel
and exportModel
. But actually, these latter RAVEN functions can be reorganized as wrappers for both SBML and YAML I/O. Dependent on the specified file extension, it imports/exports SBML or YAML files. The SBML I/O functionality would then be moved to readSBMLmodel
and writeSBMLmodel
(readSBML
and writeSBML
are already in COBRA), and the YAML functions are named readYAMLmodel
and writeYAMLmodel
(to at least have some consistency :smile: ).