cobrapy icon indicating copy to clipboard operation
cobrapy copied to clipboard

Cannot export SBML when annotation is a list of lists

Open zakandrewking opened this issue 5 years ago • 16 comments

Problem description

The JSON format allows annotations as lists of lists, but trying to export such a model as SBML causes an error.

This is the case for the BiGG Models universal model (http://bigg.ucsd.edu/data_access). If that model is incorrectly formatted, I can fix it, but it might be good to also add some warning in COBRApy.

Might be related to #580. Originally reported as https://github.com/SBRG/bigg_models/issues/299.

(Thanks!!!!)

Code Sample

Model:

{
  "metabolites":[
    {
      "id":"4crsol_c",
      "name":"",
      "compartment":"",
      "annotation":[
        [
          "KEGG Compound",
          "http://identifiers.org/kegg.compound/C01468"
        ],
        [
          "CHEBI",
          "http://identifiers.org/chebi/CHEBI:11981"
        ]
      ]
    }
  ],
  "reactions":[],
  "genes":[],
  "id":"test",
  "compartments":{},
  "version":"1"
}

Script:

import cobra
m = cobra.io.load_json_model('test.json')
cobra.io.write_sbml_model(m, 'test.xml')

Actual Output

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-13-a2868fcb5a70> in <module>()
----> 1 cobra.io.write_sbml_model(m, 'test.xml')

/usr/local/lib/python3.6/site-packages/cobra/io/sbml3.py in write_sbml_model(cobra_model, filename, use_fbc_package, **kwargs)
    715         return
    716     # create xml
--> 717     xml = model_to_xml(cobra_model, **kwargs)
    718     write_args = {"encoding": "UTF-8", "xml_declaration": True}
    719     if _with_lxml:

/usr/local/lib/python3.6/site-packages/cobra/io/sbml3.py in model_to_xml(cobra_model, units)
    482                              hasOnlySubstanceUnits="false")
    483         set_attrib(species, "name", met.name)
--> 484         annotate_sbml_from_cobra(species, met)
    485         set_attrib(species, "compartment", met.compartment)
    486         set_attrib(species, "fbc:charge", met.charge)

/usr/local/lib/python3.6/site-packages/cobra/io/sbml3.py in annotate_sbml_from_cobra(sbml_element, cobra_element)
    242     bag = SubElement(SubElement(rdf_desc, ns("bqbiol:is")),
    243                      ns("rdf:Bag"))
--> 244     for provider, identifiers in sorted(iteritems(cobra_element.annotation)):
    245         if provider == "SBO":
    246             set_attrib(sbml_element, "sboTerm", identifiers)

/usr/local/lib/python3.6/site-packages/six.py in iteritems(d, **kw)
    585
    586     def iteritems(d, **kw):
--> 587         return iter(d.items(**kw))
    588
    589     def iterlists(d, **kw):

AttributeError: 'list' object has no attribute 'items'

Expected Output

Valid SBML model.

Dependency Information

Please paste the output of python -c "from cobra import show_versions; show_versions()" in the details block below.

System Information

OS Darwin OS-release 17.6.0 Python 3.6.5

Package Versions

cobra 0.13.0 depinfo 1.3.0 future 0.16.0 numpy 1.14.2 optlang 1.4.2 pandas 0.22.0 pip 10.0.1 ruamel.yaml 0.14.12 setuptools 39.2.0 six 1.11.0 swiglpk 1.4.4 tabulate 0.8.2 wheel 0.31.1

zakandrewking avatar Jul 16 '18 21:07 zakandrewking

This will be solved as part of #684

matthiaskoenig avatar Mar 18 '19 09:03 matthiaskoenig

This is a JSON import bug, not an SBML related bug. When reading the json the list of lists must be converted in correct annotations.

matthiaskoenig avatar Apr 12 '19 15:04 matthiaskoenig

I also had this issue when I was trying to parse universal_model to sbml which still shows annotations in the form of list of lists. But when I checked the json_schema for annotations, it specify annotations to be of type object :

       "annotation": {"type": "object"},

And performing a validation check over this json using the corresponding json_schema do points the error :

       from jsonschema import validate
       validate(json.load(file_handle), json_schema)

Error :

Traceback (most recent call last): File "issue736.py", line 110, in ans = validate(json.load(file_handle), json_schema) File "/home/hemant/environment/cobrapyenv/lib/python3.7/site-packages/jsonschema/validators.py", line 934, in validate raise error jsonschema.exceptions.ValidationError: [['KEGG Compound', 'http://identifiers.org/kegg.compound/C01468'], ['CHEBI', 'http://identifiers.org/chebi/CHEBI:11981']] is not of type 'object'

Failed validating 'type' in schema['properties']['metabolites']['items']['properties']['annotation']: {'type': 'object'}

On instance['metabolites'][0]['annotation']: [['KEGG Compound', 'http://identifiers.org/kegg.compound/C01468'], ['CHEBI', 'http://identifiers.org/chebi/CHEBI:11981']]


So I think json_scheme is upto date but the error exist with the models where annotations are list of lists. Because other models like e_coli_core where annotations values are json objects passes the validation test. So @matthiaskoenig , can you please help me understand how you want this issue to be solved. Should there be any change in the json_schema to allow annotations to be list of lists and further in model_from_dict() method of cobra.io.dict.py where the parsing is done? Thanks.

Hemant27031999 avatar Jan 26 '20 06:01 Hemant27031999

A JSON object is not the same as an object in Python but rather a something like a dict (https://restfulapi.net/json-objects/). @zakandrewking when you said that the JSON schema allows annotations as list which source did you refer to? The one currently implemented in cobrapy does not seem to allow it.

cdiener avatar Jan 27 '20 00:01 cdiener

I see this in a similar light to you, Christian. To me what should possibly happen in a next version of the JSON schema is:

  1. Implement biology qualifiers, thus annotations would turn into an object of lists of pairs (technically a list of list due to JSON not supporting tuples.
  2. There was an idea of ensuring that annotations are always lists rather than turning single element lists into strings.

An example of an annotation is then:

"annotation": {
  "bigg.reaction": [["is", "PFK26"]],
  "kegg.reaction": [["is", "R02732"]],
  "rhea": [["is", "15656"]]
}

Midnighter avatar Jan 27 '20 11:01 Midnighter

This looks good to me and has the advantage that is already valid under the current schema which makes it backwards compatible.

cdiener avatar Jan 27 '20 16:01 cdiener

A JSON object is not the same as an object in Python but rather a something like a dict (https://restfulapi.net/json-objects/). @zakandrewking when you said that the JSON schema allows annotations as list which source did you refer to? The one currently implemented in cobrapy does not seem to allow it.

I just tested the steps from my original posting, and they are still the case with COBRA 0.17.1. It's not that the schema is wrong but that COBRApy doesn't stop you from using it this way.

zakandrewking avatar Jan 27 '20 23:01 zakandrewking

Oh I see so the problem is that loading it does not throw an error because it should not pass the schema.

cdiener avatar Jan 28 '20 01:01 cdiener

That was also one thing that confused me. Though the json_schema is defined in cobra.core.json.py file, it is nowhere used as a validator to validate the model which we are loading from the external json file. The load_json_model(filename) method of cobra.core.json.py directly parse the json into dictionary without any validation.The above error which I caught came when I myself checked the json against the json_schema. If validation is done while parsing the model from the json file, the 'list of lists' error or any other possible error will be caught there only instead of it being rising at the time of making of sbml or any other kind of model.

Hemant27031999 avatar Jan 28 '20 14:01 Hemant27031999

I have some local code that is unfortunately not quite ready. It uses pydantic models to define and validate the schema. Because it is a little more work it is also delaying #720.

Midnighter avatar Jan 28 '20 14:01 Midnighter

@Hemant27031999 yeah I saw the the same thing. I think this is due to schema validation being really slow and reading JSON is currently really fast. Also most validation is performed in io/dict.py on the dict so it can be used for yaml as well. For the annotations one easy way could be to adjust the Object class directly and add a setter for annotations that only allows dicts. That looks like the safest option as it would propagate to all possible ways to get a model (de novo, from SBML, JSON, etc).

cdiener avatar Jan 28 '20 16:01 cdiener

Yeah, I think that can be the best thing to do. Not only it will stop the complete validation check on the json model using the schema that can make the process slow, it will even catch the error of "list of lists" while reading the model from the json file. Can I work on this issue and submit a PR for the same if you say?

Hemant27031999 avatar Jan 30 '20 19:01 Hemant27031999

Sure go ahead. The Object class is implemented in core/object.py. Would be great if you could also add a test as well. You can add me as a reviewer for the PR.

cdiener avatar Jan 30 '20 19:01 cdiener

I am working on my proposal for GSoC 2020 where I have to add the complete support for annotation in JSON format which, right now, have many issues. I have gone through the SBML level 3, version 2 specification doc for annotation in section 6 and have figured out following three feasible data format in SBML-JSON form.

SBML Annotation :

<annotation>
           <rdf:RDF …(other namespaces)... >
             <rdf:Description rdf:about="#G_b0351">
               <bqbiol:is>
                 <rdf:Bag>
                  	<rdf:li rdf:resource="http://identifiers.org/uniprot/P77580" />
    		 	<rdf:li rdf:resource="http://identifiers.org/uniprot/P0A6A3" />
                 </rdf:Bag>
               </bqbiol:is>
               <bqbiol:isEncodedBy>
                 <rdf:Bag>
                        <rdf:li rdf:resource="http://identifiers.org/asap/ABE-0001207" />
                        <rdf:li rdf:resource="http://identifiers.org/ecogene/EG13625" />
                        <rdf:li rdf:resource="http://identifiers.org/ncbigene/945008" />
                        <rdf:li rdf:resource="http://identifiers.org/ncbigene/945008" />
                 </rdf:Bag>
               </bqbiol:isEncodedBy>
             </rdf:Description>
           </rdf:RDF>
  </annotation>

Format 1 :

annotation = {
	“is” : [“uniprot/P77580”, “uniprot/P0A6A3”],
	“isEncodedBy” :  [“asap/ABE-0001207”, ”ecogene/EG13625”,
   	 	    “ncbigene/945008”, “ncbigene/945008”]
}

Format 2 :

annotation = {
	“uniprot” : [(“is”, “P77580”), (“is”, “P0A6A3”)],
	“asap” : [(“isEncodedBy” ,“ABE-0001207”)],
	“ecogene” : [(“isEncodedBy”, “EG13625”)],
	“ncbigene” : [(“isEncodedBy”,“945008”), (“isEncodedBy”,”945008”)]
}

Format 3:

annotation = {
	“uniprot” : {“is” : [“P77580” , “P0A6A3”]},
	“asap” : {“isEncodedBy” : [“ABE-0001207”]},
	“ecogene” : {“isEncodedBy” : [“EG13625”]},
	“ncbigene” : {“isEncodedBy” : [“945008”, ”945008”]}
} 

Out of the above three formats, format 2 is somewhat similar to one discussed above (suggested by @Midnighter ) in above discussion. This format is good, but it has a repetition of biological identifier which makes it a little bit lengthy. Format 1 has identifier as key and list of resources as value. It looks a little more compact and also follow structure similar to one followed in SBML itself i.e resources listed under biological qualifier name. Format 3 is also compact but it looks a little scattered. So I think format 1 will be most suitable out of the three. I thought it would be better if I once discuss this with the members of cobrapy. Please have a look at it and let me know if it is fine or not. Thanks.

Hemant27031999 avatar Mar 11 '20 13:03 Hemant27031999

Awesome, thank you for putting in the extra effort here @Hemant27031999. I have opted for format 2 in the pydantic models created in cobra-component-models. In particular, you can look at the definition of the class for annotations which is used in the common base class for species, compartments, and reactions.

I opted for this slightly more verbose option because it is closest to the current JSON schema.

I'm not a fan of format 1 at all because separating namespace and identifier will require a lot of string manipulation and it is the biggest change from the existing format.

Midnighter avatar Mar 11 '20 13:03 Midnighter

In my case, I just get rid of annotations completely:

for reaction in model.reactions: reaction.annotation = {}

ensakz avatar Oct 22 '20 07:10 ensakz

This has derailed a bit into discussing the new annotation format. The original issue has been fixed by enforcing the schema.

cdiener avatar Nov 04 '22 19:11 cdiener