sssom-py

Implementing separate methods for JSON and JSONLD

Open • matentzn opened this issue 1 year ago • 1 comment

This PR adds the following methods:

  • parse_sssom_jsonld
  • from_sssom_jsonld
  • write_jsonld
  • to_jsonld
  • test_parse_sssom_jsonld
  • test_write_sssom_jsonld

These are exactly analogous to what was there before for JSON.

Its actual purpose, however, is not so much to add those methods as to carefully review the format (to make sure we are happy with it) so we can start making headway on https://github.com/mapping-commons/sssom/issues/321.

Breaking changes

  • The json parameter now refers to plain JSON; it used to refer to JSON-LD. Anyone expecting JSON-LD will now be served plain JSON.

JSON Format

We need to make sure that the JSON format looks exactly as we envision it. Problems I see so far:

  • The biggest shortcoming of the JSON format at the moment is that it does not have a curie_map. We will probably have to https://github.com/mapping-commons/sssom/issues/225

Here is an example JSON file:
{
  "mapping_set_id": "https://w3id.org/sssom/mapping/tests/data/basic.tsv",
  "license": "https://creativecommons.org/publicdomain/zero/1.0/",
  "mappings": [
    {
      "subject_id": "a:something",
      "predicate_id": "rdfs:subClassOf",
      "object_id": "b:something",
      "mapping_justification": "semapv:LexicalMatching",
      "subject_label": "XXXXX",
      "subject_category": "biolink:AnatomicalEntity",
      "object_label": "xxxxxx",
      "object_category": "biolink:AnatomicalEntity",
      "subject_source": "a:example",
      "object_source": "b:example",
      "mapping_tool": "rdf_matcher",
      "confidence": 0.8,
      "subject_match_field": [
        "rdfs:label"
      ],
      "object_match_field": [
        "rdfs:label"
      ],
      "match_string": [
        "xxxxx"
      ],
      "comment": "mock data"
    },
    {
      "subject_id": "a:something",
      "predicate_id": "owl:equivalentClass",
      "object_id": "c:something",
      "mapping_justification": "semapv:LexicalMatching",
      "subject_label": "XYXYX",
      "subject_category": "biolink:AnatomicalEntity",
      "object_label": "xyxyxy",
      "object_category": "biolink:AnatomicalEntity",
      "subject_source": "a:example",
      "object_source": "c:example",
      "mapping_tool": "rdf_matcher",
      "confidence": 0.83,
      "subject_match_field": [
        "rdfs:label"
      ],
      "object_match_field": [
        "rdfs:label"
      ],
      "match_string": [
        "xxxxx"
      ],
      "comment": "mock data"
    }
  ],
  "creator_id": [
    "orcid:1234",
    "orcid:5678"
  ],
  "mapping_tool": "https://github.com/cmungall/rdf_matcher",
  "mapping_date": "2020-05-30"
}

The two remaining test failures are also caused by exactly this problem:

FAILED tests/test_conversion.py::SSSOMReadWriteTestSuite::test_conversion - AssertionError: 6 != 8 : JSON document has less elements than the orginal one for basic.tsv. Json: {"mapping_set_id": "https:...
FAILED tests/test_parsers.py::TestParseExplicit::test_round_trip_json - ValueError: {'UMLS', 'orcid', 'DOID'} are used in the SSSOM mapping set but it does not exist in the prefix map
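
To make the second failure concrete, here is a minimal sketch of how a missing curie_map leads to prefixes that cannot be resolved. This is not the actual sssom-py validation code: the helper names used_prefixes and missing_prefixes, and the choice of which slots hold CURIEs, are assumptions made purely for illustration.

```python
from typing import Any, Dict, Set

# Slots assumed (for this sketch only) to hold CURIE values.
CURIE_SLOTS = ("subject_id", "predicate_id", "object_id", "mapping_justification")


def used_prefixes(doc: Dict[str, Any]) -> Set[str]:
    """Collect the prefixes of CURIE-valued slots across all mappings."""
    prefixes: Set[str] = set()
    for mapping in doc.get("mappings", []):
        for slot in CURIE_SLOTS:
            value = mapping.get(slot)
            # Skip absent values and anything that already looks like a full IRI.
            if isinstance(value, str) and ":" in value and "://" not in value:
                prefixes.add(value.split(":", 1)[0])
    return prefixes


def missing_prefixes(doc: Dict[str, Any], prefix_map: Dict[str, str]) -> Set[str]:
    """Return prefixes that are used in the document but absent from prefix_map."""
    return used_prefixes(doc) - set(prefix_map)


# A document shaped like the example above, which carries no curie_map of its own.
doc = {
    "mappings": [
        {
            "subject_id": "a:something",
            "predicate_id": "rdfs:subClassOf",
            "object_id": "b:something",
            "mapping_justification": "semapv:LexicalMatching",
        }
    ]
}
# A reader can only fall back on whatever built-in prefixes it knows.
prefix_map = {
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "semapv": "https://w3id.org/semapv/vocab/",
}
print(sorted(missing_prefixes(doc, prefix_map)))  # ['a', 'b']
```

Because the JSON file cannot declare its own prefixes, any prefix outside the reader's defaults ends up in the "missing" set, which is the situation the ValueError above reports.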

matentzn • Feb 04 '24 11:02

We will probably have to https://github.com/mapping-commons/sssom/issues/225

The problem we might run into with that is that, as far as I know (and as I have noted in the discussion about the extension slots), LinkML does not have a map type. We’d want to declare a field that could be used like this:

"curie_map": {
  "FBbt": "http://purl.obolibrary.org/obo/FBbt_"
}

but unless I missed something in LinkML’s docs, this is not possible. All we can do is to have a list (i.e. a “multi-valued” field) of custom “dictionary entry” types, like this:

"curie_map": [
  { "key": "FBbt",
    "value": "http://purl.obolibrary.org/obo/FBbt_" }
]

which of course would work but would be… weird, at the very least.
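
For what it is worth, converting between the two shapes is mechanical. The sketch below assumes the entry type uses the field names "key" and "value" as in the example above; the function names are hypothetical and not part of sssom-py or LinkML.

```python
from typing import Dict, List


def curie_map_to_entries(curie_map: Dict[str, str]) -> List[Dict[str, str]]:
    """Turn {"prefix": "uri_prefix"} into a list of key/value entries."""
    return [{"key": prefix, "value": uri} for prefix, uri in curie_map.items()]


def entries_to_curie_map(entries: List[Dict[str, str]]) -> Dict[str, str]:
    """Turn a list of key/value entries back into a plain dictionary."""
    curie_map: Dict[str, str] = {}
    for entry in entries:
        if entry["key"] in curie_map:
            # A list, unlike a JSON object, does not rule out repeated keys.
            raise ValueError(f"duplicate prefix: {entry['key']!r}")
        curie_map[entry["key"]] = entry["value"]
    return curie_map


as_dict = {"FBbt": "http://purl.obolibrary.org/obo/FBbt_"}
as_entries = curie_map_to_entries(as_dict)
assert entries_to_curie_map(as_entries) == as_dict
```

The duplicate check hints at one practical cost of the list form: uniqueness of prefixes has to be enforced by the reader rather than by the data structure.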

My own solution to that (which nobody will like, I know) is simple: decide that CURIEfied identifiers are only for the TSV format (which is what the spec currently says, incidentally) and that JSON should only contain full-length identifiers. No CURIE map needed, problem solved.
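
As a rough sketch of what that would mean for a writer: expand every identifier with the prefix map before serialising, so the JSON output needs no curie_map at all. The function names, the list of slots to expand, and the local identifier 00000001 are all made up for illustration; this is not how sssom-py currently behaves.

```python
from typing import Dict, Tuple

# Slots assumed (for this sketch only) to hold entity identifiers.
ID_SLOTS: Tuple[str, ...] = ("subject_id", "predicate_id", "object_id")


def expand_curie(value: str, prefix_map: Dict[str, str]) -> str:
    """Expand a CURIE to a full-length identifier; pass full IRIs through unchanged."""
    if "://" in value:
        return value
    prefix, sep, local_id = value.partition(":")
    if not sep or prefix not in prefix_map:
        raise ValueError(f"cannot expand {value!r}: unknown prefix")
    return prefix_map[prefix] + local_id


def expand_mapping(mapping: Dict[str, str], prefix_map: Dict[str, str]) -> Dict[str, str]:
    """Return a copy of the mapping with its identifier slots expanded."""
    expanded = dict(mapping)
    for slot in ID_SLOTS:
        if slot in expanded:
            expanded[slot] = expand_curie(expanded[slot], prefix_map)
    return expanded


prefix_map = {
    "FBbt": "http://purl.obolibrary.org/obo/FBbt_",
    "owl": "http://www.w3.org/2002/07/owl#",
}
mapping = {
    "subject_id": "FBbt:00000001",  # made-up local identifier
    "predicate_id": "owl:equivalentClass",
    "object_id": "http://example.org/some/term",  # already full-length
}
expanded = expand_mapping(mapping, prefix_map)
```

The prefix map is then only needed on the TSV side, at the point where identifiers are contracted or expanded.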

gouttegd • Feb 05 '24 10:02