sssom-py
Implementing separate methods for JSON and JSONLD
This PR adds the following methods and tests:
- parse_sssom_jsonld
- from_sssom_jsonld
- write_jsonld
- to_jsonld
- test_parse_sssom_jsonld
- test_write_sssom_jsonld
These are exactly analogous to what already existed for JSON.
The actual purpose of this PR, however, is less to add those methods than to carefully review the format (to make sure we are happy with it) so we can start making headway on https://github.com/mapping-commons/sssom/issues/321.
Breaking changes
The json parameter now refers to json, but used to refer to jsonld. So anyone expecting jsonld will now be served json.
JSON Format
We need to make sure that the JSON format looks exactly as we envision it. Problems I see so far:
- The biggest shortcoming of the JSON format at the moment is that it does not have a curie_map. We will probably have to https://github.com/mapping-commons/sssom/issues/225
Here is an example JSON file:
{
  "mapping_set_id": "https://w3id.org/sssom/mapping/tests/data/basic.tsv",
  "license": "https://creativecommons.org/publicdomain/zero/1.0/",
  "mappings": [
    {
      "subject_id": "a:something",
      "predicate_id": "rdfs:subClassOf",
      "object_id": "b:something",
      "mapping_justification": "semapv:LexicalMatching",
      "subject_label": "XXXXX",
      "subject_category": "biolink:AnatomicalEntity",
      "object_label": "xxxxxx",
      "object_category": "biolink:AnatomicalEntity",
      "subject_source": "a:example",
      "object_source": "b:example",
      "mapping_tool": "rdf_matcher",
      "confidence": 0.8,
      "subject_match_field": [
        "rdfs:label"
      ],
      "object_match_field": [
        "rdfs:label"
      ],
      "match_string": [
        "xxxxx"
      ],
      "comment": "mock data"
    },
    {
      "subject_id": "a:something",
      "predicate_id": "owl:equivalentClass",
      "object_id": "c:something",
      "mapping_justification": "semapv:LexicalMatching",
      "subject_label": "XYXYX",
      "subject_category": "biolink:AnatomicalEntity",
      "object_label": "xyxyxy",
      "object_category": "biolink:AnatomicalEntity",
      "subject_source": "a:example",
      "object_source": "c:example",
      "mapping_tool": "rdf_matcher",
      "confidence": 0.83,
      "subject_match_field": [
        "rdfs:label"
      ],
      "object_match_field": [
        "rdfs:label"
      ],
      "match_string": [
        "xxxxx"
      ],
      "comment": "mock data"
    }
  ],
  "creator_id": [
    "orcid:1234",
    "orcid:5678"
  ],
  "mapping_tool": "https://github.com/cmungall/rdf_matcher",
  "mapping_date": "2020-05-30"
}
The two remaining test failures are due to exactly this problem:
FAILED tests/test_conversion.py::SSSOMReadWriteTestSuite::test_conversion - AssertionError: 6 != 8 : JSON document has less elements than the orginal one for basic.tsv. Json: {"mapping_set_id": "https:...
FAILED tests/test_parsers.py::TestParseExplicit::test_round_trip_json - ValueError: {'UMLS', 'orcid', 'DOID'} are used in the SSSOM mapping set but it does not exist in the prefix map
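To make the second failure concrete, here is a minimal sketch of the kind of check that trips when a JSON document carries CURIEs but no prefix map. The helper names (used_prefixes, missing_prefixes) and the list of slots inspected are assumptions made for illustration, not the actual sssom-py implementation; the prefix expansions are placeholder values.

```python
from typing import Any, Dict, Iterable, Set

# Assumption: only these slots are inspected. The real check in sssom-py
# may look at more (or different) slots.
ENTITY_SLOTS = ("subject_id", "predicate_id", "object_id", "mapping_justification")


def used_prefixes(doc: Dict[str, Any], slots: Iterable[str] = ENTITY_SLOTS) -> Set[str]:
    """Collect the CURIE prefixes used in the mappings of a JSON document."""
    slots = tuple(slots)
    prefixes: Set[str] = set()
    for mapping in doc.get("mappings", []):
        for slot in slots:
            value = mapping.get(slot)
            # Skip missing values and anything that looks like a full IRI.
            if isinstance(value, str) and ":" in value and "://" not in value:
                prefixes.add(value.split(":", 1)[0])
    return prefixes


def missing_prefixes(doc: Dict[str, Any], prefix_map: Dict[str, str]) -> Set[str]:
    """Return the prefixes used in the document but absent from the prefix map."""
    return used_prefixes(doc) - set(prefix_map)


# A cut-down version of the first mapping from the example file.
doc = {
    "mappings": [
        {
            "subject_id": "a:something",
            "predicate_id": "rdfs:subClassOf",
            "object_id": "b:something",
            "mapping_justification": "semapv:LexicalMatching",
        }
    ]
}

# Illustrative prefix map that deliberately lacks "a" and "b".
prefix_map = {
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "semapv": "https://w3id.org/semapv/vocab/",
}

print(sorted(missing_prefixes(doc, prefix_map)))
```

Without a curie_map in the JSON file itself, a reader has no way to fill in the missing entries, which is why the round trip cannot succeed for prefixes outside the built-in ones.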
We will probably have to https://github.com/mapping-commons/sssom/issues/225
The problem we might run into with that is that, as far as I know (and as I have noted in the discussion about the extension slots), LinkML does not have a map type. We’d want to declare a field that could be used like this:
"curie_map": {
  "FBbt": "http://purl.obolibrary.org/obo/FBbt_"
}
but unless I missed something in LinkML’s docs, this is not possible. All we can do is to have a list (i.e. a “multi-valued” field) of custom “dictionary entry” types, like this:
"curie_map": [
  {
    "key": "FBbt",
    "value": "http://purl.obolibrary.org/obo/FBbt_"
  }
]
which of course would work but would be… weird, at the very least.
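For what it is worth, going back and forth between the two shapes is cheap on the Python side. The sketch below is only an illustration: the function names are hypothetical, and the "key"/"value" field names are simply the ones used in the example above, not anything defined by the SSSOM schema.

```python
from typing import Dict, List


def entries_to_map(entries: List[Dict[str, str]]) -> Dict[str, str]:
    """Turn a list of {"key": ..., "value": ...} entries into a plain dict."""
    result: Dict[str, str] = {}
    for entry in entries:
        key, value = entry["key"], entry["value"]
        # A list can repeat a key; a dict cannot, so reject conflicting entries.
        if key in result and result[key] != value:
            raise ValueError(f"conflicting expansions for prefix {key!r}")
        result[key] = value
    return result


def map_to_entries(prefix_map: Dict[str, str]) -> List[Dict[str, str]]:
    """Turn a plain dict into a list of entries, sorted by prefix for stable output."""
    return [{"key": key, "value": value} for key, value in sorted(prefix_map.items())]


entries = [{"key": "FBbt", "value": "http://purl.obolibrary.org/obo/FBbt_"}]
curie_map = entries_to_map(entries)
print(curie_map)
print(map_to_entries(curie_map))
```

The list form also allows duplicate keys, which the dict form rules out by construction; a reader of the list form would have to decide what to do with them (here they are rejected when they conflict).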
My own solution (that nobody will like, I know) to that is simple: decide that CURIEfied identifiers are only for the TSV format (which is what the spec currently says, incidentally), and that JSON should only contain full-length identifiers. No CURIE map needed, problem solved.
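On the writing side, that would amount to expanding every CURIE before serialising. A minimal sketch, assuming the writer still has the prefix map of the TSV file at hand; expand_curie and expand_mapping are hypothetical helpers, the slot list is an assumption, and the local identifier in the example is made up.

```python
from typing import Any, Dict, Iterable

# Assumption: these are the slots whose values get expanded.
ENTITY_SLOTS = ("subject_id", "predicate_id", "object_id", "mapping_justification")


def expand_curie(value: str, prefix_map: Dict[str, str]) -> str:
    """Expand a CURIE to a full IRI; values that are already IRIs pass through."""
    if "://" in value:
        return value
    prefix, sep, local_id = value.partition(":")
    if not sep or prefix not in prefix_map:
        raise ValueError(f"cannot expand {value!r}: unknown or missing prefix")
    return prefix_map[prefix] + local_id


def expand_mapping(
    mapping: Dict[str, Any],
    prefix_map: Dict[str, str],
    slots: Iterable[str] = ENTITY_SLOTS,
) -> Dict[str, Any]:
    """Return a copy of a mapping record with its entity slots expanded."""
    expanded = dict(mapping)
    for slot in slots:
        if isinstance(expanded.get(slot), str):
            expanded[slot] = expand_curie(expanded[slot], prefix_map)
    return expanded


prefix_map = {"FBbt": "http://purl.obolibrary.org/obo/FBbt_"}
record = {"subject_id": "FBbt:00000001", "confidence": 0.8}  # made-up identifier
print(expand_mapping(record, prefix_map))
```

The resulting JSON is self-contained, at the cost of being longer and of losing the original prefix choices when converting back to TSV.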