sssom-py

Merged MappingSetDataFrames do not contain the same metadata as the sources

Open hrshdhgd opened this issue 3 years ago • 6 comments

basic.tsv has the following metadata:

#license: "https://creativecommons.org/publicdomain/zero/1.0/"
#mapping_set_id: http://w3id.org/sssom/mapping/tests/data/basic.tsv
#mapping_tool: "https://github.com/cmungall/rdf_matcher"
#creator_id: "cjm"
#mapping_date: "2020-05-30"
#curie_map:
#  a: "http://example.org/a/"
#  rdfs: "http://www.w3.org/2000/01/rdf-schema#"
#  owl: "http://www.w3.org/2002/07/owl#"
#  x: "http://example.org/x/"
#  y: "http://example.org/y/"
#  z: "http://example.org/z/"
#  b: "http://example.org/b/"
#  c: "http://example.org/c/"
#  d: "http://example.org/d/"

and basic2.tsv has the following metadata:

#license: "https://creativecommons.org/publicdomain/zero/1.0/"
#mapping_set_id: http://w3id.org/sssom/mapping/tests/data/basic2.tsv
#mapping_tool: "https://github.com/cmungall/rdf_matcher"
#creator_id: "cjm"
#mapping_date: "2020-05-30"
#curie_map:
#  a: "http://example.org/a/"
#  rdfs: "http://www.w3.org/2000/01/rdf-schema#"
#  owl: "http://www.w3.org/2002/07/owl#"
#  x: "http://example.org/x/"
#  y: "http://example.org/y/"
#  z: "http://example.org/z/"
#  b: "http://example.org/b/"
#  c: "http://example.org/c/"
#  d: "http://example.org/d/"

As of now, when I merge the two, the resulting msdf (merged_msdf) has only the curie_map as metadata at the top. Everything else (all the keys above the curie_map) ends up as individual columns in merged_msdf:

# curie_map:
#   a: http://example.org/a/
#   b: http://example.org/b/
#   c: http://example.org/c/
#   d: http://example.org/d/
#   owl: http://www.w3.org/2002/07/owl#
#   rdfs: http://www.w3.org/2000/01/rdf-schema#
#   skos: http://www.w3.org/2004/02/skos/core#
#   sssom: http://w3id.org/sssom/
#   x: http://example.org/x/
#   y: http://example.org/y/
#   z: http://example.org/z/

Is this correct?

The reason I ask is that when I export merged_msdf to a TSV and read it back in (using read_sssom_table), I get an error saying there is no license or mapping_set_id in the SSSOM file being read. What should we do?

cc: @cmungall , @matentzn

hrshdhgd avatar Jan 13 '22 00:01 hrshdhgd

Hmm good question. Maybe we should add another parameter to merge to supply the fresh metadata, and require the license and id to be set?
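A minimal sketch of that idea, using plain dicts rather than the real sssom-py API. The helper name `build_merged_metadata` and the `fresh_metadata` parameter are hypothetical and only illustrate requiring the two keys at merge time:

```python
REQUIRED_KEYS = ("license", "mapping_set_id")


def build_merged_metadata(fresh_metadata: dict) -> dict:
    """Validate caller-supplied metadata for a merged mapping set.

    Hypothetical helper: not part of sssom-py.
    """
    # Treat absent keys and empty values the same way.
    missing = [key for key in REQUIRED_KEYS if not fresh_metadata.get(key)]
    if missing:
        raise ValueError("Merged mapping set needs: " + ", ".join(missing))
    # Return a copy so the caller's dict is not shared with the merged set.
    return dict(fresh_metadata)
```

With something like this, a merge call would refuse to produce a header that cannot be read back in.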

matentzn avatar Jan 17 '22 15:01 matentzn

Just listing down the immediate questions I have here for discussion:

  • Shouldn't the license and mapping_set_id be added by default, so that the merged file is a valid msdf when it is read back in?
  • I do realize it is redundant information in the same file but is there an alternative?
  • Also, while using metadata from the multiple files being merged:
    • Ideally the values to the same keys should be the same. What do we do when they are different?

hrshdhgd avatar Jan 17 '22 16:01 hrshdhgd

This is a more profound question than it appears at first.

Imagine you create a new mapping set from two mapping sets. One uses cc-by4 and the other cc-by3 as its license. Then you merge them. What happens? What kind of license does the resulting one have? Either we use sssom:unspecified as a placeholder, or we have people add the license somehow. Maybe the latter is the better way: we don't want to get into the business of guessing licenses. For the mapping set ID, you could autogenerate one with a UUID. What do you think?
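Roughly, the placeholder-plus-UUID option could look like the sketch below. The function name `default_merged_metadata` and the base IRI `https://example.org/mappings/` are assumptions made up for illustration; only `sssom:unspecified` comes from the discussion:

```python
import uuid
from typing import Optional

# Assumed base IRI for generated IDs; a real project would pick its own.
ID_BASE = "https://example.org/mappings/"


def default_merged_metadata(license_iri: Optional[str] = None) -> dict:
    """Build minimal metadata for a merged set (hypothetical helper)."""
    return {
        # Never guess a license: fall back to the placeholder unless one is given.
        "license": license_iri or "sssom:unspecified",
        # A random UUID gives each merge result its own identifier.
        "mapping_set_id": ID_BASE + str(uuid.uuid4()),
    }
```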

matentzn avatar Jan 17 '22 16:01 matentzn

(yes, we absolutely need the license and ID, but during a merge it's unclear which one to take)

matentzn avatar Jan 17 '22 16:01 matentzn

> Ideally the values to the same keys should be the same. What do we do when they are different?

FAIL! No doubt.

EDIT: Actually, let's think this through in the next call. Maybe don't fail. Maybe just prefer the prefix mappings from a over the ones from b. LinkML should ensure that we don't care that we change a prefix or a prefix URL from one msdf to another.
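As a sketch of the "prefer a over b" option, with the prefix maps as plain dicts (the name `merge_prefix_maps` is hypothetical, not an sssom-py function):

```python
def merge_prefix_maps(map_a: dict, map_b: dict) -> dict:
    """Combine two prefix maps; on a clash the entry from map_a wins."""
    merged = dict(map_b)   # start from b
    merged.update(map_a)   # entries from a overwrite clashing ones from b
    return merged
```

Note that this only settles the header. Any CURIEs in b that relied on an overridden prefix would still need to be rewritten, which this sketch does not attempt.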

matentzn avatar Jan 17 '22 16:01 matentzn

I think failing is the best short-term fix for both prefix clashes and license clashes.

We can imagine options to help people repair these - e.g. rewiring CURIEs, or even license reasoning such that the least permissive license wins. But I think it's just easier in the short term to fail fast and force the user to fix the header.

Maybe think about repairs in the context of a separate command, along the lines of robot annotation, robot rename.

(for license clashes, I also think having this become sssom:unspecified is OK. I didn't realize that was even an option; it kind of takes the mickey out of this being a required field, and it would be cleaner to have it be not required)
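For reference, the fail-fast behaviour could be sketched like this, treating each header as a plain dict with a `license` key and a `curie_map` dict. The function name `check_mergeable` is hypothetical and this is not how sssom-py currently behaves:

```python
def check_mergeable(meta_a: dict, meta_b: dict) -> None:
    """Raise ValueError if the headers disagree on license or a shared prefix."""
    problems = []

    license_a, license_b = meta_a.get("license"), meta_b.get("license")
    if license_a != license_b:
        problems.append(f"license: {license_a!r} vs {license_b!r}")

    map_a = meta_a.get("curie_map", {})
    map_b = meta_b.get("curie_map", {})
    # Only prefixes present in both maps can clash.
    for prefix in sorted(map_a.keys() & map_b.keys()):
        if map_a[prefix] != map_b[prefix]:
            problems.append(f"prefix {prefix}: {map_a[prefix]!r} vs {map_b[prefix]!r}")

    if problems:
        raise ValueError("Cannot merge, fix the headers first: " + "; ".join(problems))
```

For basic.tsv and basic2.tsv above, which share the same license and prefix URLs, this check would pass silently.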

cmungall avatar Jan 18 '22 22:01 cmungall