
Trace how one mapping was derived from another (or from a set of others)?

Open matentzn opened this issue 2 years ago • 4 comments

When

  • creating reference mappings based on multiple sets of candidate mappings or
  • creating new mappings based on walks of other mappings

we currently have no way to capture this, because we also have no way to identify a mapping at the moment (there is no unique mapping ID). I am not sure how to deal with that in a practical way.

matentzn avatar Sep 24 '21 10:09 matentzn

We could define a hash function over the subject/relation/target that first canonicalizes the CURIEs (using the Bioregistry, for example), then use that hash as the identifier for a given mapping. Or we could concatenate them into a single string, separated by a delimiter we don't expect to see in the identifiers, like a pipe.
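A minimal sketch of the hash idea, for illustration only. The `PREFIX_SYNONYMS` table and the function names `canonicalize_curie` and `mapping_hash` are hypothetical stand-ins; a real implementation would presumably delegate prefix normalization to the Bioregistry rather than a hand-written table.

```python
import hashlib

# Hypothetical synonym table standing in for a real prefix registry
# (e.g. the Bioregistry); the entries are illustrative, not authoritative.
PREFIX_SYNONYMS = {"hp": "HP", "hpo": "HP", "mondo": "MONDO", "doid": "DOID"}


def canonicalize_curie(curie: str) -> str:
    """Normalize the prefix of a CURIE; unknown prefixes are left unchanged."""
    prefix, _, local_id = curie.strip().partition(":")
    prefix = PREFIX_SYNONYMS.get(prefix.lower(), prefix)
    return f"{prefix}:{local_id}"


def mapping_hash(subject_id: str, predicate_id: str, object_id: str) -> str:
    """Return a SHA-256 hex digest over the canonicalized triple."""
    parts = [canonicalize_curie(c) for c in (subject_id, predicate_id, object_id)]
    # The pipe keeps the three fields from running into each other.
    return hashlib.sha256("|".join(parts).encode("utf-8")).hexdigest()


if __name__ == "__main__":
    print(mapping_hash("hp:0000118", "skos:exactMatch", "MONDO:0000001"))
```

Two spellings of the same triple (say `hp:` versus `HPO:` for the subject) hash to the same value under this sketch, which is the point of canonicalizing first. The trade-off is that the hash itself is opaque, unlike the concatenated form.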

If we can be creative, we could create a data model for describing a sequence of transformations applied to a given mapping or set of mappings (I don't think this would fit inside SSSOM itself, though).

cthoyt avatar Sep 24 '21 10:09 cthoyt

This is not enough for the walking provenance problem, and there is also the problem that an SSSOM file can contain the same mapping twice.

matentzn avatar Sep 24 '21 11:09 matentzn

A few minutes ago I made the same concatenation suggestion as @cthoyt elsewhere (with the artifact_id at the beginning of the identifier). If an SSSOM file contains the same mapping twice and the two are identical, I don't think we care at a practical level which one is 'identified'; if they differ in some additional metadata about that mapping, just make the concatenation cover the entire content. With column headers too (after column 3) if one wanted to be, umm, exhaustive.

What I like is that it's human-traceable and even human-comprehensible (depending on necessary escapes), which makes up for the exhaustingly long identifiers. And no extra work needed by the author, so the SSSOM stays Simple.
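As a rough sketch of that concatenation, assuming a mapping row is available as a plain dict keyed by SSSOM column name. The function name `concat_id`, the `exhaustive` flag, and the choice of percent-encoding for the "necessary escapes" are all assumptions made for illustration, not anything SSSOM defines.

```python
from urllib.parse import quote

CORE_COLUMNS = ("subject_id", "predicate_id", "object_id")


def _escape(value) -> str:
    # Percent-encode anything that could collide with the "|" or "="
    # delimiters; ":", "/" and "." are kept so CURIEs and IRIs stay readable.
    return quote(str(value), safe=":/.")


def concat_id(mapping_set_id: str, row: dict, exhaustive: bool = False) -> str:
    """Build a human-readable identifier: set ID first, then the core triple."""
    parts = [_escape(mapping_set_id)]
    parts += [_escape(row.get(column, "")) for column in CORE_COLUMNS]
    if exhaustive:
        # Assumption: the remaining columns are appended as header=value,
        # sorted by header so the identifier does not depend on column order.
        extra = sorted(key for key in row if key not in CORE_COLUMNS)
        parts += [f"{_escape(key)}={_escape(row[key])}" for key in extra]
    return "|".join(parts)


if __name__ == "__main__":
    row = {
        "subject_id": "HP:0000118",
        "predicate_id": "skos:exactMatch",
        "object_id": "MONDO:0000001",
        "mapping_justification": "semapv:LexicalMatching",
    }
    print(concat_id("https://example.org/sets/a", row, exhaustive=True))
```

With `exhaustive=False`, two rows that share the same triple but differ in metadata get the same identifier; with `exhaustive=True` they are distinguished, at the cost of much longer strings.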

I don't understand what's not enough for the walking provenance problem. Can we be more explicit about what's needed and missing? (Unless it's obvious.)

graybeal avatar Oct 08 '21 01:10 graybeal

If the mapping set ID is part of that ID, you are right. We can encode the whole walk like that, and it will look horrible, but will actually be quite readable once you recover from the initial shock. :)
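One way this could look, purely as a sketch: join the identifiers of the mappings traversed, in order, with a second delimiter. The `>>` delimiter and the names `walk_id` and `walk_steps` are hypothetical, and the step identifiers are assumed to already include the mapping set ID and to be escaped so that they never contain `>>`.

```python
WALK_DELIMITER = ">>"  # hypothetical; must never occur inside a step identifier


def walk_id(step_ids) -> str:
    """Encode a walk as the ordered identifiers of the mappings it traversed."""
    steps = list(step_ids)
    if not steps:
        raise ValueError("a walk needs at least one mapping")
    for step in steps:
        if WALK_DELIMITER in step:
            raise ValueError(f"step identifier contains the walk delimiter: {step!r}")
    return WALK_DELIMITER.join(steps)


def walk_steps(identifier: str) -> list:
    """Recover the individual mapping identifiers from an encoded walk."""
    return identifier.split(WALK_DELIMITER)


if __name__ == "__main__":
    steps = [
        "setA|HP:0000118|skos:exactMatch|MONDO:0000001",
        "setB|MONDO:0000001|skos:exactMatch|DOID:4",
    ]
    encoded = walk_id(steps)
    print(encoded)
    print(walk_steps(encoded))
```

The result is long, but each step can be read off directly and traced back to its source mapping set.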

matentzn avatar Oct 08 '21 10:10 matentzn