Adding detail to RetrievalSource provenance
Exploring some modeling that would support capturing a couple of additional retrieval source provenance details on a per-edge basis. To discuss on the upcoming MUTT/DINGO call:
- added an `ingest_source` permissible value - to help capture which source the data was actually ingested from (and made the `RetrievalSource.resource_role` slot multivalued - to allow indicating that a particular primary or aggregator source was also the 'ingest_source')
  - also tested an alternative pattern to capture this info, which defines a separate slot for it: `ingest_source: boolean`. This pattern may make it easier to parse out this important data, and would allow us to make capturing this info required if we want
  - if we decide to capture this type of metadata, we should choose one of the two implemented patterns
- added an `ingest_files` slot to `RetrievalSource` - for use in the `RetrievalSource` object for the ingest_source, to report the file(s) from which the data used to create the edge were retrieved. This provides more complete provenance, and supports various downstream activities:
  - developer debugging (lets us better trace edges back to the source data)
  - manual QA efforts (helps reviewers organize edge types by file source - e.g. very helpful for CTD)
  - more precise provenance for end users to understand where the edge came from
  - identifying edges that may need to be updated/reviewed if a source updates its data/files

  If we don't capture this at the edge level in the data, perhaps we could make it standard to put this info in the RIG for each 'EdgeType' object?
- finally, I added definitions to the `ResourceRoleEnum` values - which I think we should keep even if we don't adopt the other proposals above
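
To make the two `ingest_source` patterns and the `ingest_files` use case easier to compare on the call, here is a rough Python sketch. It is not the schema itself: the class names `RoleListSource` and `BooleanFlagSource`, the helper `edges_from_file`, the enum member names, the `infores:example-*` identifiers, and the file names are all hypothetical stand-ins chosen for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List


class ResourceRole(Enum):
    # Illustrative subset of role values; INGEST_SOURCE is the proposed addition.
    PRIMARY = "primary_knowledge_source"
    AGGREGATOR = "aggregator_knowledge_source"
    INGEST_SOURCE = "ingest_source"


@dataclass
class RoleListSource:
    """Pattern 1 (hypothetical name): resource_role is multivalued."""
    resource_id: str
    resource_role: List[ResourceRole]
    ingest_files: List[str] = field(default_factory=list)

    def is_ingest_source(self) -> bool:
        # The ingest source is found by looking inside the role list.
        return ResourceRole.INGEST_SOURCE in self.resource_role


@dataclass
class BooleanFlagSource:
    """Pattern 2 (hypothetical name): a separate boolean slot."""
    resource_id: str
    resource_role: ResourceRole
    ingest_source: bool  # no default, so it must always be supplied
    ingest_files: List[str] = field(default_factory=list)

    def is_ingest_source(self) -> bool:
        return self.ingest_source


def edges_from_file(edges: Dict[str, list], file_name: str) -> List[str]:
    """Return ids of edges whose ingest source lists the given file."""
    return sorted(
        edge_id
        for edge_id, sources in edges.items()
        if any(s.is_ingest_source() and file_name in s.ingest_files for s in sources)
    )


# Made-up edges: one modeled with each pattern.
edges = {
    "edge-1": [
        RoleListSource(
            "infores:example-primary",
            [ResourceRole.PRIMARY, ResourceRole.INGEST_SOURCE],
            ["file_a.tsv"],
        ),
    ],
    "edge-2": [
        BooleanFlagSource("infores:example-primary", ResourceRole.PRIMARY, False),
        BooleanFlagSource(
            "infores:example-aggregator", ResourceRole.AGGREGATOR, True, ["file_b.tsv"]
        ),
    ],
}

# Which edges should be reviewed if file_a.tsv changes upstream?
affected = edges_from_file(edges, "file_a.tsv")
```

In this sketch the boolean pattern makes the flag a required constructor argument, which mirrors the point above about being able to require the info, while the multivalued pattern needs a membership check on the role list to answer the same question.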