Adding detail to RetrievalSource provenance

Open mbrush opened this issue 2 months ago • 0 comments

Exploring some modeling that would support capturing a couple of additional retrieval source provenance details on a per-edge basis. To discuss on the upcoming MUTT/DINGO call:

added an ingest_source permissible value - to help capture which source the data was actually ingested from (and made the RetrievalSource.resource_role slot multivalued - to allow indicating that a particular primary or aggregator source was also the 'ingest_source')
also tested an alternative pattern to capture this info - one that defines a separate boolean slot for it - ingest_source: boolean
- this pattern may make it easier to parse out this important data, and allow us to make capturing this info required if we want
- if we decide to capture this type of metadata, choose one of the two implemented patterns
added an ingest_files slot to RetrievalSource - for use in the RetrievalSource object for the ingest_source, to report the file(s) from which the data used to create the edge were retrieved. This provides more complete provenance, and supports various downstream activities:
- developer debugging (lets us better trace edges back to the source data)
- manual QA efforts (help reviewers organize edge types by file source - e.g. very helpful for CTD)
- more precise provenance for end users to understand where the edge came from
- identifying edges that may need to be updated/reviewed if a source updates its data/files
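To make the two patterns easier to compare on the call, here is a rough sketch of how each might look on an edge and how a consumer would pull out the ingest source. This is illustrative only, not the implemented schema: the slot names resource_role, ingest_source and ingest_files come from the proposal above, while the `sources` and `id` keys, the infores:example-aggregator identifier, the file name, and all three helper functions are assumptions made up for the example.

```python
from typing import Callable, Optional

# Pattern 1 (sketch): resource_role is multivalued and may include 'ingest_source'.
edge_a = {
    "id": "edge-a",  # hypothetical edge identifier
    "sources": [
        {"resource_id": "infores:ctd",
         "resource_role": ["primary_knowledge_source"]},
        {"resource_id": "infores:example-aggregator",  # hypothetical source
         "resource_role": ["aggregator_knowledge_source", "ingest_source"],
         "ingest_files": ["example_file_1.tsv"]},  # hypothetical file name
    ],
}

# Pattern 2 (sketch): a separate boolean slot marks the ingest source.
edge_b = {
    "id": "edge-b",
    "sources": [
        {"resource_id": "infores:ctd",
         "resource_role": "primary_knowledge_source",
         "ingest_source": False},
        {"resource_id": "infores:example-aggregator",
         "resource_role": "aggregator_knowledge_source",
         "ingest_source": True,
         "ingest_files": ["example_file_1.tsv"]},
    ],
}


def find_ingest_source_by_role(sources: list) -> Optional[dict]:
    """Pattern 1: look for 'ingest_source' among the multivalued roles."""
    for source in sources:
        if "ingest_source" in source.get("resource_role", []):
            return source
    return None


def find_ingest_source_by_flag(sources: list) -> Optional[dict]:
    """Pattern 2: look for the source whose boolean slot is set to True."""
    for source in sources:
        if source.get("ingest_source") is True:
            return source
    return None


def edges_from_file(edges: list, filename: str,
                    finder: Callable[[list], Optional[dict]]) -> list:
    """Return ids of edges whose ingest source lists the given file."""
    matches = []
    for edge in edges:
        source = finder(edge["sources"])
        if source is not None and filename in source.get("ingest_files", []):
            matches.append(edge["id"])
    return matches
```

The last helper corresponds to the "edges that may need to be updated/reviewed" use case: given a file a source has changed, it lists the edges that were built from it. With the boolean pattern the lookup is a single field check, which is the "easier to parse" point above; with the multivalued pattern consumers need to handle a list-valued resource_role everywhere.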

. . . If not captured at the edge level in the data, perhaps we could make it standard to put this info in the RIG for each 'EdgeType' object?

finally, I added defs to ResourceRoleEnum values - which I think we should keep even if we don't adopt the other proposals above

Oct 26 '25 17:10 mbrush