biolink-model

Adding detail to RetrievalSource provenance

Open mbrush opened this issue 2 months ago • 0 comments

Exploring some modeling that would support capturing a couple of additional retrieval source provenance details on a per-edge basis. To discuss on the upcoming MUTT/DINGO call:

  • added an ingest_source permissible value - to help capture which source the data was actually ingested from (and made the RetrievalSource.resource_role slot multivalued - to allow indicating that a particular primary or aggregator source was also the 'ingest_source')

  • also tested an alternative pattern to capture this info - that defines a separate slot to capture the ingest_source - ingest_source: boolean

    • this pattern may make it easier to parse out this important data, and allow us to make capturing this info required if we want
    • if we decide to capture this type of metadata, choose one of the two implemented patterns
  • added an ingest_files slot to RetrievalSource - for use in the RetrievalSource object for the ingest_source, to report file(s) from which the data used to create the edge were retrieved. This provides more complete provenance, and supports various downstream activities:

    • developer debugging (lets us better trace edges back to the source data)
    • manual QA efforts (help reviewers organize edge types by file source - e.g. very helpful for CTD)
    • more precise provenance for end users to understand where the edge came from
    • identifying edges that may need to be updated/reviewed if a source updates its data/files

. . . If not at the edge level in the data, perhaps making it standard to put this info in the RIG for each 'EdgeType' object?
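To illustrate the QA and update-tracking uses listed above, here is a minimal Python sketch of what per-edge ingest_files could enable. The `RetrievalSource` and `Edge` dataclasses are simplified stand-ins I made up for illustration (not the generated Biolink classes), `edges_by_ingest_file` is a hypothetical helper, and the file names are placeholders rather than real CTD files.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class RetrievalSource:
    # Simplified, hypothetical stand-in for the Biolink RetrievalSource class.
    resource_id: str
    resource_role: str
    # Proposed slot: file(s) the edge data was retrieved from.
    ingest_files: list[str] = field(default_factory=list)


@dataclass
class Edge:
    # Hypothetical minimal edge: an id plus its retrieval sources.
    edge_id: str
    sources: list[RetrievalSource]


def edges_by_ingest_file(edges):
    """Map each ingest file name to the ids of the edges built from it."""
    index = defaultdict(list)
    for edge in edges:
        for source in edge.sources:
            for file_name in source.ingest_files:
                index[file_name].append(edge.edge_id)
    return dict(index)


# Placeholder data: file names are invented for the example.
edges = [
    Edge("e1", [RetrievalSource("infores:ctd", "primary_knowledge_source", ["file_a.tsv"])]),
    Edge("e2", [RetrievalSource("infores:ctd", "primary_knowledge_source", ["file_b.tsv"])]),
    Edge("e3", [RetrievalSource("infores:ctd", "primary_knowledge_source", ["file_a.tsv"])]),
]

file_index = edges_by_ingest_file(edges)
# If the source republishes file_a.tsv, these are the edges to re-review.
needs_review = file_index.get("file_a.tsv", [])
```

The same index would let reviewers organize edges by file, and lets a developer go from an edge straight to the file it came from.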

  • finally, I added defs to ResourceRoleEnum values - which I think we should keep even if we don't adopt the other proposals above
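For the call, a rough Python sketch comparing how a consumer would pick out the ingest source under each of the two patterns described in the first bullets. Both dataclasses are hypothetical simplifications of RetrievalSource, and `infores:example-aggregator` is an invented identifier; only the slot and value names mentioned in this issue are taken from the proposal.

```python
from dataclasses import dataclass


@dataclass
class SourceMultiRole:
    # Pattern 1 (sketch): resource_role is multivalued and may
    # include the new 'ingest_source' permissible value.
    resource_id: str
    resource_role: list[str]


@dataclass
class SourceBoolFlag:
    # Pattern 2 (sketch): resource_role stays single-valued and a
    # separate boolean slot marks the ingest source.
    resource_id: str
    resource_role: str
    ingest_source: bool = False


def ingest_sources_multi_role(sources):
    """Pattern 1: scan each role list for the 'ingest_source' value."""
    return [s.resource_id for s in sources if "ingest_source" in s.resource_role]


def ingest_sources_bool_flag(sources):
    """Pattern 2: read the dedicated boolean slot directly."""
    return [s.resource_id for s in sources if s.ingest_source]


pattern_1 = [
    SourceMultiRole("infores:ctd", ["primary_knowledge_source"]),
    SourceMultiRole(
        "infores:example-aggregator",  # invented id, for illustration only
        ["aggregator_knowledge_source", "ingest_source"],
    ),
]

pattern_2 = [
    SourceBoolFlag("infores:ctd", "primary_knowledge_source"),
    SourceBoolFlag(
        "infores:example-aggregator",
        "aggregator_knowledge_source",
        ingest_source=True,
    ),
]

found_1 = ingest_sources_multi_role(pattern_1)
found_2 = ingest_sources_bool_flag(pattern_2)
```

Both lookups return the same source here. The difference is that pattern 2 keeps the existing single-valued resource_role untouched and gives a schema a single slot to mark as required, whereas pattern 1 needs consumers to handle a role list.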

mbrush Oct 26 '25 17:10