ga4gh-schemas

Provenance

Open jacmarjorie opened this issue 10 years ago • 4 comments

The idea of provenance came up in the G2P meeting today, and I thought it would be a good idea to open up a thread for conversation around this. Standardizing provenance deserves a lot of thought and fits into several task teams. Should we consider forming a provenance task team?

One direct example would be querying on provenance to gain insight into commonly used pipelines. If a data management system (e.g. Synapse, LabKey) is storing provenance, and doing so in a GA4GH-compliant manner, API calls could be used across multiple data management systems to generate statistics about the most commonly used tools for an analysis pipeline. This could provide insight into the best analysis workflows out there, and work towards standardizing them.

jacmarjorie, Feb 19 '15 23:02

Hi Jacmarjorie,

We have a discussion on data provenance in the Containers and Execution group. I think provenance must be part of that group.

Max

max-biodatomics, Feb 20 '15 00:02

Max, will you point me to this discussion? Thanks.

jacmarjorie, Feb 20 '15 00:02

Provenance and identification are GA4GH-wide topics. While the work the containers group is doing is important, particularly digests, tracking must extend beyond what is or can be done in any recommended container.

Provenance tracking must cover all data and metadata. A change in the metadata can drastically change the interpretation of the data.

Also, GA4GH's mission is data sharing. Any provenance information has to be independent of particular software implementations.

Perhaps you can present on a metadata call and we can start coordinating? We very much need to pay attention to provenance tracking across all of the task teams.

Mark

diekhans, Feb 20 '15 01:02

I am sorry, I missed the most recent e-mail. The containers group is still in the early stages of developing standards. Our discussions so far have covered information about tools, parameters, and binaries. For example, if you are adding metadata about a tool, you need to specify a version, and sometimes even build information for the tool. The main consensus reached was that we will use Docker containers for distributing binaries wherever possible (unfortunately, that is not everywhere). Docker has a unique hash code for each container.

So, the metadata for each file should contain information on the tool, version, all parameters, and Docker image.

max-biodatomics, Mar 02 '15 15:03
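
For illustration only, the per-file metadata proposed in the last comment (tool, version, all parameters, Docker image) and the cross-system tool statistics from the opening post could be sketched roughly as below. This is a minimal sketch, not a GA4GH schema: the `ProvenanceRecord` class, its field names, the `tool_usage` helper, and all sample values are hypothetical and were not proposed anywhere in this thread.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class ProvenanceRecord:
    """Hypothetical per-file provenance record (field names are assumptions)."""
    file_name: str
    tool: str
    version: str
    docker_image: str  # identifier of the image that ran the tool (placeholder values below)
    parameters: dict = field(default_factory=dict)


def tool_usage(records):
    """Count how often each (tool, version) pair appears across the records."""
    return Counter((r.tool, r.version) for r in records)


# Made-up records, standing in for provenance gathered from several systems.
records = [
    ProvenanceRecord("a.bam", "example-aligner", "1.2.0", "placeholder-digest-1", {"threads": 4}),
    ProvenanceRecord("b.bam", "example-aligner", "1.2.0", "placeholder-digest-1", {"threads": 8}),
    ProvenanceRecord("a.vcf", "example-caller", "0.9.1", "placeholder-digest-2"),
]

usage = tool_usage(records)
for (tool, version), count in usage.most_common():
    print(tool, version, count)
```

Keying the count on the (tool, version) pair rather than the tool name alone reflects the point made above that a tool's metadata needs to specify a version.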