
Update Source: ClinVar - Refine evidence/provenance metadata

Open mbrush opened this issue 7 years ago • 2 comments

The first-pass ingest of ClinVar-XML (#276) pulled only minimal evidence and provenance information.
As SEPIO matures, we will soon be at a point where we can revisit this source to pull more of this info. A couple specific things to address:

  1. Modeling of evidence for associations based on the 'literature only' Method needs fixing. According to their documentation, this Method is cited when "Data is extracted from published literature with interpretation as reported in the citation". So the agent asserting the SCV here is just parroting assertions made in published papers. Our current model creates a single evidence line for these SCVs and links it to all referenced pubs. But in reality each pub is making an assertion that is used as evidence for the submitting agent's assertion. So we should create a separate evidence line (typed as an ECO:TAS) for each referenced paper, where each line is linked only to that single publication. We could also create an assertion bnode as the supporting info, since we can infer that these assertions are made given the definition of 'literature only'.
  2. Modeling of evidence for associations based on the 'curation' Method also needs fixing. This method is used "for variants that were not directly observed by the submitter, but were interpreted by curation of multiple sources, including clinical testing laboratory reports, publications, private case data, and public databases." Here it is not clear how many lines of evidence exist, only that a set of referenced pubs was used in finding evidence to assess in making the SCV assertion. Here I might propose just linking the referenced pubs to the assertion directly, using dc:source. We could create no evidence line at all (since we have no idea what the lines are or how many there are). Or we could create one line and link it to a supporting 'data curation' activity (so we continue to capture, and can search/filter on, assertions generated through literature curation).
  3. There is sparsely populated metadata about clinical subjects genotyped and studied in generating evidence for some assertions (those tagged with the 'clinical testing' Method in particular). We could attempt to bring this data in as well, but it would likely not add much value as it is sparse.
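
A rough sketch of what the fixes in 1 and 2 might look like, written as plain Python tuples standing in for triples rather than the actual ingest code. Only `dc:source` and the `ECO:TAS` typing come from the proposal above; the function name, the other predicate labels (`has_evidence_line`, `cites_publication`, `has_supporting_item`, `has_supporting_activity`), the bnode naming and the `curation_line` flag are all hypothetical placeholders, not real SEPIO terms.

```python
from itertools import count


def model_scv_evidence(scv_id, method, pub_ids, curation_line=True):
    """Return (subject, predicate, object) triples for one SCV.

    Predicate labels other than dc:source are hypothetical placeholders.
    """
    ids = count(1)

    def bnode(kind):
        # Blank-node labels are only unique within a single call.
        return f"_:{kind}_{next(ids)}"

    triples = []
    if method == "literature only":
        # One evidence line per referenced paper, each tied to a single pub,
        # plus an inferred assertion bnode as the supporting info.
        for pub in pub_ids:
            line = bnode("evidence_line")
            assertion = bnode("assertion")
            triples += [
                (scv_id, "has_evidence_line", line),
                (line, "rdf:type", "ECO:TAS"),
                (line, "cites_publication", pub),
                (line, "has_supporting_item", assertion),
            ]
    elif method == "curation":
        # Pubs hang directly off the assertion via dc:source.
        triples += [(scv_id, "dc:source", pub) for pub in pub_ids]
        if curation_line:
            # Optional single line pointing at a 'data curation' activity,
            # so curated assertions stay searchable/filterable.
            line = bnode("evidence_line")
            activity = bnode("data_curation")
            triples += [
                (scv_id, "has_evidence_line", line),
                (line, "has_supporting_activity", activity),
            ]
    return triples
```

Calling `model_scv_evidence("SCV:1", "literature only", ["PMID:1", "PMID:2"])` (made-up identifiers) yields two evidence lines, each citing exactly one publication. Passing `curation_line=False` for a 'curation' SCV gives the other option in 2, with no evidence line at all.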

Note also that I suspect that ClinVar may be changing its data model given recent activity in the ClinGen community - so given the low immediate value of evidence metadata we collect from ClinVar, it may be best to just make the easy fixes to 1 and 2 above, and not spend time parsing out additional metadata as in 3.

mbrush, Mar 14 '17 22:03