ga4gh-schemas
Add examples of valid HGVS to documentation
For example, from the server test data:
va.transcriptEffects[0].hgvsAnnotation.toJsonDict()
gives
{u'genomic': u'1:g.46286_46288delTAT',
u'protein': u'',
u'transcript': u'n.46286_46288delTAT'}
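For contrast, accession-qualified forms would look roughly like this (assuming GRCh37; the transcript accession is a placeholder, since the test data doesn't name one):

```
genomic:    NC_000001.10:g.46286_46288delTAT
transcript: <transcript accession>:n.46286_46288delTAT
```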
I think we will ultimately defeat the utility of the GA4GH API if we don't enforce some standards on core data types, such as variants. (Ditto SO.)
Please add your thoughts about the severity of the issue, policy guidance, and perhaps technical solutions.
Personally, I would like to clamp down hard and require consistent data. However, the realities of being an interoperable data-exchange API conflict with this.
Early efforts to standardize data identification didn't go anywhere, with the prevailing view being that the API is just for coding and that it's up to the providers to standardize.
Data interoperability is usually much harder than API interoperability. At some point, it would make sense to have data interoperability standards with tools to enforce them. It would be good if this was independent of any given server implementation, not just a feature of the reference server.
Out of curiosity, who created that crazy HGVS notation?
I agree that API interoperability and data interoperability are different things, but I think they're not completely distinct. For instance, in the case of HGVS, it would not be that hard to require 1) that HGVS-formatted strings are syntactically valid (e.g., the example transcript variant doesn't have a sequence accession), and 2) that variants use sequence accessions rather than bare names (e.g., NC_000001.10 rather than "1").
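As a sketch of what such minimal syntactic validation might look like (illustrative only; the pattern and function below are not from any GA4GH code):

```python
import re

# Require an accession-qualified string of the form
# <versioned accession>:<coordinate type>.<description>, rejecting bare
# sequence names such as "1".  This deliberately stops short of semantic
# validation (coordinate bounds, reference bases, etc.).
HGVS_PATTERN = re.compile(
    r"^(?:[A-Z]{2}_\d+\.\d+|ENS[TGP]\d+\.\d+)"  # versioned RefSeq or Ensembl accession
    r":[gcnmrp]\."                              # coordinate type (g., c., n., m., r., p.)
    r"\S+$"                                     # variant description (unchecked here)
)

def is_minimally_valid_hgvs(hgvs):
    return HGVS_PATTERN.match(hgvs) is not None

assert is_minimally_valid_hgvs("NC_000001.10:g.46286_46288delTAT")
assert not is_minimally_valid_hgvs("1:g.46286_46288delTAT")  # bare name, not an accession
assert not is_minimally_valid_hgvs("n.46286_46288delTAT")    # no accession at all
```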
> Out of curiosity, who created that crazy HGVS notation?
If you mean the crazy HGVS in the example, I dunno... it was from the server test data. Other elements are from SnpEff (including the issue of protein variants with transcript accessions that we've discussed previously).
If you mean HGVS in general, the short story is Johan Den Dunnen under the auspices of the HGVS 15-ish years ago. It's now overseen by a committee.
@reece Can the API reasonably be required to validate strings for given fields?
I think it should definitely be documented what the expected standards are that a field should be upholding (with inline examples of common mistakes to be avoided, such as the ones this exemplifies).
But I think it would be onerous for folks trying to quickly onboard existing warehouses of data to have their data fail to be accessible (or uploadable?) because of ill-formed annotation strings.
sigh
While we need to deal with data interoperability, the last thing we need is another task team that doesn't write anything down.
I think the example data is 1000 Genomes :-(
On Tue, Apr 5, 2016 at 2:35 PM, gaberudy [email protected] wrote:
> @reece https://github.com/reece Can the API reasonably be required to validate strings for given fields?
That's the question. I think I'd shoot for minimal syntactic validation and stop short of semantic validation. As we both know, validating HGVS is a can of worms.
> I think it should definitely be documented what the expected standards are that a field should be upholding (with inline examples of common mistakes to be avoided, such as the ones this exemplifies).
Agreed!
> But I think it would be onerous for folks trying to quickly onboard existing warehouses of data to have their data fail to be accessible (or uploadable?) because of ill-formed annotation strings.
I see your point. On the other hand, what's the utility of data that can't be searched reliably?
Admittedly, I'm pretty strict re: data cleanliness. If data are not known to be reliable and well-structured, they don't belong in a database (of any sort) because they can't be searched or used reliably.
My own experience is that I quickly adopt a dim view of a database when I find bogus data because it's just too hard to know what's not there, what is there, and how to use it.
Perhaps it would be useful to add validation flags/levels to VariantAnnotationSets? At least we'd know which sets might contain bogosity.
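Something like the following could convey that at the API level (the names are hypothetical; no such field exists in the schemas today):

```python
from enum import Enum

# Hypothetical validation levels for a VariantAnnotationSet -- a sketch of
# the idea, not an existing GA4GH schema field.
class HgvsValidationLevel(Enum):
    UNVALIDATED = "UNVALIDATED"  # no checks performed on HGVS strings
    SYNTACTIC = "SYNTACTIC"      # strings parse and carry sequence accessions
    SEMANTIC = "SEMANTIC"        # also checked against the reference sequences

# A client could then skip sets that might contain bogosity:
def trustworthy(annotation_set_info):
    return annotation_set_info.get("hgvsValidationLevel") in (
        HgvsValidationLevel.SYNTACTIC.value,
        HgvsValidationLevel.SEMANTIC.value,
    )
```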
I like the idea of having a tool / server process be able to "grade" or "validate" the data, like you see with GitHub repo badges (passes tests, fails tests, good code coverage, etc.).
Here is the thing on strictness. I would appreciate it if as many producers of public genomic data as possible used GA4GH, so that our job of curating/accessing/cleaning all relevant genomic data is simplified. But if the strictness is too high, people will just not use the system and revert to what they already have (text files specific to each genomic pipeline, jam-packed INFO fields of VCF files, etc.).
In fact, I doubt many academic producers of genomic data will go through the effort of moving to GA4GH for older projects (I'm thinking of the NHLBI 6500 exomes on the EVS, for example). They just don't have the resources to revisit their data for any heavy lifting or re-analysis.
So anything that provides easy on-ramping of existing datasets, regardless of the state of their annotations, is still a step in the right direction IMHO.
@reece - we can clean up the test data to set a better example. It's not 1000 Genomes data - it was created specifically for setting up the reference server and re-used by the compliance suite, so can be changed or replaced. The 1000 Genomes release used an old version of VEP without the hgvs option.
We currently link out to the HGVS recommendations from our docs: http://ga4gh-schemas.readthedocs.org/en/latest/api/alleleAnnotations.html#transcripteffect-attributes. As has been noted previously, these have been interpreted differently by different groups. Is the suggestion to document a limited GA4GH subset of HGVS?
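If so, the documented subset might boil down to a few rules with paired examples, e.g. (purely illustrative, not agreed GA4GH policy):

```
Rule: every HGVS string carries a versioned sequence accession.
  valid:   NC_000001.10:g.46286_46288delTAT
  invalid: 1:g.46286_46288delTAT            (bare chromosome name)
Rule: protein variants use protein accessions, not transcript accessions.
  valid:   NP_000509.1:p.Glu7Val
  invalid: NM_000518.4:p.Glu7Val            (transcript accession on a p. string)
```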