
Formally define GA4GH and MD5 checksums in VCF v4.4

Open tskir opened this issue 3 years ago • 6 comments

As discussed during GA4GH Connect 2021 calls:

@d-cameron

Just checked VCFv4.3: MD5 isn't even formally defined as a ##contig header field. We should add both ga4gh & md5 into the specs for 4.4.

md5 is shown in the example VCFs included in the specs, and the specs mention MD5 in S1.4.7, but it's not actually formally defined (e.g. header name, how the md5 hash is string-encoded, etc.).
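To make the gap concrete, here is a rough sketch of what a reader of a VCF header currently has to do by convention. It assumes the key is spelled `md5` and that the value is a 32-character hex string, which is exactly what the spec does not yet pin down. The function names `parse_contig_header` and `looks_like_md5` are made up for illustration.

```python
import re

# key=value pairs, where a value is either a double-quoted string or a
# run of characters up to the next comma.
_FIELD = re.compile(r'([^=,]+)=("(?:[^"\\]|\\.)*"|[^,]*)')


def parse_contig_header(line: str) -> dict:
    """Split a ##contig=<...> header line into a dict of its fields."""
    prefix = "##contig=<"
    line = line.strip()
    if not (line.startswith(prefix) and line.endswith(">")):
        raise ValueError("not a ##contig header line")
    body = line[len(prefix):-1]
    fields = {}
    for key, value in _FIELD.findall(body):
        # Drop the surrounding quotes from quoted values.
        if len(value) >= 2 and value[0] == '"' and value[-1] == '"':
            value = value[1:-1]
        fields[key.strip()] = value
    return fields


def looks_like_md5(value: str) -> bool:
    """Assumption: the checksum is 32 hex digits; case is not yet specified."""
    return re.fullmatch(r"[0-9a-fA-F]{32}", value) is not None
```

With a line shaped like the examples in the spec, `parse_contig_header(line).get("md5")` returns the checksum string if the field is present and `None` otherwise. Whether uppercase hex should be accepted is one of the points a formal definition would need to settle.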

@jkbonfield

Agreed, it needs a tight definition - uppercasing, white space removal, what to do about out of range chars, etc.
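For reference, a minimal sketch of the kind of normalisation being discussed. It follows the rule as I understand it from the SAM spec's reference MD5 calculation: strip every character outside the printable ASCII range 33 to 126, uppercase the rest, then present the MD5 as lowercase hex. Whether VCF should adopt exactly this rule is the open question here, and `reference_md5` is a hypothetical name.

```python
import hashlib


def reference_md5(sequence: str) -> str:
    """MD5 of a reference sequence after SAM-style normalisation (assumed rule)."""
    # Drop whitespace, newlines and anything else outside '!' (33) .. '~' (126).
    kept = "".join(ch for ch in sequence if 33 <= ord(ch) <= 126)
    # Uppercase, then hash; hexdigest() yields 32 lowercase hex characters.
    return hashlib.md5(kept.upper().encode("ascii")).hexdigest()
```

Under this rule `reference_md5("acgt\nACGT ")` and `reference_md5("ACGTACGT")` give the same digest, so line wrapping and soft-masking in the FASTA do not change the checksum.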

tskir avatar Mar 01 '21 22:03 tskir

Hope we can refer to refget's spec as the reference for that final point, @jkbonfield. If refget's spec is insufficient, we will update it accordingly.

andrewyatz avatar Mar 01 '21 23:03 andrewyatz

Thank you for starting the discussion, @tskir - very much in support of this and would love to see this integrated with the refget services.

ohofmann avatar Mar 01 '21 23:03 ohofmann

I'm hoping the definition of what's valid and how to deal with invalid chars is compatible between CRAM (edit: actually SAM) and RefGet (I'm sure they must be, given the ancestry), which will therefore serve as the logical starting point for VCF.

Downstream, I think it's maybe worth trying to enforce more for VCF 4.4. Right now, not only does it not require any checksum or assembly information, even the contig lines themselves are purely optional! I can see it may help for rapid hacking around, but we're past those days and should really focus on data provenance for future spec versions.

jkbonfield avatar Mar 02 '21 10:03 jkbonfield

A way to refer to a reference collection in 1 line instead of needing thousands of contig lines would make requiring it a much easier pill for most users to swallow.

lbergelson avatar Mar 04 '21 20:03 lbergelson

Don't we almost have that specced out in refget?

yfarjoun avatar Mar 04 '21 20:03 yfarjoun

It's getting there :) @yfarjoun

andrewyatz avatar Mar 04 '21 20:03 andrewyatz