ga4gh-schemas Round trip analysis of read data from BAM to GA4GH to BAM

Round trip analysis of read data from BAM to GA4GH to BAM

Open diekhans opened this issue 8 years ago • 2 comments

This ticket is to collect observations on the round trip analysis for read data.

Nov 24 '15 18:11 diekhans

analysis of schema types.

Identifiers - read groups, fragment names, etc, have no regexp associated with them. This means sooner or later people will start putting unicode in names or starting them with "@" symbols. While perfectly fine for GA4GH, you may wish to block such things in order to support round-tripping to SAM.

Data types:

GASearchReadsRequest: GASearchVariantsRequest:

A way to limit the number of fields returned would be ueful. Perhaps the caller is not interested in the sequence quality, or the auxiliary "info" fields. The ad-hoc API used by Google's GABrowse does this, requesting more data when showing sequence view and less data when only showing outline boxes.

GACigarOperator:

Note that CLIP_HARD if present has to be the outermost (first or last) cigar element. If CLIP_SOFT is present then it has to be the outermost, or second outer-most if CLIP_HARD is in use. Ie Hard, Soft, or , Soft, Hard.

GACigarUnit:

referenceSequence: replaces MD, but what about NM? They're commonly produced together in samtools calmd, although maybe this is just historical. I assume NM is still supplied in the "info" array.

GAProgram:

prev vs next program. SAM is also prev, but it blocks some things. You cannot have multiple prev programs, which makes merging files tricky (you end up with many distinct chains). "next program" probably has the same issue, but for splits. Ideally it should support multiple links rather than just one, or some graph form. Alternatively with a UUID per sam @PG line, when merging you can tell apart the case of lots of separate calls of the same program on many files vs one call of the same program followed by a file split. Low priority - I'm not sure how many people update these fields correctly in SAM right now.

GALinearAlignment:

Data isn't just mapped or unmapped; it can be "placed" but unaligned. (Chr/Pos but CIGAR *) This is commonly done where one read in a pair is mapped and the other is not. I believe the GA4GH format permits this, but perhaps needs spelling out explicitly. Is this simply achieved by having ALinearAlignment with a GAPosition and an empty GACigarUnit array?

GAReadAlignment:

This is confusing terminology. In SAM (I think!) a linear alignment is a single SAM record while a read alignment is either a linear alignment or a chimeric alignment (a set of linear alignments).
Every read must belong to exactly one read group. This is not a requirement in SAM (but perhaps ought to be). I assume for entries with no read-group there will be a dummy "Unknown" read group automatically created for them?
numberReads - good extension. It breaks SAM perhaps, but I'm not convinced it works consistently with more than 2 fragments anyway.
Also needs a way to indicate how many alignments exist for any specific record. This is (perhaps? it's not clear) equivalent to the TC aux type in SAM, but I'm not sure if that includes secondary matches.

The reasoning here is how to create an efficient algorithm for read collation or related tasks. Typically these can push reads into a hash table and then pull back out once the pair is found, removing from the hash. This makes the code efficient and memory usage low, however it gives wrong results when secondary / supplementary reads are present. Therefore the widely used alternative is simply to do a full-blown sort instead of location collation; unnecessarily CPU intensive.
readNumber - we need a way to indicate unknown. Is that -1? Sanger style sequencing could have universal forward / reverse primers and custom oligos (primer walks), with the former being just considered as "internal" and not a known Nth read number. Such data isn't used now, but we should still consider what to do if a value isn't known.
supplementaryAlignment - they don't necessarily have to map to different contigs, just far enough apart to be considered more than a minor deletion or in more than one orientation. "In each linear alignment in a chimeric alignment, the read will be hard clipped": this is only recommended for supplementary alignments, but isn't advisable for the non-supplementary reads if you wish to support easy conversion back to FASTQ again.
map<array> info As previously mentioned, this means it is not possible to distinguish XA:i:10 from XA:Z:10; ie strings container numbers. This means BAM to GA4GH back to BAM cannot be round-tripped unless our map includes the type byte too.

This section needs more expanding.
Think hard about fragmentLength/TLEN. Do you want to document how it is mostly implemented (5' to 3') or how it is defined in SAM (leftmost to rightmost). There are similar question marks about the meaning of pnext and rnext. One possibility to define in this schema is TLEN as used rather than as SAM and just accept there is a disparity in meaning (but don't actually check it or you'll open a HUGE can of worms!).
In SAM we support (abuse?) a few fields in order to wedge in things like fake consensus reads and assembly annotations. Do these still work in the GA4GH schema?

GAReferenceSet:

Is the md5checksum a concatenation without whitespace or with newline or similar delimiter? On unix it'd be easy enough to extract all the GAReference MD5s, one per line, and do | sort | md5sum.

General searching

Is it possible to resolve some of the impossible BAM queries, like all templates that overlap a region? (Eg read/1 at 10-11k, read/2 at 20-21k, with a range query of 15-16k.) Sometimes that's done algorithmically by just extending the query and doing an intersection. One possibility is an R-tree on templates as well as an R-tree on fragments, although this rapidly gets messy with supplementary and secondary alignments. It's probably too hard to address here, but maybe someone has a bright idea.

_Final musings_ On the whole issue of round-trips, BAM -> GA4GH -> BAM, one naively thinks it should be 100% lossless. However lossless also means no format innovation (just encoding innovations). We've seen countless discussions on samtools mailing lists , usually unresolved, regarding the meaning of TLEN, PNEXT and RNEXT fields. The aligners that implement these fields don't always agree with the definitions, indeed most disagree with the SAM specification.

Note that there is a difference in derived data being lost. If we sort by name, run samtools fixmates, and resort to position, then we may get different values. Any change to those values while not necessarily desireable is also not "lost" data, just slow to recreate data. (This probably isn't a useful distinction.)

There are also fix cases for bugs or things that simply do not matter. In BAM -> CRAM -> BAM conversions we see a number of things that change, but are apparently accepted. e.g.

Incorrect BAM flags. ("My mate is complemented" when infact it is not)
CIGAR strings on unmapped data (undefined behaviour)
Order of auxiliary tags.

NB: I haven't tested the API directly, this is purely from a read-through of the specification. Nor have I looked at the variant call side.

Nov 25 '15 09:11 jkbonfield

From https://github.com/ga4gh/server/pull/789 , we observed the following regarding round trip shortcomings:

"We can't really do a round trip since when we read reads from a BAM we are losing some information from the BAM if the read is unmapped or its mate is unmapped -- that is, we set either read.alignment or read.nextMatePosition to None, and in order to write that read to another BAM as it was in the original BAM, we require information that is not set on the GA object since those would be fields on what is a null Position object."

The fields we can't always write back correctly are (in pysam lingo): reference_start, next_reference_id, and next_reference_start. We also can't write back tags correctly because of this issue: https://github.com/ga4gh/server/issues/758

Dec 08 '15 15:12 dcolligan

ga4gh-schemas ga4gh-schemas copied to clipboard

Round trip analysis of read data from BAM to GA4GH to BAM

ga4gh-schemas
ga4gh-schemas copied to clipboard