hts-specs icon indicating copy to clipboard operation
hts-specs copied to clipboard

SAM/BAM/CRAM long chromosomes

Open jkbonfield opened this issue 2 years ago • 8 comments

The issue of dealing with chromosomes longer than 2Gbp came up again recently in samtools. Htslib/Samtools can cope with 64-bit chromosome coordinates for SAM, and also the (very drafty) CRAM 4.0, but not in standard CRAM nor BAM, both of which have signed 32-bit integers. Using SAM is a workaround, but it's not supported universally (eg htsjdk rejects such data).

To facilitate more tool chains and coping with limited formats such as BAM, one technique is to artificially break long chromosomes into smaller ones. Eg chr1a, chr1b, chr1c. However there's no standard meta-data here that can be used to indicate these form the same chromosome.

Things that we'd need to know:

  • Which @SQ lines belong together; a name of a previous SQ field or a name of a common external assembly contig ID.
  • What the start coordinate is on the original assembly chromosome, or a relative offset to a previous SQ line. This could also be implicit (a running total, so they butt up end to end in the order defined in SQ header), but we may wish to have overlap so we don't need to truncate alignments. In this case we'd need an offset into the previous chromosome.
  • Optionally what total length is the original assembly chromosome.

An example could be:

@SQ	SN:1a	LN:2000000000
@SQ	SN:1b	LN:2000000000	PN:1a	PO:1999999850
@SQ	SN:1c	LN:123456789	PN:1c	PO:1999999850

This specifies previous-name (PN) linking to an earlier SN, and PO as an offset into that previous name so we know how the two link. PN isn't strictly needed here, so maybe we can omit it. If we do have it then it raises questions, such as what happens if we have multiple PN to the same SN at the same coordinate? That's starting to feel like alternative alleleles, which is supported by AH, but that is strictly limited to alternate locus and banned from being used on the primary assembly contigs. The scheme above would also encode basic graphs, unless we add rules to forbid it. That's potentially useful, and also maybe a hindrance to tools that aren't aware of the headers.

The alternative to a daisy chain, with all the pros and cons that go with it, is always anchoring to a primary assembly with an absolute coordinate. Eg:

@SQ	SN:chr1a	LN:2000000000	PN:chr1	PO:1
@SQ	SN:chr1b	LN:2000000000	PN:chr1	PO:1999999850
@SQ	SN:chr1c	LN:123456789	PN:chr1	PO:3999999700

That's primary-name (chr1) and primary offset. The offset fields though are now 64-bit and code reading the file would need to cope with this. This could still permit overlapping data representing alternate locii, but we shouldn't be using it that way unless it also has an associated AH tag. We could also add a PL tag for the total primary length, so we know how much there is at either side of this fragment without scanning through all other header records. My gut instincts tell me this second representation is the better one, but it depends on whether we wish to permit graph construction in the SQ headers. Also perhaps A for assembly instead of P in the tag naming too.

What are people's thoughts on tackling this issue? It's clearly becoming a problem, especially with projects like Darwin Tree of Life, so we need to provide an appropriate way of managing the issue before people invent their own mechanisms.

Tagging @mcshane for input too.

jkbonfield avatar Jun 20 '22 08:06 jkbonfield

Hi @jkbonfield . I much prefer the second solution: it looks more intuitive to me, but also provides the original sequence name as PN. Once a format is agreed, could htslib do the conversion itself ? i.e.:

  • its API would be using 64-bits
  • reading/writing SAM would be as is
  • writing BAM would do an implicit split every ~512 MB, and reading would merge the chunks back
  • writing CRAMv3 would do an implicit split every ~2 GB, and reading would merge the chunks back

muffato avatar Sep 13 '22 14:09 muffato

The implicit vs explicit nature of this is probably outside of hts-specs itself, but it's an interesting point and agreed it's good if we can produce a nomenclature that makes that both possible and easy. I'll have to think some more on it, but it does sound doable, in much the same way we already transparently handle "alternative names".

jkbonfield avatar Sep 13 '22 14:09 jkbonfield