hts-specs icon indicating copy to clipboard operation
hts-specs copied to clipboard

Copy the tag type text from SAM spec to SAMtags (PR #804)

Open jkbonfield opened this issue 1 year ago • 5 comments

SAM section 1.5 clearly defines the standard SAM tag types along with the expanded codes used in the B byte-array type. This text has been copied into the SAMtags document to remove a rather woolly definition there. The text describing lower-case tags has been removed, because this was already discussed in further detail at the end of the SAMtags document (and is itself somewhat woolly when it comes to half-upper half-lower combinations due to the addition of draft tags).

Fixes #798

jkbonfield avatar Jan 07 '25 15:01 jkbonfield

Changed PDFs as of 7f4af4b9d87ef99b3701d78d63512c0c4ae2cbd1: SAMtags (diff).

github-actions[bot] avatar Jan 07 '25 15:01 github-actions[bot]

Why not link to § 1.5 "The alignment section: optional fields", rather than duplicating the serialization format here? The introduction notes that this is a companion document, which associates the two together anyway.

Alternatively, you could replace the type codes with generalized types, e.g., i => integer, Z => string, B,I => uint32_t[], etc. This would remove the need to describe the format.

zaeleus avatar Jan 22 '25 20:01 zaeleus

I agree with @zaeleus that it'd be better to have a less woolly[^1] summary of the types that are actually used in the table and a pointer to SAMv1.pdf “for the complete details” rather than duplicate the complete details here.

I don't think we should use informal types such as integer, string, uint32_t[]. The reason the types are listed in the quick-reference table is that these fields need to be of exactly the specified type, so we do need to be precise.

[^1]: Less woolly than the current summary in SAMtags.pdf

jmarshall avatar Jan 28 '25 20:01 jmarshall

If this is woolly then the SAM specification is woolly too as this is literally just replacing the poor text originally here with the SAM specification text. Or maybe I just misunderstand and you're saying this is better than the old text? I simply tightened up what was there before as it had very vague terms ("character" could permit UTF-8, and "hexadecimal array" could imply comma-separated lists of hex values).

We do need to be careful about sizing though. SAM is not specifically tied to 32-bit (signed or otherwise), and we've exploited this in the past when we've gone from 32-bit coordinates to 64-bit coordinates. The BAM serialisation has issues, but the SAM encoding does not and we shouldn't needlessly paint ourselves into corners by over specifying as it restricts migration to better encodings. (Sadly CRAM made those same mistakes though and I regret not fixing them for CRAM 3.0)

Arguably however, and maybe contentiously, if we wish to only have one canonical source for the meaning of the type fields then it perhaps should be SAMtags. The reason it was split in the first place was that the tag encoding is shared between SAM, BAM and CRAM (and now other things like extended FASTQs and GFA). Hence keeping it in its own document.

jkbonfield avatar Jan 28 '25 20:01 jkbonfield

SAM is not specifically tied to 32-bit (signed or otherwise), and we've exploited this in the past when we've gone from 32-bit coordinates to 64-bit coordinates.

SAM positions are tied to 32-bit signed integers:

Col Field Type Regexp/Range Brief description
4 POS Int [0, 231-1] 1-based leftmost mapping POSition

SAM data field values, however, are not:

The number of digits in an integer optional field is not explicitly limited in SAM.

zaeleus avatar Jan 28 '25 20:01 zaeleus