
SAM header encoding

Open d-cameron opened this issue 1 year ago • 6 comments

The SAM specification does not specify the encoding of the header.

S1 states: "Unless explicitly specified elsewhere, all fields are encoded using 7-bit US-ASCII".

S1 states that alignment lines have fields, and for headers, midway through S1.3 it says that each data field follows a format, which implies there are parts of the header which are not "fields". For example, in "header line begins with the character ‘@’", the ‘@’ is not described as a "field".

I recommend changing S1 to: Unless explicitly specified elsewhere, all content is 7-bit US-ASCII encoded.

Newline definition is also missing. Is that a can of worms that I don't want to reopen?

d-cameron, Aug 02 '22 06:08

I'm glad all other problems are solved, so that we can focus on quibbling about the text describing something for which the intention is IMHO clear… :smile:

The intention is that SAM files are text files whose entire contents are US-ASCII encoded in the obvious (8-bit) way, apart from several particular fields that are individually specified as allowing UTF-8 (i.e., Unicode characters encoded as UTF-8) in their values. Or equivalently, that SAM files are UTF-8-encoded text files without BOM but characters beyond ASCII are only allowed in certain particular field values as individually specified.
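A minimal sketch of that intention for a single header line, in Python. The helper name `check_header_line` is hypothetical, and the set of fields treated as UTF-8-capable (the `DS` tags listed below, plus everything on `@CO` lines) is an assumption to be checked against the spec version you target, not a statement of what the spec says:

```python
# Assumption: which (record type, tag) pairs may carry UTF-8 values.
# Illustrative only; verify against the SAM spec version you target.
UTF8_ALLOWED = frozenset({("@SQ", "DS"), ("@RG", "DS"), ("@PG", "DS")})


def check_header_line(raw, utf8_allowed=UTF8_ALLOWED):
    """Return a list of encoding problems in one header line (newline removed)."""
    problems = []
    if raw.startswith(b"\xef\xbb\xbf"):
        problems.append("UTF-8 BOM present")
        raw = raw[3:]
    fields = raw.split(b"\t")
    if not fields[0].isascii():
        problems.append("record type is not ASCII")
    record_type = fields[0].decode("ascii", errors="replace")
    for field in fields[1:]:
        if field.isascii():
            continue  # plain ASCII is always acceptable
        tag = field[:2].decode("ascii", errors="replace")
        # Assumption: @CO comment text is treated as allowing UTF-8 throughout.
        if record_type != "@CO" and (record_type, tag) not in utf8_allowed:
            problems.append(f"non-ASCII bytes in {record_type} field {tag!r}")
            continue
        try:
            field.decode("utf-8")
        except UnicodeDecodeError:
            problems.append(f"invalid UTF-8 in {record_type} field {tag!r}")
    return problems
```

Working on bytes rather than decoded strings keeps both descriptions above equivalent: an ASCII-only file passes unchanged, and non-ASCII bytes are only accepted where they both sit in a designated field and decode as UTF-8.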

Certainly this could be expressed better — e.g., I have never been happy with the “Unless explicitly specified elsewhere” clause that seems to suggest anything could happen in some parts of the file, and would prefer something like “Apart from certain field values explicitly specified as allowing UTF-8, …”. And the bit about locales in that sentence needs some copy editing and explicitness about what it's trying to tell us.

Do you have examples where implementors have been confused as to character encoding issues in SAM files? Or examples where users have had difficulties due to such issues? (Note that in this tabix report the issue was that the user did not realise his input file was UTF-16-encoded, not that he had any expectation that these tools would accept a UTF-16-encoded file.)
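For the "user did not realise the file was UTF-16" situation, a tool could give a friendlier diagnostic by sniffing the first few bytes. This is only a sketch: `sniff_unexpected_encoding` is a hypothetical helper, not something htslib or tabix is known to provide, and the NUL-byte heuristic is an assumption that only makes sense for files expected to be ASCII-like text.

```python
import codecs

# Order matters: the UTF-32-LE BOM begins with the UTF-16-LE BOM bytes.
_BOMS = [
    (codecs.BOM_UTF32_LE, "UTF-32-LE"),
    (codecs.BOM_UTF32_BE, "UTF-32-BE"),
    (codecs.BOM_UTF8, "UTF-8 with BOM"),
    (codecs.BOM_UTF16_LE, "UTF-16-LE"),
    (codecs.BOM_UTF16_BE, "UTF-16-BE"),
]


def sniff_unexpected_encoding(prefix):
    """Return a description if the leading bytes do not look like plain
    ASCII/UTF-8-without-BOM text, else None."""
    for bom, name in _BOMS:
        if prefix.startswith(bom):
            return name
    # ASCII text stored as BOM-less UTF-16/32 has NUL in most positions.
    if b"\x00" in prefix:
        return "contains NUL bytes (possibly UTF-16/32 without BOM)"
    return None
```

Calling this on the first few hundred bytes of an input would let a tool report "this looks like UTF-16" instead of a generic parse failure.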

So IMHO we should rephrase this somewhat along the lines of my second paragraph above, but I think the intention is relatively clear to most and I don't believe character encoding issues are causing any particular confusion in SAM.

jmarshall, Aug 02 '22 08:08

Agreed. While it could almost certainly be tightened up, that's more from a strictly-speaking stance rather than a lack of clarity in what the intention is for encodings.

Regarding line endings, this is something which probably should have been defined from the outset but was not. As a consequence of SAM being defined as textual and differing definitions of text between systems, it's reasonable to assume that tools should be able to cope with textual data appropriate for the system they are running on, plus if the tool authors are kind, textual data that was produced on systems different to the one being used. As it happens both htslib and htsjdk (as well as scramble and sambamba, probably more) already support both LF and CR-LF for SAM. While it's not explicitly spelt out in the spec, it seems most implementations already do the right thing. We could add it as a reminder to people to consider foreign data sources, but so far it's not been a problem.

jkbonfield, Aug 02 '22 08:08

As for newlines:

SAM files are described as “text files”, so end-of-line indicators are (of course) whatever the host operating system conventions say they are.[^1] As a quality of implementation matter, implementations may wish to additionally accept alternative newline indicators — such as accepting CR-LF on a Unix host, or LF on a Windows host.

This approach of punting to the local operating system convention follows the sort of phrasing used by e.g. the C and C++ standards for program source code, and IMHO is a better approach than specifying particular newline characters and character sequences (as the VCF spec does). Specifying particular characters would prevent SAM from being implemented on weird platforms, and it may be impractical for some implementations to accept non-native newlines like CR-LF (e.g. quick and dirty scripts).

It could be worth talking about newline characters in a footnote and recommending accepting CR-LF or conversely, but I would not want to see this be made normative.

A more interesting question is: what line termination should be accepted in BAM files' plain text header block? This is a text blob embedded in a binary file, so unlike plain text SAM files implementations can't just use fgets() to have the C library take care of the newline representation. In practice, IIRC implementations (htslib and htsjdk at least) accept both LF and CR-LF here — probably as a byproduct of their QoI niceness for SAM text files — but it is in greater need of being specified. (See also samtools/samtools#661 which also focusses on whether a final newline is required.)
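As a rough illustration of the lenient behaviour described for the BAM header text, here is a Python sketch that accepts both LF and CR-LF and reports whether a final terminator was present. `split_header_text` is a hypothetical name, this is not how htslib or htsjdk actually implement it, and stripping trailing NUL padding is an assumption about what some writers emit:

```python
def split_header_text(text):
    """Split a BAM header text blob (bytes) into lines.

    Accepts LF or CR-LF as the terminator. Returns (lines, has_final_newline)
    so the caller can decide whether a missing final newline is an error.
    """
    # Assumption: some writers NUL-pad the text block; drop that padding.
    text = text.rstrip(b"\x00")
    if not text:
        return [], True
    has_final_newline = text.endswith(b"\n")
    parts = text.split(b"\n")
    if has_final_newline:
        parts.pop()  # discard the empty piece after the last terminator
    # A CR immediately before LF is treated as part of the terminator.
    lines = [p[:-1] if p.endswith(b"\r") else p for p in parts]
    return lines, has_final_newline
```

Returning the final-newline flag separately keeps the policy question (is an unterminated last line valid?) out of the splitting logic, which is the part the thread suggests implementations already agree on.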

[^1]: i.e., what the conventions say for an ASCII text file, as all newlines are outwith any SAM field values designated as allowing UTF-8.

jmarshall, Aug 02 '22 08:08

The SAM header lines in BAM I was assuming should always be complete and end in a newline, but looking at the SAM spec, the regexp is per-line with nothing to say about the line endings. As such I think it's maybe a valid (if rather unusual and brave) interpretation to consider the newline as a separator between lines and not part of the line itself, hence not required at the end. Frankly I'd consider that a step too far, and when looking for trouble don't be surprised if you find it!

I'd be in favour of adding text to require all lines, in both SAM header / alignments and the BAM header portion, to end with an appropriate text line terminator. In a footnote we can cover the recommendation of also supporting line terminations for other system types.

Edit: Regarding the original comment on "fields" vs "content", I agree it's clearer. Although even in UTF-8 there isn't any alternative encoding of horizontal tab, so it's not possible for someone to be "clever" and interpret the spec as permitting anything other than U+0009 for the field separator, so practically speaking fields vs content is the same thing I think.

jkbonfield, Aug 15 '22 09:08

Hmmm… I think my previous two comments have somewhat contradicted each other, as the first comment was really based too much on my own worldview of Unix as the natural platform. So to unify both encoding and newlines under the local platform “text files” rubric:

SAM files are described as “text files”, so there is a point of view that — just as end-of-line indicators are whatever the host operating system conventions say — the encoding is whatever the host operating system conventions say it is. So on a platform where UTF-16 is a common conventional encoding for text files that point of view says that tools should accept SAM files encoded like that. (Windows may be such a platform.) In that case, the text about UTF-8 being valid in certain specified fields is misstated: what it means to say is that Unicode characters beyond ASCII are valid in those fields, and to remain agnostic about the encoding.

An 8-bit ASCII encoding is mostly a lowest common denominator, and will be usable on most platforms. And so perhaps the spec should call that out as a transportable form of a SAM file — i.e., the 8-bit ASCII with a few specific UTF-8 portions a.k.a. UTF-8-sans-BOM with non-ASCII allowed only in a few specific portions form described in the previous comments.

But SAM files are described as “text files”, so that point of view would state that other locally conventional encodings would be acceptable in addition to the transportable form from the previous paragraph.

In practice, neither samtools nor Picard allows any such thing (at least on macOS or Unix, where UTF-16 is not conventional). I am not sure other tools do either, but I don't use exotic platforms enough to know.

So perhaps we would do well to revise the text to explicitly bless the ASCII+designated-UTF-8-no-BOM encoding via the previous descriptions (and mention particular newline sequences in a footnote), and IMHO also allow locally conventional “text files” and hence locally conventional end-of-line representations.

jmarshall, Aug 16 '22 20:08

At last month's meeting, @daviesrob pointed out that the specification text “all fields are encoded using 7-bit US-ASCII” is intended to disallow UTF-16 et al and require the lowest common denominator 8-bit ASCII encoding described in the previous comments.

(It does after all have the word “encoded” in it! Nonetheless I had always read this as talking about 7-bit data transmission (akin to ye olde SMTP concerns), bounced off it in confusion, and assumed it was just a weirdly phrased way of saying “only 7-bit ASCII code points are allowed” and did not in fact specify an encoding…)

It turns out that the hts-specs group has in fact discussed this encoding question previously and formed a consensus: that text was added in response to issue #197, in PR #205.

That issue discusses flavours of ASCII (in order to tie down variance in the code points of various punctuation characters) rather than substantially different encodings like UTF-16, but nonetheless it seems to have resulted in a decision interpreted as specifying plain ASCII encoding only. Minutes from the meetings of the time are not available.

The discussion on that issue also clarifies that the specification text about locales is intended to reinforce that floating point numbers are to be represented in SAM text fields using . (FULL STOP) as the decimal point, not any other locale-specific decimal point character. (This should already be obvious from the regexes, but it is indeed good to reinforce it.)
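To make the decimal-point requirement concrete, here is a small Python sketch. The regular expression is written in the style of the spec's floating-point pattern, but treat the exact pattern as an assumption and check it against the spec; `parse_sam_float` is a hypothetical helper name.

```python
import re

# Assumption: a pattern in the style of the SAM spec's float regex.
# Only "." (FULL STOP) is accepted as the decimal point.
FLOAT_RE = re.compile(r"[-+]?[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)?")


def parse_sam_float(text):
    """Parse a SAM text float, rejecting locale-style forms such as '1,5'."""
    if not FLOAT_RE.fullmatch(text):
        raise ValueError(f"not a valid SAM float: {text!r}")
    # Python's float() ignores the process locale, unlike locale.atof().
    return float(text)
```

In languages where number parsing and formatting follow the current locale (for example C's `strtod` and `printf` after `setlocale`), the equivalent code would need to pin the "C" locale or use locale-independent routines.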

jmarshall, Sep 13 '22 12:09