hts-specs Discussion: can we have programmatic output as well as PDF/LaTeX?

hts-specs defines many constants, but currently the only way to use them is to copy them manually into one's own implementation and keep them up to date by hand. If we published one or more artifacts containing these constants, maintainers could import them and use them directly in their code.

I think there are two parts to this:

Maintain a machine-readable file containing the relevant constants, organized in a suitable hierarchical format with rich data types.
Emit code containing classes/constants that implementers can use directly, without having to re-code them.

Ideally, each of the hts-constants would be defined in a single "original" place, and all the other uses would be automatically generated from that.

Here's my idea for an implementation (based on https://github.com/aantn/reconstant):

Have a configuration file that contains the constants of interest: for example, the SamTags, SAM header tags, and "magic" strings. This configuration file will be the only definitive place for adding or modifying constants. Artifacts in different languages (Python, Java, C, Rust, LaTeX, R) will be generated via the Makefile.

SamTags.tex (for example) will include and use said artifact.
A version "release" will include packaging up the code and making it available for different languages using the various artifact-distribution options available.
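
A minimal sketch of the generation step described above, assuming the constants have already been read from the configuration file into a plain dictionary. The names `CONSTANTS`, `render_python`, and `render_c` are hypothetical and exist only for illustration; this is not reconstant's API, and the two value lists are only a small sample.

```python
# Hypothetical single source of truth. In the proposal this would be parsed
# from the configuration file rather than written inline.
CONSTANTS = {
    "HD_SO_values": ["unknown", "unsorted", "queryname", "coordinate"],
    "SQ_TP_values": ["linear", "circular"],
}


def render_python(constants):
    """Render the constants as the text of a Python module."""
    lines = ["# Generated file; do not edit by hand."]
    for name, values in constants.items():
        lines.append(f"{name.upper()} = {tuple(values)!r}")
    return "\n".join(lines) + "\n"


def render_c(constants):
    """Render the same constants as the text of a C header."""
    lines = ["/* Generated file; do not edit by hand. */"]
    for name, values in constants.items():
        joined = ", ".join(f'"{v}"' for v in values)
        lines.append(f"static const char *const {name.upper()}[] = {{{joined}}};")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_python(CONSTANTS))
    print(render_c(CONSTANTS))
```

Each target language gets its own small renderer, so a new constant only has to be added in one place. A Makefile rule could then write each rendered string to the matching output file.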

I'm mostly thinking about the SamSpec, but, of course, different sub-specs could choose to use this mechanism or not, individually, for example, VCF, refget, etc.

Feb 16 '25 18:02 yfarjoun

I like this idea for its formality, but I'm not sure language-specific details or implementations belong with the specification.

Shouldn't libraries, in any particular language, be providing such constants in the first place? For example, in the Rust library noodles, there are SAM data field tags, SAM record flags, etc. Compared with the Rust output in #815, there are differences in nomenclature (e.g., the data tags use full names rather than short codes) and type definitions (e.g., flags have a type-safe wrapper rather than being an integer).

Regarding enums, note that the SAM/BAM specification maintainers don't consider field values to be closed sets (see https://github.com/samtools/hts-specs/issues/725#issuecomment-1542108365). noodles changed enums to common string constants (e.g., SAM header read group platform values) because of this argument.
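
A rough illustration of the closed-versus-open distinction, written in Python for brevity. This is not how noodles implements it; `PlatformEnum`, `KNOWN_PLATFORMS`, `parse_closed`, and `parse_open` are hypothetical names, only three platform values are listed, and `FUTURESEQ` is a made-up value standing in for a platform that is not on the list.

```python
from enum import Enum


class PlatformEnum(Enum):
    """Closed set: any value not listed here is rejected."""
    ILLUMINA = "ILLUMINA"
    ONT = "ONT"
    PACBIO = "PACBIO"


# Open set: the same names kept as plain string constants.
KNOWN_PLATFORMS = frozenset({"ILLUMINA", "ONT", "PACBIO"})


def parse_closed(value):
    """Raise ValueError for any value that is not an enum member."""
    return PlatformEnum(value)


def parse_open(value):
    """Keep the value as-is and only report whether it is a known one."""
    return value, value in KNOWN_PLATFORMS


if __name__ == "__main__":
    print(parse_open("FUTURESEQ"))  # value is preserved, flagged as unknown
    try:
        parse_closed("FUTURESEQ")
    except ValueError as err:
        print("closed enum rejected it:", err)
```

With the enum, a file carrying a value outside the list fails to parse. With string constants, the value passes through and the caller decides what to do about it.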

Mar 02 '25 17:03 zaeleus

I understand your hesitation, and definitely do not want to pretend that the implementation provided in #815 is ideal or even good. The point of that PR was to provide an implementation that would clarify my intention regarding how hts-specs might provide a single point definition of constants. The details of the implementation can be discussed in that PR, after we discuss here if the idea is worthwhile....

The reason I thought it would make sense for hts-specs to provide a definitive set of constants is that it makes it easy to include and recognize the library. I've seen many (mostly Python) packages that re-define the hts-specs constants that they need. This provides ample opportunity for mistakes and misunderstandings when reading/using these packages.

If the consensus is that such a collection of small libraries is pointless, I'm happy to close this issue....and I also accept the fact that I'm a little late to the game and that the existing libraries are unlikely to include the ones we may release here and make use of them...but I am still curious to see what the community thinks.

Mar 02 '25 21:03 yfarjoun

I can see the use of something like this: for example, to use @zaeleus's usual bugbear :smile:, it would be useful to provide an up-to-date list of the valid @RG-PL values in a machine-readable format. However I don't think the specification should be in the business of inventing additional names for all these tags/keywords/codes/etc, particularly when some implementations may have already invented their own names for them. And IMHO #815's suggested description field is an unnecessary maintenance burden when by definition it accompanies the full description in the spec.

I could support adding something lighter-weight, listing the tags and keywords that are currently defined by the specification that are subject to being added to in future. For example, for SAM/BAM/CRAM this could be a JSON file something like pub/sam.json:

{
  "headers": {"HD": ["VN", "SO", "GO", "SS"], …},
  "HD_SO_values": ["unknown", "unsorted", "queryname", "coordinate"],
  "HD_GO_values": ["none", "query", "reference"],
  "SQ_TP_values": ["linear", "circular"],
  "RG_PL_values": ["CAPILLARY", "DNBSEQ", "ELEMENT", "HELICOS", "ILLUMINA", "IONTORRENT",
                   "LS454", "ONT", "PACBIO", "SINGULAR", "SOLID", "ULTIMA"],
  "record_tags": {"AM": "i", "AS": "i", "BC": "Z", …, "CG": ["B", "I"], …},
  "draft_record_tags": {}
}

IMHO that would suffice, and it would be best to leave it up to implementations what, if anything, they wanted to do with the data in such a file. I don't think it would be worthwhile to have the LaTeX spec derive these items from the machine-readable version; e.g., we have textual descriptions for some of the platform values that would be non-trivial to implement in code in LaTeX. So I don't think adding something like this JSON file would be a big maintenance burden, even though the tags and value keywords are duplicated in it.
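
As a sketch of what an implementation might do with such a file, assuming the layout shown above: the excerpt below keeps only entries that appear in the example (the elided ones are left out rather than guessed), and it is embedded as a string so the snippet is self-contained. The helper names are hypothetical, and treating a list such as `["B", "I"]` as "type `B` with subtype `I`" is an assumption about the intended encoding.

```python
import json

# Trimmed excerpt of the proposed pub/sam.json, embedded for self-containment.
SAM_JSON = """
{
  "headers": {"HD": ["VN", "SO", "GO", "SS"]},
  "HD_SO_values": ["unknown", "unsorted", "queryname", "coordinate"],
  "HD_GO_values": ["none", "query", "reference"],
  "SQ_TP_values": ["linear", "circular"],
  "record_tags": {"AM": "i", "AS": "i", "BC": "Z", "CG": ["B", "I"]}
}
"""

spec = json.loads(SAM_JSON)


def unknown_header_fields(record_type, fields):
    """Return the fields that the file does not list for this header line type."""
    known = set(spec["headers"].get(record_type, []))
    return [f for f in fields if f not in known]


def tag_type_matches(tag, type_code, subtype=None):
    """Compare an observed tag type with the listed one.

    Returns None when the tag is not listed at all (for example, a
    locally defined tag), so callers can tell "unknown" from "mismatch".
    """
    expected = spec["record_tags"].get(tag)
    if expected is None:
        return None
    observed = type_code if subtype is None else [type_code, subtype]
    return observed == expected


if __name__ == "__main__":
    print(unknown_header_fields("HD", ["VN", "SO", "XY"]))
    print(tag_type_matches("CG", "B", "I"))
```

Nothing here rejects unlisted values outright; the helpers only report what is or is not in the file, which leaves the policy decision to the implementation.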

> Regarding enums, note that the SAM/BAM specification maintainers don't consider field values to be closed sets

Reality also does not consider these field values to be closed sets.

Mar 04 '25 02:03 jmarshall

Thanks @jmarshall for the thoughts and comments.

I agree that the autogenerated LaTeX is possibly a step too far, and without it there's no need for the descriptions and types.

The reason I suggested autogenerated code was that it would then be relatively straightforward to autogenerate a collection of libraries/packages (one per language) that could be included into a project with the language-appropriate packaging tool.

I like the idea of a JSON file with the tags/values; I was simply unaware of a good way of including that in a code project. This is not so surprising, given that I'm far from being an expert on software packaging....

Do you or anyone else know of a good way of packaging a JSON file as a first-class citizen in different programming languages?

Mar 04 '25 15:03 yfarjoun