hts-specs icon indicating copy to clipboard operation
hts-specs copied to clipboard

Sequence Collections in SAM headers

Open jkbonfield opened this issue 8 months ago • 5 comments

This is to start the ball rolling on adding sequence collections to the SAM header, so we can use it in SAM, BAM, CRAM, but also a related issue for VCF and BCF. (It's easier there though due to the free form nature of the header so for now we can concentrate on the harder case and VCF hopefully will follow in a simpler fashion.)

I see a few potential avenues.

  • Adding a new @SC header tag. Cleanest solution as it's a direct separation of data types, but problematic for parsers.

  • Permitting an SQ header without SN or LN, but with a new sequence collection tag (eg SC). This is basically a substitute for @SC, but one that wouldn't break (as many) parsers. (Or alternatively, a known fake SN/LN field, eg SN:SeqCol LN:0).

  • Keeping all the existing @SQ header lines but populating the sequence collection tag. This is comparable to UR where we indicate the reference file, but now it's the collection this SQ belongs in. Inevitably it's somewhat verbose as we specify the same thing many times.

Inherently there are problems here, but it's hard to know how to navigate them such that we have the least pain while actually gaining useful functionality. The cleanest method breaks pretty much everything, from SAM parsing to BAM to CRAM. While the latter method has the fewest problems with existing parsers but it doesn't really add anything other than improved data provenance. Not to belittle it - that is useful - but it's basically just a synonym for identifying the assembly we used which we can already do.

This issue arose out of the latest RefGet meeting. Other topics that may be relevant were:

  • SQ M5 tag is for MD5sum, but sequence collections use sha512t24u.
  • Caching. We'd probably treat SeqCol just as another way of populating a reference cache, so we fetch sequences once and from then on we either load directly from local repositories (if we have SQ headers) or fetch the small SeqCol meta-data to get sequence IDs. (With caveats on types of checksum.)
  • Use of DRS for discovery from within sequence collections, and permitting more distributed reference servers, but that's early and ongoing work still (to be tracked in their github repo instead of here).
  • Whole gzipped fasta downloads for the entire collection (again via DRS and housed on e.g. s3 buckets). Not so useful except for initial population of a local cache, and even then it may include duplicate data when we have multiple variations on the same assembly.

Thoughts on where to start? I'm aware this is likely to be a long term issue, so discussing it now and getting the ball rolling may help speed things up when the sequence collections is finalised as well as perhaps steering their implementations in the most useful directions.

jkbonfield avatar Apr 30 '25 16:04 jkbonfield

I would prefer to use the @SC option, but it seems unlikely that we'll see any changes applied to htsjdk, at least in the foreseeable future. At which point this becomes more of a social discussion -- do we want to get methods developers to switch to other libraries, or do we want to maintain compatibility with IGV and Co?

ohofmann avatar May 06 '25 04:05 ohofmann

I think asking developers to switch programming languages is a much bigger uphill struggle. People can obviously vote with their feet, but most people simply don't have time to change from one library to another as it's a big task. It would make more sense to try and persuade people depending on htsjdk to support it more by making PRs against it. Similarly for every other implementation out there.

Note there is a big issue of people not upgrading software too. Sometimes it's for (perhaps flawed) validation / sign-off reasons where a pipeline has to go through a certification process and even retrofitting bug fixes is a massive reevaluation process, so people stick with outdated programs instead. Sometimes even with known security issues. I believe such policies need to be more nuanced, but even so it'd be a big ask to get someone to upgrade for a feature change rather than an important bug fix. So backwards compatibility is important.

All a bit sad things are so inflexible, but we are where we are. It needs careful consideration.

jkbonfield avatar May 06 '25 08:05 jkbonfield

can you add some more motivating text or background to this issue?

cmdcolin avatar May 06 '25 16:05 cmdcolin

The motivation came jointly out of the team handling the Sequence Collections part of RefGet and from EBI's desires to reduce the load on their server.

Personally I don't really get the latter as it shouldn't be that bad and they ought to be able to do better traffic shaping to simply block any serial offenders. One thing they're hoping for is asking for a single object should be much more efficient than many smaller queries for header information, but this largely comes down to a very heavy-weight server where each http connection ends up firing up a new database connection behind the scenes. A lot of the impetus to change is therefore arguably better solved by rewriting the refget server, or better yet persuading someone in GA4GH to fund a proper CDN solution. That plus more sensible local caching should alleviate the need for going often to the EBI for references. (Sadly it's stymied by poorly configured cloud setups, which is why we're looking at removing default fall-back to the EBI from htslib, so you can't accidentally fall back into a DDOS mechanism.)

That said, sequence collections are a useful thing to have in and on their own merit. They vastly improve the robustness and provenance for reference genomes used. Far too often we see minimal headers with zero information on what they are or where they came from. We could potentially make assembly information mandatory in newer headers, but then it needs defined vocabularies and a centralised list of assembly identifiers, etc. Sequence Collections basically already is that, with added utility.

So I agreed we'd open it up for discussion here and see where it goes.

jkbonfield avatar May 07 '25 08:05 jkbonfield

  • Don't want to repeat the sequence collection for every single SQ line. It boosts the file size and doesn't alleviate the maximum SAM header size.

  • Maybe use @HD (once per file) or @RG (once per "run")?

Do we want to limit us to just one collection? Eg when merging two files (a human and a pathogen?). Arguably forcing one only is a good thing, while also being a limitation. It's also possible to construct a new collection which is a union of the two, but we then need to publish it somewhere.

That therefore lends weight to using an HD SC:tag to identify the sequence collection.

Much like M5 tags, these would be looked up externally via an implementation specifc REF_PATH style mechanism or config file. I would also expect caching or some local ancillary file (akin to the .dict files) to avoid external network requirements.

We discussed whether SQ lines would still be mandatory. Everything in SQ could be in the collection, including topology as optional attributes, but for backwards compatibility it's not something I would envisage us actually removing any time soon. We probably want to make it optional, and when both SQ and HD/SC are present we need a rule about handling ambiguities.

More food for thought anyway. Please do chime in with pros, cons and alternative ideas.

jkbonfield avatar Jun 17 '25 15:06 jkbonfield