
Extracting blocks of unaligned reads from CRAM

Open · schelhorn opened this issue Jan 19 '16 · 6 comments

I was wondering if htslib/samtools support extraction of blocks of unaligned reads similar to what a bgzipped FASTQ file that has been indexed with grabix currently allows.

This is relevant for processing a single FASTQ across many compute nodes by allowing each process (such as a read mapper) to extract windows of reads (such as "the third block of 1M reads") without having to seek through the whole FASTQ file first (due to the index). For us, the ability to do so is highly relevant to replace the aged FASTQ with unaligned CRAM for WGS read mapping.

Note that this is different from extracting reads that align to a window of the reference genome; rather, I would expect to be able to extract the unaligned reads in the order of the FASTQ file that was used to generate the unaligned CRAM. A nice addition would be an option for filtering by mate ID, so that one could specify which end (or both) should be returned. Might this be possible while also limiting I/O by using the blocks/slices of the CRAM format?

schelhorn · Jan 19 '16

On Tue, Jan 19, 2016 at 08:51:38AM -0800, Sven-Eric Schelhorn wrote:

> I was wondering if htslib/samtools support extraction of blocks of unaligned reads similar to what a bgzipped FASTQ file that has been indexed with grabix currently allows.
>
> This is relevant for processing a single FASTQ across many compute nodes by allowing each process (such as a read mapper) to extract windows of reads (such as "the third block of 1M reads") without having to seek through the whole FASTQ file first (due to the index). For us, the ability to do so is highly relevant to replace the aged FASTQ with unaligned CRAM for WGS read mapping.

It's doable in an around-the-houses sort of way.

If you build an index of the cram file, then the .crai file contains the offsets of each cram container / slice. You can then produce valid cram streams simply by using dd on the input file to stitch the header block together with other blocks, taking care to handle the EOF case too. It's messy, but it demonstrates the principle.

I've done this manually to produce small cram files demonstrating bugs, where a huge file has the issue and I want to produce a smaller test case.

The .crai file is just gzipped text. I can't guarantee it will never change, but you could explore whether this technique is enough to cover your needs. If so, we can consider getting a more robust and supported implementation into htslib, to avoid external tools having to parse the .crai themselves.
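Concretely, each .crai line holds six tab-separated integers (reference id, alignment start, alignment span, container byte offset, slice offset within the container, and slice size), so a minimal sketch for peeking at the offsets, assuming a file named _.cram, is:

$ zcat _.cram.crai | awk '{print "container at byte " $4 ", slice size " $6}' | head -3
container at byte 7362, slice size 103725
container at byte 111322, slice size 93967
container at byte 205527, slice size 120408

(The output values here are taken from the worked example in the next comment; treat the column layout as unofficial and subject to change.)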

James


jkbonfield · Jan 19 '16

James -- thanks so much. This sounds promising, although the bits about producing valid cram streams with dd are beyond my CRAM skill level when it comes to producing a proof-of-principle implementation for bcbio. I could do the .crai parsing bits and test it if I had a better interface, or some guidance on how to do the retrieval. Thanks again for the ideas.

chapmanb · Jan 19 '16

I admit it's esoteric! But here's a worked example.

$ zcat _.cram.crai
19  18868     136639  7362      210  103725
19  155366    95912   111322    212  93967
19  251145    86228   205527    214  120408
19  337223    88123   326176    214  118523
...[cut for brevity]...
19  62737042  84894   65556969  210  179619
19  62821820  83512   65736827  206  105559
19  62905183  60287   65842621  218  109915

These are reference numbers (counting from 0, so 19 is chr20 of this sample). Staden io_lib has a cram_dump utility (hideously written, and primarily a debugging aid for me) that reveals a bit more about the structure of this file.

$ cram_dump _.cram
File definition structure
    Major vers: 3
    Minor vers: 0
    file_id:    /tmp/_.cram

BAM header:
@HD     VN:1.4  SO:coordinate
...[rest of header]...

Container pos 7362 size 103935
    Ref id:            19
    Ref pos:           18868 + 136639
    Rec counter:       0
    No. recs:          10000
    No. bases          1466522
    No. blocks:        24
    No. landmarks:     1
    {210}
...
... [lots more data ending with last real container] ...
Container pos 65842621 size 110133
    Ref id:            19
    Ref pos:           62905183 + 60287
    Rec counter:       6510000
    No. recs:          6772
    No. bases          995898
    No. blocks:        24
    No. landmarks:     1
    {218}

...[followed by EOF block, actually an empty container itself]...
Container pos 65952783 size 15
    Ref id:            -1
    Ref pos:           4542278 + 0
    Rec counter:       0
    No. recs:          0
    No. bases          0
    No. blocks:        1
    No. landmarks:     0
    {}

This shows the reference locations as well as the file offsets. The reference location "id 19 pos 18868 + 136639" is the same data you're seeing in the first line of the index file, and the file location "pos 7362" is in there too. The container size is a bit different, though: the difference in file offset between the 1st and 2nd containers is 111322 - 7362 == 103960, which doesn't match the size fields exactly; the discrepancy comes down to whether or not a given size includes the container header structure itself. Therefore the easy solution is just to subtract start offsets.
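As a sketch of that subtraction (not in the original thread, and assuming the six-column .crai layout above), an awk one-liner can print each container's start offset together with the length derived from the next container's offset:

$ zcat _.cram.crai | awk 'NR > 1 { print prev, $4 - prev } { prev = $4 }'

The final container has no successor, so its length must instead come from the file size minus the 38-byte EOF block, as the expr calls in the transcript below show.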

To stitch a valid cram file together, we need the CRAM and BAM headers (i.e. all the bits before the first container: the first 7362 bytes in my example), a series of containers, and the footer: the EOF block, which for cram v3 is 38 bytes long. So, for 9 containers taken out of the middle of the .crai file:

@ seq3a[/tmp]; zcat _.cram.crai|head -400|tail
19  39697477    89695   38701731    208 80472
19  39787022    90904   38782440    202 79128
19  39877779    93587   38861799    206 94698
19  39971219    85377   38956732    200 80250
19  40056448    90739   39037211    204 79074
19  40147042    92103   39116518    206 88262
19  40239024    91659   39205015    210 85474
19  40330540    92800   39290728    208 94931
19  40423192    93575   39385896    202 73863
19  40516618    93779   39459990    204 78303
@ seq3a[/tmp]; expr 39459990 - 38701731
758259
@ seq3a[/tmp]; ls -l _.cram
-rw-r--r-- 1 jkb team117 65952821 Jan 19 18:01 _.cram
@ seq3a[/tmp]; expr 65952821 - 38
65952783
@ seq3a[/tmp]; (dd if=_.cram bs=1 count=7362; dd if=_.cram bs=1 skip=38701731 count=758259; dd if=_.cram bs=1 skip=65952783) > _sub.cram
7362+0 records in
7362+0 records out
7362 bytes (7.4 kB) copied, 0.0157573 s, 467 kB/s
758259+0 records in
758259+0 records out
758259 bytes (758 kB) copied, 0.858084 s, 884 kB/s
38+0 records in
38+0 records out
38 bytes (38 B) copied, 4.7836e-05 s, 794 kB/s
@ seq3a[/tmp]; samtools view _sub.cram | wc
  90000 1264796 34238109

Yes, it's hideous, but it demonstrates that in principle a cut-and-shunt of CRAM can work. However, I'm not sure we'll get time soon to make a proper API for doing container splitting in CRAM in a graceful way.

I'm not promoting this as an official way of parallelising CRAM processing: there's no check in here for the cram version number or for whether the EOF is valid, and even the .crai file format may change at some stage in the future. Hence the need for an abstracted API (one day). But meanwhile a home-brew hacky solution is possible, along the lines of the sketch below.
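To illustrate what such a home-brew solution might look like, here is a minimal sh sketch (my assumption of how to automate the dd stitching above, with a hypothetical script name; it assumes CRAM v3 with its 38-byte EOF block, one slice per container as in these examples, and the six-column .crai layout):

#!/bin/sh
# cram_slice.sh: copy containers FIRST..LAST (1-based, in .crai line
# order) out of a CRAM v3 file, preserving the headers and EOF block.
# Usage: sh cram_slice.sh in.cram FIRST LAST > out.cram
in=$1; first=$2; last=$3
idx="$in.crai"
size=$(wc -c < "$in")

# The headers span everything before the first container's offset.
hdr=$(zcat "$idx" | awk 'NR == 1 { print $4; exit }')

# Byte range of the requested containers: start of FIRST up to the
# start of LAST+1, or up to file size minus the 38-byte EOF block
# when the range runs to the final container.
start=$(zcat "$idx" | awk -v n="$first" 'NR == n { print $4; exit }')
end=$(zcat "$idx" | awk -v n="$((last + 1))" 'NR == n { print $4; exit }')
[ -z "$end" ] && end=$((size - 38))

# Stitch: headers, selected containers, EOF block.
dd if="$in" bs=1 count="$hdr" 2>/dev/null
dd if="$in" bs=1 skip="$start" count=$((end - start)) 2>/dev/null
dd if="$in" bs=1 skip=$((size - 38)) 2>/dev/null

With that, sh cram_slice.sh _.cram 391 399 > _sub.cram would reproduce the nine-container extraction above (the ten .crai lines shown are lines 391-400 of the index).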

jkbonfield · Jan 20 '16

@jkbonfield Can I resurrect this thread? The 100 Gbp+ data coming off the PacBio Sequel II and ONT PromethION mean that we are now getting large unaligned files at the start of the assembly process (BAM for PacBio at the moment, and maybe CRAM with https://github.com/EGA-archive/ont2cram for ONT). Many operations that were fine to parallelise per-run for the old Sequel I and MinION yields would benefit from being able to shard these now much larger files into chunks of N reads. The thread was originally about CRAM, but ideally this would work for BAM via the bai or csi index too...

mcshane · May 28 '19

This was implemented, but not in htslib. See io_lib's cram_filter.

It's a bit clunky, as it doesn't have an obvious way to tell you how many containers exist (other than that when you get zero reads back, I guess you know you're at the end), but for example:

$ samtools view -c /tmp/pb.cram
9999

$ cram_filter -n 5-10 /tmp/pb.cram - | samtools view -c
3878

Note you can do cram_index on an unsorted file and then zcat pb.cram.crai | wc -l to work out how many containers there are. (They start from zero for cram_filter.)
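Putting those two together, a rough sketch of sharding a whole file into fixed-size chunks (the chunk size and output names are arbitrary choices, and this assumes the inclusive 0-based ranges shown above and one .crai line per container):

# Shard pb.cram into chunks of 10 containers each via cram_filter.
n=$(zcat pb.cram.crai | wc -l)   # total containers
step=10
i=0
while [ "$i" -lt "$n" ]; do
    j=$((i + step - 1))
    [ "$j" -ge "$n" ] && j=$((n - 1))   # clamp the final chunk
    cram_filter -n "$i-$j" pb.cram "chunk_$i.cram"
    i=$((i + step))
done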

Doing it in BAM is harder. We can't index it if it isn't sorted and there's no way to do random access on unsorted files. In theory we could use the bzi index (from bgzip), but the BAM format has no way to indicate which blocks start on read boundaries and which do not. It's basically a bad format for this type of work.

Edit: I take that back; an unaligned file is "sorted" insofar as it'll permit bai/csi indexing. So maybe it's possible. I know people (not us) wrote tools to do chunking of BAM based on genomic coords and the indices, but I don't think anyone has managed it for arbitrary Nth-block-of-reads slices. Technically possible, but a mess.

jkbonfield · May 29 '19

Re BAM, this would probably be a use case for the proposed splitting index (see samtools/hts-specs#321), which hasn't yet been finalised (or implemented in HTSlib / samtools).

jmarshall · May 29 '19