Extracting blocks of unaligned reads from CRAM
I was wondering if htslib/samtools support extraction of blocks of unaligned reads similar to what a bgzipped FASTQ file that has been indexed with grabix currently allows.
This is relevant for processing a single FASTQ across many compute nodes by allowing each process (such as a read mapper) to extract windows of reads (such as "the third block of 1M reads") without having to seek through the whole FASTQ file first (due to the index). For us, the ability to do so is highly relevant to replace the aged FASTQ with unaligned CRAM for WGS read mapping.
Note that this is different from extracting reads that align to a window of the reference genome; I would rather expect to be able to extract the unaligned reads in the order of the FASTQ file that has been used for generating the unaligned CRAM. A nice addition would be an option for filtering by mate ID so that one could specify which end (or both) should be returned. May this perhaps be possible while also limiting IO by using the blocks/slices of the CRAM format?
On Tue, Jan 19, 2016 at 08:51:38AM -0800, Sven-Eric Schelhorn wrote:
> I was wondering if htslib/samtools support extraction of blocks of unaligned reads similar to what a bgzipped FASTQ file that has been indexed with grabix currently allows. [...]
It's doable in an around-the-houses sort of way.
If you build an index of the cram file then the .crai file contains the offsets of each cram container / slice. You can then produce valid cram streams simply by using dd on the input file to stitch the header block together with other blocks, taking care to handle the EOF case too. It's messy, but it demonstrates the principle.
I've done this manually to produce small cram files demonstrating bugs, where a huge file has the issue and I want to produce a smaller test case.
The .crai file is just gzipped text. I wouldn't guarantee that it will never change, but you could explore whether this technique is enough to cover your needs. If so then we can consider getting a more robust and supported implementation into htslib, to avoid external tools parsing the .crai themselves.
James
James -- thanks so much. This sounds promising, although the bits about producing valid cram streams with dd are beyond my CRAM skill level in terms of producing a proof-of-principle implementation for bcbio. I could do the crai parsing bits and test it if I had a better interface or guidance on how to do the retrieval. Thanks again for the ideas.
I admit it's esoteric! But here's a worked example.
$ zcat _.cram.crai
19 18868 136639 7362 210 103725
19 155366 95912 111322 212 93967
19 251145 86228 205527 214 120408
19 337223 88123 326176 214 118523
...[cut for brevity]...
19 62737042 84894 65556969 210 179619
19 62821820 83512 65736827 206 105559
19 62905183 60287 65842621 218 109915
These are reference numbers (counting from 0, so 19 is chr20 in this sample). Staden io_lib has a cram_dump utility (hideously written, primarily a debugging aid for me) that reveals a bit more about the structure of this file.
$ cram_dump _.cram
File definition structure
Major vers: 3
Minor vers: 0
file_id: /tmp/_.cram
BAM header:
@HD VN:1.4 SO:coordinate
...[rest of header]...
Container pos 7362 size 103935
Ref id: 19
Ref pos: 18868 + 136639
Rec counter: 0
No. recs: 10000
No. bases 1466522
No. blocks: 24
No. landmarks: 1
{210}
...
... [lots more data ending with last real container] ...
Container pos 65842621 size 110133
Ref id: 19
Ref pos: 62905183 + 60287
Rec counter: 6510000
No. recs: 6772
No. bases 995898
No. blocks: 24
No. landmarks: 1
{218}
...[followed by EOF block, actually an empty container itself]...
Container pos 65952783 size 15
Ref id: -1
Ref pos: 4542278 + 0
Rec counter: 0
No. recs: 0
No. bases 0
No. blocks: 1
No. landmarks: 0
{}
This is showing the reference locations as well as the file offsets. The reference location "id 19 pos 18868 + 136639" is the same data you're seeing in the first line of the index file, and the file location "pos 7362" is in there too. The container size is a bit different though: the difference in file offset between the 1st and 2nd containers is 111322-7362 == 103960, not the 103935 reported above, because only one of the two figures includes the container header structure itself. That isn't very helpful, so the easy solution is to ignore the size fields and just subtract consecutive start offsets.
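To make that concrete, here is a minimal shell sketch of turning .crai lines into container byte ranges by exactly that subtraction. It assumes (from the dumps above, not guaranteed) whitespace-separated columns with the container's absolute byte offset in column 4, and one slice per container so one .crai line per container:

$ zcat _.cram.crai | awk '
    NR > 1 { print NR-1, prev, $4 - prev }   # container number, offset, byte length
    { prev = $4 }'

The last container has no successor line, so its length has to come from the file size minus its offset minus the 38-byte EOF block, as in the dd example below.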
To stitch a valid cram file together, we need the CRAM header and BAM header (i.e. all the bytes before the first container - the first 7362 bytes in my example), a series of containers, and the footer - the EOF block, which for cram v3 is 38 bytes long. So for 9 containers taken out of the middle of the .crai file:
@ seq3a[/tmp]; zcat _.cram.crai|head -400|tail
19 39697477 89695 38701731 208 80472
19 39787022 90904 38782440 202 79128
19 39877779 93587 38861799 206 94698
19 39971219 85377 38956732 200 80250
19 40056448 90739 39037211 204 79074
19 40147042 92103 39116518 206 88262
19 40239024 91659 39205015 210 85474
19 40330540 92800 39290728 208 94931
19 40423192 93575 39385896 202 73863
19 40516618 93779 39459990 204 78303
@ seq3a[/tmp]; expr 39459990 - 38701731
758259
@ seq3a[/tmp]; ls -l _.cram
-rw-r--r-- 1 jkb team117 65952821 Jan 19 18:01 _.cram
@ seq3a[/tmp]; expr 65952821 - 38
65952783
@ seq3a[/tmp]; (dd if=_.cram bs=1 count=7362; dd if=_.cram bs=1 skip=38701731 count=758259; dd if=_.cram bs=1 skip=65952783) > _sub.cram
7362+0 records in
7362+0 records out
7362 bytes (7.4 kB) copied, 0.0157573 s, 467 kB/s
758259+0 records in
758259+0 records out
758259 bytes (758 kB) copied, 0.858084 s, 884 kB/s
38+0 records in
38+0 records out
38 bytes (38 B) copied, 4.7836e-05 s, 794 kB/s
@ seq3a[/tmp]; samtools view _sub.cram | wc
90000 1264796 34238109
Yes, it's hideous, but it demonstrates that cut-n-shunt of CRAM can work in principle. However, I'm not sure we'll get time to make a proper API for doing container splitting in CRAM in a graceful way any time soon.
I'm not promoting this as an official way of parallelising CRAM processing: there's no check in here for the cram version number or whether the EOF is valid, and even the .crai file format may change at some stage in the future. Hence the need for an abstracted API (one day). But meanwhile a home-brew hacky solution is possible.
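For anyone wanting to follow along, the whole recipe can be wrapped into a small script. This is a hacky sketch under the same caveats (cram v3 with a 38-byte EOF block, one slice per container, column 4 of the .crai holding the container's absolute byte offset); the script name and arguments are invented for illustration:

# cram_chunk.sh - sketch only. Writes containers FIRST..LAST (1-based .crai
# lines) of IN.cram to stdout as a standalone cram stream.
# Usage: sh cram_chunk.sh IN.cram FIRST LAST > sub.cram
cram=$1; first=$2; last=$3
hdr=$(zcat "$cram.crai" | awk 'NR==1 {print $4; exit}')              # bytes before container 1
start=$(zcat "$cram.crai" | awk -v n="$first" 'NR==n {print $4; exit}')
end=$(zcat "$cram.crai" | awk -v n="$((last + 1))" 'NR==n {print $4; exit}')
eof=$(( $(wc -c < "$cram") - 38 ))                                   # offset of the EOF block
[ -z "$end" ] && end=$eof                                            # last chunk stops at the EOF block
dd if="$cram" bs=1 count="$hdr" 2>/dev/null                          # CRAM + BAM headers
dd if="$cram" bs=1 skip="$start" count=$((end - start)) 2>/dev/null  # the wanted containers
dd if="$cram" bs=1 skip="$eof" 2>/dev/null                           # trailing EOF block

With the file above, sh cram_chunk.sh _.cram 391 399 > _sub.cram should reproduce the 90000-read subset.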
@jkbonfield Can I resurrect this thread? The 100Gbp+ data coming off the PacBio Sequel II and ONT PromethION means that we are getting large unaligned files to begin the assembly process (BAM for PacBio at the moment, and maybe CRAM with https://github.com/EGA-archive/ont2cram for ONT). Many operations that were fine to parallelise per-run for the old Sequel I and MinION yields would benefit from being able to shard these now much larger files into chunks of N reads. The thread was originally about CRAM, but ideally this would work for BAM via the bai or csi index too...
This was implemented, but not in htslib. See io_lib's cram_filter.
It's a bit clunky, as it doesn't have an obvious way to tell you how many containers exist (other than, I guess, knowing you're at the end when you get zero reads back), but for example:
$ samtools view -c /tmp/pb.cram
9999
$ cram_filter -n 5-10 /tmp/pb.cram - | samtools view -c
3878
Note you can run cram_index on an unsorted file and then zcat pb.cram.crai | wc -l to work out how many containers there are. (They are numbered from zero for cram_filter.)
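So a simple sharding loop might look like the sketch below. It assumes one slice per container (hence one .crai line per container) and uses the in.cram out.cram argument order shown above; the chunk size and file names are made up:

# Sketch: shard pb.cram into chunks of 10 containers each.
cram_index pb.cram
ncont=$(zcat pb.cram.crai | wc -l)       # containers, assuming one slice each
i=0
while [ $((i * 10)) -lt "$ncont" ]; do
    cram_filter -n $((i * 10))-$((i * 10 + 9)) pb.cram chunk_$i.cram
    i=$((i + 1))
done

Each chunk_$i.cram is then a standalone CRAM that a downstream job can read independently.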
Doing it in BAM is harder. We can't index it if it isn't sorted and there's no way to do random access on unsorted files. In theory we could use the bzi index (from bgzip), but the BAM format has no way to indicate which blocks start on read boundaries and which do not. It's basically a bad format for this type of work.
Edit: I take that back, an unaligned file is "sorted" in as far as it'll permit bai/csi indexing. So maybe it's possible. I know people (not us) wrote tools to do chunking of BAM based on genomic coords and the indices, but I don't think anyone managed it for arbitrary slices of Nth read. Technically possible, but a mess.
Re BAM, this would probably be a use case for the proposed splitting index (see samtools/hts-specs#321) that hasn't been finalised (or implemented in HTSlib / samtools).