ga4gh-server icon indicating copy to clipboard operation
ga4gh-server copied to clipboard

CRAM support feasibility

Open jeromekelleher opened this issue 9 years ago • 10 comments

An initial quick hack seems to imply that CRAM support might be quite easy to add. The only difficult bit is probably ensuring that the required references are present, and telling htslib where to find them. We will need to support CRAM at some point, so it would be good to do a quick feasibility analysis to see what we would need to do.

@andrewjesaitis or @dcolligan, would either of you like to take a look at this? The CRAM data from 1000 Genomes mapped to GRCh38 seems like a good place to start, e.g.,

ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/method_development/cram/lossy-bins-no-BQ-tag/HG01242

If it works out OK, we'd like to add support for CRAM to the server and update the download_data script to include some CRAM data and the corresponding reference information. Updating the headers to point to the new reference subset will be essential I think.

jeromekelleher avatar Feb 12 '16 12:02 jeromekelleher

Sure. I can take a look at supporting CRAM. I imagine it will be pretty similar to BAM files, just reading via htslib?

andrewjesaitis avatar Feb 12 '16 17:02 andrewjesaitis

If you run into any trouble supporting CRAM using pysam, please reach out to Andreas and me. We're happy to rapidly address any problems that you may find.

On Fri, Feb 12, 2016 at 10:05 AM, Andrew Jesaitis [email protected] wrote:

Sure. I can take a look at supporting CRAM. I imagine it will be pretty similar to BAM files, just reading via htslib?

— Reply to this email directly or view it on GitHub https://github.com/ga4gh/server/issues/928#issuecomment-183414706.

bioinformed avatar Feb 12 '16 18:02 bioinformed

Thanks @andrewjesaitis, that would be great. My initial playing with it suggests that nearly everything should 'just-work' by adding '*.cram' to the file lists when creating a HtslibReadGroupSet, but there were a few quirks.

Thanks @bioinformed, we'll keep you posted if we hit any issues.

jeromekelleher avatar Feb 15 '16 09:02 jeromekelleher

So I got this hacked in this morning. Turned out to be really simple, just had to make the server look for the cram files and make sure that pysam opens them with the c flag. It's located over in this commit.

I do get an AssertionError from the pagination code, that I think has to do with unmapped read handling. This occurred when I started querying over explicit genomic ranges.

By far the biggest difficulty was getting the data set up correctly. I'm sure some of this is my inexperience with cram files, but it seems that due to the querying and caching that htslib is doing under the hood the CRAM headers are extremely fragile. This posed a bit of a problem since the server expects the reference to be broken into a fasta per chromosome and the 1kg reads data was aligned against a single fasta (hs37d5.fa). Since I was only using reads on chr20, I subsetted the fasta to the single chromosome and I needed to update the header to point at the subsetted fasta. I couldn't modify the header in a way that samtools would accept. Eventually, I gave up and realigned the reads against my single chromosome fasta. After converting this BAM to CRAM, I was able to place it in the the datarepo and get things working.

Also, it's good to note that the cram files in the 1kg directory linked above were unreadable by htslib/pysam. I get a ValueError: file header is empty (mode='rc') - is it SAM/BAM format? when I try to read them. Ultimately I just grabbed a BAM and converted it to CRAM using samtools. I also tested converting using the cramtools and that worked too.

andrewjesaitis avatar Feb 27 '16 22:02 andrewjesaitis

Excellent, thanks @andrewjesaitis. OK, so I think a pre-requisite for CRAM support will be allowing references be stored in single FASTA files. I've opened issue #955 for this.

It would be good to know why the above CRAMs were unreadable in pysam. Would you mind digging into this a little bit more, and maybe opening an issue upstream if the problem persists? Converting to BAM and back won't really be an option if we want to deploy on the entire 1kg dataset!

jeromekelleher avatar Feb 29 '16 10:02 jeromekelleher

Turned out to be really simple, just had to […] make sure that pysam opens them with the c flag.

When opening for reading with plain "r", the underlying HTSlib auto-detects SAM/BAM/CRAM,¹ so you shouldn't have needed to do this part. IMHO this is a pysam bug: its read mode auto-detection has not been updated for HTSlib auto-detection, and to my mind it doesn't need the highlighted lines.

Or with the existing pysam, you should be able to get HTSlib'a auto-detection and not need to check the file extension just by writing mode explicitly: return pysam.AlignmentFile(dataFile, 'r').

¹ Or VCF/BCF if you're opening a variants file, etc

jmarshall avatar Feb 29 '16 11:02 jmarshall

I agree it would be nice to know a bit more about why CRAM support was failing in conditions other than having to provide an alternative reference file. @AndreasHeger is preparing the next pysam release, so any details would be very timely.

bioinformed avatar Feb 29 '16 14:02 bioinformed

The lines that @jmarshall identified can definitely go, they were used to auto-detect SAM or BAM in early versions of samtools. Will do this before the - very imminent - release.

See pysam-developers/pysam#242

AndreasHeger avatar Feb 29 '16 17:02 AndreasHeger

Just tried reading a cram with various flags using pysam 0.8.4. Seemed that only 'rc' worked. With no flag, 'r', and 'rb' I get a ValueError: fetch called on bamfile without index.

Currently are opening bam files right now without passing a flag, but if it auto-detected using just a 'r' flag that would be sufficient. I don't see a case were we are writing back to the bam file, is that right @jeromekelleher?

I think the deal with those 1kg files are that they simply don't have a header. Since the @SQ lines are so central to CRAM format, I think this is really an issue with how those files were generated. I'm not sure how widespread the issue is. The files I looked at were: HG01242.chrom11.ILLUMINA.bwa.PUR.exome.20111114_lossy-bins-no-BQ-tags.bam.cram HG01242.chrom20.ILLUMINA.bwa.PUR.exome.20111114_lossy-bins-no-BQ-tags.bam.cram

from: ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/method_development/cram/lossy-bins-no-BQ-tag/HG01242

andrewjesaitis avatar Feb 29 '16 17:02 andrewjesaitis

Currently are opening bam files right now without passing a flag, but if it auto-detected using just a 'r' flag that would be sufficient. I don't see a case were we are writing back to the bam file, is that right @jeromekelleher?

Yep, we have no intention of writing to the BAM/CRAM files ever as far as I can tell.

jeromekelleher avatar Feb 29 '16 18:02 jeromekelleher