goleft icon indicating copy to clipboard operation
goleft copied to clipboard

covstats supporting cram

Open hdashnow opened this issue 6 years ago • 10 comments

Any chance of covstats supporting cram format? Unsurprisingly, when I tried I got an error:

$ goleft covstats input.cram

panic: gzip: invalid header

goroutine 1 [running]:
github.com/brentp/goleft/covstats.pcheck(0xa8ae60, 0xc4200d8410)
	/home/brentp/go/src/github.com/brentp/goleft/covstats/covstats.go:28 +0x4a
github.com/brentp/goleft/covstats.Main()
	/home/brentp/go/src/github.com/brentp/goleft/covstats/covstats.go:231 +0x913
main.main()
	/home/brentp/go/src/github.com/brentp/goleft/cmd/goleft/goleft.go:68 +0x17f```

hdashnow avatar Jun 21 '18 02:06 hdashnow

P.S. I think I've figured out how to calculate median coverage from the mosdepth output. It's not as fast as covstats, but still faster than most other tools out there.

hdashnow avatar Jun 21 '18 07:06 hdashnow

I think it should be straight-forward to support CRAM in covstats, I'll just do a system call to samtools view. I'll have a look at this next week if not sooner.

brentp avatar Jun 21 '18 13:06 brentp

Much appreciated, thank you!

hdashnow avatar Jun 22 '18 06:06 hdashnow

I just had a look at this and the cram index does not store the total number of mapped reads so it's not possible to estimate the coverage like that as it is for the bam index. we could iterate over a few cram slices, count reads, and not the byte offsets in the index, then use that rate and the total file size to estimate. this will be relatively accurate, but less-so than even the bam index estimate.

brentp avatar Jun 22 '18 15:06 brentp

Are there any updates on this? I've been trying to implement a workflow involving covstats, and when it runs on CRAM files, the coverage reports as zero. Other stats appear to be accurate.

aofarrel avatar Oct 26 '20 18:10 aofarrel

given the lack of cram parser in go, I don't plan to support this. It's possible to get actual coverage for a 30X WGS cram in < 5 minutes using mosdepth.

brentp avatar Oct 26 '20 19:10 brentp

Thanks for your reply. I've noticed that covstats seems to run slowly on crams, even small ones. When you say there's no cram parser in go... what is it doing for the other stats? Apologies if this is a silly question, I'm very new to go.

In other words -- is covstats running on crams supported, just not for coverage? Or should I consider crams as not supported, period?

aofarrel avatar Oct 26 '20 19:10 aofarrel

Actually, I forgot what it was doing. It makes a system call to samtools which converts cram to bam, then parses the bam. So the sampling stuff (the parts that you notice are filled out from covstats) work fine with cram. But the mapped reads is taken from the bam index and not present in the cram index so it can't estimate coverage.

brentp avatar Oct 26 '20 20:10 brentp

What would you recommend to very quickly extract mean/median coverage (+the insert size info) for CRAMs? 5 mins with mosdepth is pretty good but say we don't want that much info, just the median depth.

Would it make sense to slice the CRAM to a few random regions, convert to a small BAM and then use covstats? We'd need to do this manually outside covstats, right? I see a "Regions" option but I'm not sure how it's used in the code, including if the full CRAM would be converted to BAM before computing coverage on these regions or not.

jmonlong avatar Oct 27 '20 06:10 jmonlong

I suppose I could make it calculate the mean for the first part of the chromosome with covstats. As soon as you starting converting to bam, you're going to be going up in time, even if it's only part of the file.

brentp avatar Oct 27 '20 12:10 brentp