Enable CSI index reading for bgzipped VCF files

Open tfenne opened this issue 5 years ago • 7 comments

Feature request

Tool(s) or class(es) involved

Any tools that read VCF, but specifically GenotypeGVCFs

Description

I'm working with genomes whose chromosomes are too long for both the BAI and tabix index formats. For BAMs, I work around the problem by disabling on-the-fly index generation in Picard/GATK-based tools and then running samtools index --csi to generate a CSI index, which GATK will happily use.

Then I ran into the exact same problem with VCFs. With bgzipped VCFs I have to disable index creation in GATK, because it fails when it hits a feature at a position higher than 512 * 2^20 (that is, 2^29 = 536,870,912). It is then possible to generate a CSI index using (surprisingly) tabix, but I can't find a way to get GATK to detect and use a CSI index for a bgzipped VCF. I think almost everything needed is already in HTSJDK, and it's just a case of auto-detecting the .csi index.
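For readers wondering where the 512 * 2^20 figure comes from, here is a minimal sketch. It assumes the binning scheme described in the SAM/CSI index specifications, in which an index covers 2^(min_shift + 3 * n_lvls) bases and BAI/tabix fix min_shift = 14 and n_lvls = 5. The function names are hypothetical, and the exact boundary handling (whether a position equal to the limit is accepted) is an assumption.

```python
# Sketch only: capacity of a binning index, assuming the formula
# 2 ** (min_shift + 3 * n_lvls) from the CSI specification.
# BAI and tabix are assumed to use the fixed values below.
TABIX_MIN_SHIFT = 14
TABIX_N_LVLS = 5


def max_indexable_length(min_shift: int = TABIX_MIN_SHIFT,
                         n_lvls: int = TABIX_N_LVLS) -> int:
    """Return the number of bases the index can address per sequence."""
    return 1 << (min_shift + 3 * n_lvls)


def fits_in_tabix(position: int) -> bool:
    """Hypothetical helper: True if a 1-based position is within the limit."""
    return position <= max_indexable_length()


if __name__ == "__main__":
    limit = max_indexable_length()
    print(limit, limit == 512 * 2**20)
    print(fits_in_tabix(100_000_000), fits_in_tabix(600_000_000))
```

Under these assumptions the limit works out to the same 512 * 2^20 mentioned above, so any chromosome longer than that cannot be fully covered by a .tbi or .bai index.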

I'm working around this for now by using uncompressed VCFs as the .idx format doesn't have the same limit. But it's not great having uncompressed VCFs.

Bonus: it would be nice if the GATK auto-defaulted index creation for bgzipped VCFs to off if any of the sequences in the sequence dictionary is longer than is supported by tabix.
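The check behind that bonus suggestion could look roughly like the sketch below. It is written in Python for brevity rather than against GATK or HTSJDK, it reads `@SQ` lines in SAM-header/.dict style, and the function name and example contigs are hypothetical. The 2^29 limit is the assumed tabix capacity.

```python
# Sketch only: flag sequences in a sequence dictionary that are too long
# for a tabix index. Not GATK code; names and sample data are hypothetical.
TABIX_MAX_LENGTH = 512 * 2**20  # assumed tabix/BAI capacity, 2**29


def contigs_exceeding_tabix(header_lines):
    """Return (name, length) for every @SQ entry longer than the limit."""
    too_long = []
    for line in header_lines:
        if not line.startswith("@SQ"):
            continue
        # Fields look like "SN:chr1" and "LN:248956422", tab-separated.
        fields = dict(
            field.split(":", 1)
            for field in line.rstrip("\n").split("\t")[1:]
            if ":" in field
        )
        if "SN" not in fields or "LN" not in fields:
            continue
        length = int(fields["LN"])
        if length > TABIX_MAX_LENGTH:
            too_long.append((fields["SN"], length))
    return too_long


if __name__ == "__main__":
    example = [
        "@HD\tVN:1.6",
        "@SQ\tSN:chr1\tLN:248956422",
        "@SQ\tSN:chrBig\tLN:700000000",
    ]
    offenders = contigs_exceeding_tabix(example)
    # If anything is listed, on-the-fly tabix indexing should default to off.
    print(offenders)
```

A tool could run a check like this once against the reference dictionary and only then decide whether to attempt index creation for bgzipped output.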

tfenne avatar Aug 22 '19 22:08 tfenne

@tfenne I don't think htsjdk supports CSI for vcf. I'm pretty sure it was only wired up for bam.

Defaulting to false when the references are too big is a good idea.

lbergelson avatar Aug 23 '19 19:08 lbergelson

Having the same issue here. I follow essentially the same steps, using .csi indexing for BAM files and .idx indexing for .g.vcf files. @tfenne or anyone else, have you figured out a workaround for working with compressed VCF files that are properly indexed for large chromosomes (> 512 * 2^20)?

I would have to carry ~1000 uncompressed *.g.vcf files to GenomicsDBImport, and I simply don't have the disk space for that manoeuvre.

frabanal avatar May 05 '20 16:05 frabanal

doi: 10.1093/gigascience/giab007

shenweima avatar Sep 17 '21 04:09 shenweima

This is still a problem. Is anyone working on it?

ClayBirkett avatar Feb 11 '22 15:02 ClayBirkett

I have the same problem with samtools indexing. To use the BAM with GATK I need a .bai index; a .csi index is not supported in GATK! Running samtools index ${filenm_root}.cutad.sort.bam fails with this error:

[E::hts_idx_push] Region 537233901..537233984 cannot be stored in a bai index. Try using a csi index with min_shift = 14, n_lvls >= 6
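The "n_lvls >= 6" hint in that error can be reproduced with a short sketch. It assumes each extra level of the index multiplies the addressable span by 8, starting from 2^min_shift. The function name is hypothetical, and this is an illustration rather than the actual htslib code.

```python
# Sketch only: smallest number of CSI levels whose span covers max_length,
# assuming span = 2 ** (min_shift + 3 * n_lvls). Not the htslib source.
def min_csi_levels(max_length: int, min_shift: int = 14) -> int:
    n_lvls = 0
    span = 1 << min_shift
    while max_length > span:
        span <<= 3  # each level adds three bits of addressable range
        n_lvls += 1
    return n_lvls


if __name__ == "__main__":
    # End coordinate taken from the error message above.
    print(min_csi_levels(537233984))
```

With min_shift = 14, five levels cover 2^29 bases, which is just short of the region in the error, so a sixth level is needed.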

gvarmaslu avatar Feb 22 '22 09:02 gvarmaslu

Isn't it time to solve this?

shenweima avatar Feb 23 '22 00:02 shenweima

I have the same problem when I try BaseRecalibrator:

A USER ERROR has occurred: Can not read file://Users/....../file_name.vcf.tbi because no suitable codecs found

Floating-Element avatar Jun 17 '22 05:06 Floating-Element

We also have the same issue. @nvnieuwk

matthdsm avatar Feb 17 '23 10:02 matthdsm