gatk
Enable CSI index reading for bgzipped VCF files
Feature request
Tool(s) or class(es) involved
Any tools that read VCF, but specifically GenotypeGVCFs
Description
I'm working with genomes whose chromosomes are too long for both the BAI and tabix index formats. I'm working around the problem for BAMs by disabling on-the-fly index generation in Picard/GATK-based tools and then running samtools index --csi to generate the CSI index, which GATK will happily use.
Then I ran into the exact same problem with VCFs. If I'm using bgzipped VCFs, I have to disable index creation in the GATK, as it will fail when it hits a feature with a position higher than 512 * 2^20. It's possible to then generate a CSI index using (surprisingly) tabix. But I can't find a way to get the GATK to detect and use a CSI index for a bgzipped VCF. I think almost everything that is needed is already there in HTSJDK; it's just a case of auto-detecting the .csi index.
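To make "auto-detecting the .csi index" concrete, here is a minimal sketch of a sibling-index lookup. It is not how HTSJDK or GATK resolve indexes; the function name `find_vcf_index` and the preference order (CSI before tabix) are assumptions made for illustration.

```python
from pathlib import Path

# Assumed preference order: try the CSI index first, then fall back to tabix.
INDEX_SUFFIXES = (".csi", ".tbi")


def find_vcf_index(vcf_path):
    """Return the first index file found next to a bgzipped VCF, or None.

    Hypothetical helper: the index is expected to be named
    <vcf file name> + suffix, e.g. sample.g.vcf.gz.csi.
    """
    vcf_path = Path(vcf_path)
    for suffix in INDEX_SUFFIXES:
        candidate = vcf_path.with_name(vcf_path.name + suffix)
        if candidate.is_file():
            return candidate
    return None
```

A reader built along these lines would still need an index implementation that can parse CSI for VCF, which is the part the thread suggests may be missing.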
I'm working around this for now by using uncompressed VCFs as the .idx format doesn't have the same limit. But it's not great having uncompressed VCFs.
Bonus: it would be nice if the GATK auto-defaulted index creation for bgzipped VCFs to off if any of the sequences in the sequence dictionary is longer than is supported by tabix.
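The check behind that bonus suggestion could look roughly like the sketch below. It is written in Python for brevity rather than as GATK code, and it reads contig lengths from `##contig` header lines of a VCF instead of a real sequence dictionary object. The function names are hypothetical, and the header parsing is deliberately naive (it assumes no commas inside quoted values).

```python
import re

# The limit quoted above: 512 * 2^20 = 2^29 positions.
TABIX_MAX_POSITION = 512 * 2**20

CONTIG_LINE = re.compile(r"^##contig=<(.*)>\s*$")


def contig_lengths(header_lines):
    """Return {contig_id: length} parsed from ##contig header lines."""
    lengths = {}
    for line in header_lines:
        match = CONTIG_LINE.match(line)
        if not match:
            continue
        # Naive split: assumes no commas inside quoted field values.
        fields = dict(
            part.split("=", 1)
            for part in match.group(1).split(",")
            if "=" in part
        )
        if "ID" in fields and "length" in fields:
            lengths[fields["ID"]] = int(fields["length"])
    return lengths


def contigs_too_long_for_tabix(header_lines):
    """Return the sorted names of contigs longer than the tabix limit."""
    return sorted(
        name
        for name, length in contig_lengths(header_lines).items()
        if length > TABIX_MAX_POSITION
    )
```

If the returned list is non-empty, a tool could default index creation to off (or to CSI, once supported) instead of failing partway through writing.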
@tfenne I don't think htsjdk supports CSI for vcf. I'm pretty sure it was only wired up for bam.
Defaulting to false when the references are too big is a good idea.
Having the same issue here, and I follow essentially the same steps, with .csi and .idx indexing for BAM and .g.vcf files, respectively. @tfenne or anyone else, have you figured out a workaround for working with compressed VCF files that are properly indexed for large chromosomes (> 512 * 2^20)?
I would have to carry ~1000 uncompressed *.g.vcf files to GenomicsDBImport, and I simply don't have the disk space for that manoeuvre.
doi: 10.1093/gigascience/giab007
This is still a problem. Is anyone working on it?
I have the same problem with samtools indexing. To use the BAM with GATK I need a .bai index; a .csi index is not supported in GATK! Error from samtools index ${filenm_root}.cutad.sort.bam: [E::hts_idx_push] Region 537233901..537233984 cannot be stored in a bai index. Try using a csi index with min_shift = 14, n_lvls >= 6
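The numbers in that samtools message follow from how binning indexes are sized: an index with a given min_shift and n_lvls can address 2^(min_shift + 3 * n_lvls) positions, and BAI is fixed at min_shift = 14 with 5 levels, which gives the 512 * 2^20 limit. A small sketch of the arithmetic (the function names are hypothetical):

```python
def max_indexable_position(min_shift, n_lvls):
    """Number of positions addressable by a binning index with these settings."""
    return 2 ** (min_shift + 3 * n_lvls)


def min_levels_for(position, min_shift=14):
    """Smallest n_lvls whose index can hold the given end position."""
    n_lvls = 0
    while max_indexable_position(min_shift, n_lvls) < position:
        n_lvls += 1
    return n_lvls
```

For the region end in the error above, `min_levels_for(537233984)` gives 6, which matches the suggestion printed by samtools.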
It is time to solve ...?
I have the same problem when I try BaseRecalibrator. A USER ERROR has occurred: Can not read file://Users/....../file_name.vcf.tbi because no suitable codecs found
We also have the same issue. @nvnieuwk