
dbSNP 156 VCF now includes non-32 bit integers causing "Extreme INFO/RS value encountered and set to missing" errors

Open freeseek opened this issue 2 years ago • 4 comments

As of release 156, dbSNP includes rsIDs larger than 2^31, which bcftools can no longer handle properly:

$ wget https://ftp.ncbi.nih.gov/snp/redesign/latest_release/VCF/GCF_000001405.40.gz{,.tbi}
$ tabix GCF_000001405.40.gz NC_000001.11:6259533-6259533
NC_000001.11	6259533	rs2148352434	C	T	.	.	RS=2148352434;dbSNPBuildID=156;SSR=0;GENEINFO=GPR153:387509;VC=SNV;INT;R5;GNO;FREQ=1000Genomes:0.9998,0.0001562
$ bcftools view -H GCF_000001405.40.gz -r NC_000001.11:6259533-6259533
[W::vcf_parse_info] Extreme INFO/RS value encountered and set to missing at NC_000001.11:6259533
NC_000001.11	6259533	rs2148352434	C	T	.	.	RS=.;dbSNPBuildID=156;SSR=0;GENEINFO=GPR153:387509;VC=SNV;INT;R5;GNO;FREQ=1000Genomes:0.9998,0.0001562
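
For reference, the arithmetic behind the warning can be checked directly in the shell. This is only a minimal sketch, assuming a shell with 64-bit integer arithmetic (bash, for example); the variable names are illustrative and not part of bcftools.

```shell
# Largest value that fits in a signed 32-bit integer (2^31 - 1).
int32_max=$(( (1 << 31) - 1 ))

# The rsID from the record shown above.
rs=2148352434

# 1 if the rsID exceeds the signed 32-bit range, 0 otherwise.
overflow=$(( rs > int32_max ))

echo "int32_max=$int32_max rs=$rs overflow=$overflow"
```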

If HTSlib is compiled with option -DVCF_ALLOW_INT64 then it works fine:

$ bcftools view -H GCF_000001405.40.gz -r NC_000001.11:6259533-6259533
NC_000001.11	6259533	rs2148352434	C	T	.	.	RS=2148352434;dbSNPBuildID=156;SSR=0;GENEINFO=GPR153:387509;VC=SNV;INT;R5;GNO;FREQ=1000Genomes:0.9998,0.0001562

However, the record can then no longer be written as binary VCF (BCF), which is a huge problem:

$ bcftools view -Ou GCF_000001405.40.gz -r NC_000001.11:6259533-6259533 | bcftools view -H
[E::bcf_write] Data at NC_000001.11:6259533 contains 64-bit values not representable in BCF. Please use VCF instead
[main_vcfview] Error: cannot write to (null)

Is there a discussion in samtools/hts-specs about updating the BCF specification to support 64-bit values?

freeseek avatar Jul 13 '23 17:07 freeseek

Changing the BCF specification is not an easy task and may take a long time even if there is the goodwill to do it. The problem could be addressed more easily on the dbSNP side if INFO/RS were a string rather than an integer.

pd3 avatar Jul 14 '23 08:07 pd3

Hi,

I am getting the same error when trying to annotate dbSNP 156. I understand from the discussion that this issue can't be fixed in the short term. Could you help me compile HTSlib with the option -DVCF_ALLOW_INT64? The documentation states that this option needs to be added manually in the makefile. I tried that and it's not working. I made this change in the makefile in the htslib-1.20 folder that ships with bcftools-1.20. Since I have no experience developing with C++ and make, could you please specify the exact changes to make in the makefile? Is this correct? CFLAGS = -g -Wall -O2 -fvisibility=hidden -DVCF_ALLOW_INT64=1

ShrutiBaikerikar avatar Apr 17 '24 12:04 ShrutiBaikerikar

Yes, that is correct: one must compile with -DVCF_ALLOW_INT64. To force recompilation of vcf.c, run touch vcf.c, check what the standard make command line looks like, and add -DVCF_ALLOW_INT64 to it. Note that this has not been terribly well tested; hopefully the code has not deteriorated too much.

Perhaps a simpler workaround is to edit the VCF header using the reheader command, changing the offending tag to Type=String:

bcftools view -h file.vcf.gz > hdr.txt
# edit hdr.txt and change the offending tag to Type=String
bcftools reheader -h hdr.txt -o new.bcf file.vcf.gz
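
The manual header edit in the steps above could also be scripted. The following is a rough sketch rather than a tested recipe: the helper name rs_to_string is hypothetical, the offending tag is assumed to be INFO/RS, and the sample header line is an assumption about how dbSNP declares that tag, so check it against your own header first.

```shell
# Rewrite the INFO/RS header line so the tag is declared as a string.
# Reads a VCF header on stdin and writes the edited header to stdout.
# (rs_to_string is a hypothetical helper, not a bcftools command.)
rs_to_string() {
    sed '/^##INFO=<ID=RS,/s/Type=Integer/Type=String/'
}

# Intended use, with the same placeholder file names as above:
#   bcftools view -h file.vcf.gz | rs_to_string > hdr.txt
#   bcftools reheader -h hdr.txt -o new.bcf file.vcf.gz

# Self-contained demonstration on an assumed sample header line.
sample='##INFO=<ID=RS,Number=1,Type=Integer,Description="dbSNP ID (i.e. rs number)">'
edited=$(printf '%s\n' "$sample" | rs_to_string)
echo "$edited"
```

Restricting the substitution to the line starting with ##INFO=<ID=RS, leaves every other Integer tag in the header untouched.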

pd3 avatar Apr 17 '24 13:04 pd3

Hi,

Thanks for the solutions. I tried recompiling after touch vcf.c with -DVCF_ALLOW_INT64 added in the makefile, but the error persisted.

The second solution, changing the tag to Type=String, worked, and I could successfully use bcftools view as well as bcftools annotate.

Thank you very much for your help.

ShrutiBaikerikar avatar Apr 20 '24 15:04 ShrutiBaikerikar