cyvcf2
cyvcf2 copied to clipboard
Writing a missing value for a FORMAT field
Hi,
Please understand that I do not have any experience with C or htslib.
I would like to write missing values ('.') to FORMAT fields.
I incidentally found that when the FORMAT field is of type 'Integer', entering a numpy array including -2147483648 results in missing value ('.'). But I could not find such a way with 'Float' type FORMAT field. I tried with numpy.nan, but the corresponding FORMAT field of output vcf file showed 'nan' instead of '.'.
Is there any way to write '.' in a FORMAT field?
Thanks
You can write 0x7F800001
or 2139095041
as the missing value.
Thank you for a fast reply.
Unfortunately your suggestion did not work.
What I did:
import sys sys.path.append('/home/users/pjh/python/lib/python3.7/site-packages') import cyvcf2 import numpy as np
vcf_path = '/home/users/pjh/tmp/TEST.bcf.gz'
vcf = cyvcf2.VCF(vcf_path) vcf.add_format_to_header({'ID':'NEW', 'Type':'Float', 'Number':1, 'Description':'...'}) v = next(vcf) print(v) print()
NA = 0x7F800001 arr = np.array([[1.1], [NA]]) v.set_format('NEW', arr) print(v)
Output:
1 94533 . C A . PASS AS_FilterStatus=SITE;AS_SB_TABLE=38,64|6,3;DP=116;ECNT=1;GERMQ=93;MBQ=37,37;MFRL=396,385;MMQ=23,40;MPOS=30;NALOD=1.5;NLOD=9.01;POPAF=6;ROQ=65;TLOD=19.97 GT:AD:AF:DP:F1R2:F2R1:SB 0/1:72,9:0.12:81:39,5:33,4:24,48,6,3 0/0:30,0:0.031:30:16,0:14,0:14,16,0,0
1 94533 . C A . PASS AS_FilterStatus=SITE;AS_SB_TABLE=38,64|6,3;DP=116;ECNT=1;GERMQ=93;MBQ=37,37;MFRL=396,385;MMQ=23,40;MPOS=30;NALOD=1.5;NLOD=9.01;POPAF=6;ROQ=65;TLOD=19.97 GT:AD:AF:DP:F1R2:F2R1:SB:NEW 0/1:72,9:0.12:81:39,5:33,4:24,48,6,3:1.1 0/0:30,0:0.031:30:16,0:14,0:14,16,0,0:2.1391e+09
you can use:
v.set_format('NEW', np.array([1.1, np.nan], dtype=np.float32))
that shows "nan" instead of "." and I'm not sure why yet.
On the topic of missing values for FORMAT fields:
gt_alt_depths
returns -1 for missing values while
format()
- returns -2147483648, which is -(2^31) or all bits set signed int32 - I think numpy converts NaN to this when casting to int
Is there any way forward on this? I can confirm that currently setting a float format field with np.nans results in 'nan' being written to the output file, and that setting the field with 0x7F800001
or 2139095041
just results in that number, and not any missing value.
I'd like setting values to np.nan to result in '.' being written, as '.' is currently read in as np.nan. If that's not possible, then perhaps we could pass a masked array to set_format() and the masked values would take on the missing value? That could also be a consistent solution across datatypes. But I'll take whichever is faster/easier to implement.
@brentp thoughts?
Hi, I'm not sure hwo to fix this. I'd be happy to accept a PR to resolve it.