cyvcf2 icon indicating copy to clipboard operation
cyvcf2 copied to clipboard

Writing a missing value for a FORMAT field

Open p69180 opened this issue 3 years ago • 7 comments

Hi,

Please understand that I do not have any experience with C or htslib.

I would like to write missing values ('.') to FORMAT fields.

I incidentally found that when the FORMAT field is of type 'Integer', entering a numpy array including -2147483648 results in missing value ('.'). But I could not find such a way with 'Float' type FORMAT field. I tried with numpy.nan, but the corresponding FORMAT field of output vcf file showed 'nan' instead of '.'.

Is there any way to write '.' in a FORMAT field?

Thanks

p69180 avatar Sep 28 '20 20:09 p69180

You can write 0x7F800001 or 2139095041 as the missing value.

brentp avatar Sep 28 '20 20:09 brentp

Thank you for a fast reply.

Unfortunately your suggestion did not work.

What I did:

import sys sys.path.append('/home/users/pjh/python/lib/python3.7/site-packages') import cyvcf2 import numpy as np

vcf_path = '/home/users/pjh/tmp/TEST.bcf.gz'

vcf = cyvcf2.VCF(vcf_path) vcf.add_format_to_header({'ID':'NEW', 'Type':'Float', 'Number':1, 'Description':'...'}) v = next(vcf) print(v) print()

NA = 0x7F800001 arr = np.array([[1.1], [NA]]) v.set_format('NEW', arr) print(v)


Output:

1 94533 . C A . PASS AS_FilterStatus=SITE;AS_SB_TABLE=38,64|6,3;DP=116;ECNT=1;GERMQ=93;MBQ=37,37;MFRL=396,385;MMQ=23,40;MPOS=30;NALOD=1.5;NLOD=9.01;POPAF=6;ROQ=65;TLOD=19.97 GT:AD:AF:DP:F1R2:F2R1:SB 0/1:72,9:0.12:81:39,5:33,4:24,48,6,3 0/0:30,0:0.031:30:16,0:14,0:14,16,0,0

1 94533 . C A . PASS AS_FilterStatus=SITE;AS_SB_TABLE=38,64|6,3;DP=116;ECNT=1;GERMQ=93;MBQ=37,37;MFRL=396,385;MMQ=23,40;MPOS=30;NALOD=1.5;NLOD=9.01;POPAF=6;ROQ=65;TLOD=19.97 GT:AD:AF:DP:F1R2:F2R1:SB:NEW 0/1:72,9:0.12:81:39,5:33,4:24,48,6,3:1.1 0/0:30,0:0.031:30:16,0:14,0:14,16,0,0:2.1391e+09

p69180 avatar Sep 28 '20 22:09 p69180

you can use:

v.set_format('NEW', np.array([1.1, np.nan], dtype=np.float32))

that shows "nan" instead of "." and I'm not sure why yet.

brentp avatar Sep 28 '20 22:09 brentp

On the topic of missing values for FORMAT fields:

gt_alt_depths returns -1 for missing values while format() - returns -2147483648, which is -(2^31) or all bits set signed int32 - I think numpy converts NaN to this when casting to int

davmlaw avatar Oct 06 '20 00:10 davmlaw

Is there any way forward on this? I can confirm that currently setting a float format field with np.nans results in 'nan' being written to the output file, and that setting the field with 0x7F800001 or 2139095041 just results in that number, and not any missing value.

I'd like setting values to np.nan to result in '.' being written, as '.' is currently read in as np.nan. If that's not possible, then perhaps we could pass a masked array to set_format() and the masked values would take on the missing value? That could also be a consistent solution across datatypes. But I'll take whichever is faster/easier to implement.

LiterallyUniqueLogin avatar Nov 13 '20 20:11 LiterallyUniqueLogin

@brentp thoughts?

LiterallyUniqueLogin avatar Nov 18 '20 18:11 LiterallyUniqueLogin

Hi, I'm not sure hwo to fix this. I'd be happy to accept a PR to resolve it.

brentp avatar Nov 19 '20 15:11 brentp