glow
split_multiallelics replaces some genotype fields with null and drops the filters column
The issue can be reproduced on chrY data from the 1000 Genomes Project (release 20130502). I observed it with io.projectglow:glow-spark3_2.12:0.6.0
and earlier versions. Code and explanations follow.
sourcefiles = "/mnt/1000-genomes/release/chrY.vcf.gz"
df = spark.read\
.format("vcf")\
.option("includeSampleIds", True)\
.option("flattenInfoFields", True)\
.load(sourcefiles)
Among other columns, df has these two:
filters:array
  element:string
and
genotypes:array
  element:struct
    sampleId:string
    conditionalQuality:integer
    CNQ:double
    filters:array
      element:string
    posteriorProbabilities:array
      element:double
    phased:boolean
    calls:array
      element:integer
    phredLikelihoods:array
      element:integer
    CNL:array
      element:double
    CNP:array
      element:double
    CN:integer
Select the CNV records (the ones where CNL or CNP is not null) and apply split_multiallelics:
import glow
cnv = df.where("INFO_SVTYPE = 'CNV'")
split_cnv = glow.transform("split_multiallelics", cnv)
Problem 1 - split_cnv doesn't have column 'filters' anymore.
Check one of the multiallelic variants before and after the transformation
display(cnv.where("start = 6543372"))
display(split_cnv.where("start = 6543372"))
Problem 2 - CNL and CNP values were replaced by null for all multiallelic variants.
@olesya13
Problem 1: I don't see this issue in your screenshots.
Problem 2: Right now, the splitter only knows how to handle a few specific fields (GL, PL, GP, GT) as well as array fields where the number of elements equals the number of alts (Number=A). In this case, it looks like CNL has one element per possible genotype (Number=G). We could actually make this work in the common case since the VCF reader puts the number in the column metadata. We'll look into it, but it might take a little while.
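To make the Number=A versus Number=G distinction concrete, here is a minimal pure-Python sketch of how each kind of array could be reduced when a multiallelic site is split into one row per alt. It is not Glow's implementation: the function names and the `ploidy` parameter are hypothetical, and the only thing taken from the VCF specification is the genotype ordering (genotype j/k with j <= k sits at index k*(k+1)/2 + j).

```python
from typing import List, Sequence


def genotype_index(j: int, k: int) -> int:
    """Index of unphased diploid genotype j/k (j <= k) in a Number=G array."""
    return k * (k + 1) // 2 + j


def split_number_a(values: Sequence[float], alt_index: int) -> List[float]:
    """Number=A: one value per alt, so keep the value of the 1-based alt_index."""
    return [values[alt_index - 1]]


def split_number_g(values: Sequence[float], alt_index: int, ploidy: int = 2) -> List[float]:
    """Number=G: keep only genotypes made of the reference and the chosen alt."""
    if ploidy == 1:
        # Haploid: one value per allele, ordered ref, alt1, alt2, ...
        keep = [0, alt_index]
    elif ploidy == 2:
        # Diploid: keep 0/0, 0/alt and alt/alt.
        keep = [
            genotype_index(0, 0),
            genotype_index(0, alt_index),
            genotype_index(alt_index, alt_index),
        ]
    else:
        raise ValueError("this sketch only covers ploidy 1 and 2")
    return [values[i] for i in keep]


# Two alts, diploid: values are ordered 0/0, 0/1, 1/1, 0/2, 1/2, 2/2.
likelihoods = [10, 11, 12, 13, 14, 15]
print(split_number_g(likelihoods, alt_index=1))  # [10, 11, 12]
print(split_number_g(likelihoods, alt_index=2))  # [10, 13, 15]
```

The sketch simply drops the values for genotypes that involve other alts. A real splitter would also have to decide whether the remaining values need renormalizing, which depends on what the field means.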
cc @kianfar77
@henrydavidge
regarding Problem 1
cnv.dtypes
gives
[('contigName', 'string'), ('start', 'bigint'), ('end', 'bigint'), ('names', 'array<string>'), ('referenceAllele', 'string'), ('alternateAlleles', 'array<string>'), ('qual', 'double'), ('filters', 'array<string>'), ('splitFromMultiAllelic', 'boolean'), ('INFO_AC', 'array'), ('INFO_NS', 'int'), ('INFO_AFR_AF', 'array'), ('INFO_VT', 'array'), ('INFO_AN', 'int'), ('INFO_MULTI_ALLELIC', 'boolean'), ('INFO_SAS_AF', 'array'), ('INFO_AA', 'string'), ('INFO_AF', 'array'), ('INFO_EAS_AF', 'array'), ('INFO_AMR_AF', 'array'), ('INFO_DP', 'int'), ('INFO_END', 'int'), ('INFO_EUR_AF', 'array'), ('INFO_EX_TARGET', 'boolean'), ('INFO_SVTYPE', 'string'), ('genotypes', 'array<struct<sampleId:string,conditionalQuality:int,CNQ:double,filters:array<string>,posteriorProbabilities:array<double>,phased:boolean,calls:array<int>,phredLikelihoods:array<int>,CNL:array<double>,CNP:array<double>,CN:int>>')]
and split_cnv.dtypes
gives
[('contigName', 'string'), ('start', 'bigint'), ('end', 'bigint'), ('names', 'array<string>'), ('referenceAllele', 'string'), ('alternateAlleles', 'array<string>'), ('qual', 'double'), ('splitFromMultiAllelic', 'boolean'), ('INFO_AC', 'array'), ('INFO_NS', 'int'), ('INFO_AFR_AF', 'array'), ('INFO_VT', 'array'), ('INFO_AN', 'int'), ('INFO_MULTI_ALLELIC', 'boolean'), ('INFO_SAS_AF', 'array'), ('INFO_AA', 'string'), ('INFO_AF', 'array'), ('INFO_EAS_AF', 'array'), ('INFO_AMR_AF', 'array'), ('INFO_DP', 'int'), ('INFO_END', 'int'), ('INFO_EUR_AF', 'array'), ('INFO_EX_TARGET', 'boolean'), ('INFO_SVTYPE', 'string'), ('INFO_OLD_MULTIALLELIC', 'string'), ('genotypes', 'array<struct<sampleId:string,conditionalQuality:int,CNQ:double,filters:array<string>,posteriorProbabilities:array<double>,phased:boolean,calls:array<int>,phredLikelihoods:array<int>,CNL:array<double>,CNP:array<double>,CN:int>>')]
The ('filters', 'array<string>') entry is missing from the second list.
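Comparing two long dtypes lists by eye is error-prone, so a small helper can report the difference directly. This is a minimal sketch: `compare_dtypes` is a hypothetical name, and the two lists below are abbreviated stand-ins for the real outputs above, not the full schemas.

```python
from typing import Dict, List, Tuple

Dtypes = List[Tuple[str, str]]


def compare_dtypes(before: Dtypes, after: Dtypes) -> Dict[str, List[str]]:
    """Report columns that were dropped, added, or changed type between two dtypes lists."""
    before_map = dict(before)
    after_map = dict(after)
    return {
        "dropped": [name for name, _ in before if name not in after_map],
        "added": [name for name, _ in after if name not in before_map],
        "retyped": [
            name for name, dtype in before
            if name in after_map and after_map[name] != dtype
        ],
    }


# Abbreviated stand-ins for cnv.dtypes and split_cnv.dtypes.
before = [("contigName", "string"), ("filters", "array<string>"), ("INFO_SVTYPE", "string")]
after = [("contigName", "string"), ("INFO_SVTYPE", "string"), ("INFO_OLD_MULTIALLELIC", "string")]

print(compare_dtypes(before, after))
# {'dropped': ['filters'], 'added': ['INFO_OLD_MULTIALLELIC'], 'retyped': []}
```

With the DataFrames from this thread, the call would be `compare_dtypes(cnv.dtypes, split_cnv.dtypes)`, since a PySpark DataFrame's `dtypes` attribute is already a list of (name, type) tuples.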