glow split_multiallelics replaces some genotype fields with null and drops the filters column

The issue can be reproduced on chrY data from the 1000 Genomes Project (release 20130502). I hit it on io.projectglow:glow-spark3_2.12:0.6.0 and earlier versions. Please see the code and explanations below.

sourcefiles = "/mnt/1000-genomes/release/chrY.vcf.gz"
df = spark.read\
  .format("vcf")\
  .option("includeSampleIds", True)\
  .option("flattenInfoFields", True)\
  .load(sourcefiles)

Among other columns, df has these two:

filters:array
  element:string

and

genotypes:array 
  element:struct
    sampleId:string
    conditionalQuality:integer
    CNQ:double
    filters:array
      element:string
    posteriorProbabilities:array
      element:double
    phased:boolean
    calls:array
      element:integer
    phredLikelihoods:array
      element:integer
    CNL:array
      element:double
    CNP:array
      element:double
    CN:integer

Create the subset of CNV variants, where CNL or CNP is not null, and apply split_multiallelics:

import glow

cnv = df.where("INFO_SVTYPE = 'CNV'")
split_cnv = glow.transform("split_multiallelics", cnv)

Problem 1 - split_cnv no longer has the 'filters' column.

Check one of the multiallelic variants before and after the transformation:

display(cnv.where("start = 6543372"))

display(split_cnv.where("start = 6543372"))

Problem 2 - CNL and CNP values were replaced with null for all multiallelic variants.

Dec 02 '20 22:12 olesya13

@olesya13 Problem 1: I don't see this issue in your screenshots. Problem 2: Right now, the splitter only knows how to handle a few specific fields (GL, PL, GP, GT) as well as array fields where the number of elements equals the number of alts (Number=A). In this case, it looks like CNL has one element per possible genotype (Number=G). We could actually make this work in the common case, since the VCF reader puts the number in the column metadata. We'll look into it, but it might take a little while.
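To illustrate why a Number=G field is harder to split than a Number=A one, here is a minimal pure-Python sketch. It is not Glow's implementation: the function names are hypothetical, and it assumes the field follows the genotype ordering defined in the VCF specification (0/0, 0/1, 1/1, 0/2, 1/2, 2/2 for diploid calls), which may or may not hold for CNL in this dataset.

```python
from itertools import combinations_with_replacement
from math import comb


def num_genotypes(num_alts, ploidy=2):
    """Expected length of a Number=G array: unordered genotypes over ref + alts."""
    return comb(num_alts + ploidy, ploidy)


def genotype_order(num_alts, ploidy=2):
    """Genotypes as allele-index tuples, in VCF order (sorted by last allele first)."""
    combos = combinations_with_replacement(range(num_alts + 1), ploidy)
    return sorted(combos, key=lambda g: g[::-1])


def split_number_g(values, num_alts, alt, ploidy=2):
    """Keep only the entries whose genotype uses just the ref (0) and the given alt."""
    order = genotype_order(num_alts, ploidy)
    if len(values) != len(order):
        raise ValueError("array length does not match Number=G for these alleles")
    return [v for g, v in zip(order, values) if set(g) <= {0, alt}]


# Diploid site with two alts: six genotype entries, three kept per split row.
diploid = [10, 11, 12, 13, 14, 15]
first_alt = split_number_g(diploid, num_alts=2, alt=1)
second_alt = split_number_g(diploid, num_alts=2, alt=2)

# Haploid site (as on chrY) with two alts: one entry per allele.
haploid_second_alt = split_number_g([1, 2, 3], num_alts=2, alt=2, ploidy=1)
```

A Number=A array only needs the element at the alt's position, whereas here the indices to keep depend on both the ploidy and the total number of alleles, which is why the splitter needs to know the field's declared number.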

cc @kianfar77

Dec 04 '20 20:12 henrydavidge

@henrydavidge regarding Problem 1, cnv.dtypes gives

[('contigName', 'string'), ('start', 'bigint'), ('end', 'bigint'), ('names', 'array'), ('referenceAllele', 'string'), ('alternateAlleles', 'array'), ('qual', 'double'), ('filters', 'array'), ('splitFromMultiAllelic', 'boolean'), ('INFO_AC', 'array'), ('INFO_NS', 'int'), ('INFO_AFR_AF', 'array'), ('INFO_VT', 'array'), ('INFO_AN', 'int'), ('INFO_MULTI_ALLELIC', 'boolean'), ('INFO_SAS_AF', 'array'), ('INFO_AA', 'string'), ('INFO_AF', 'array'), ('INFO_EAS_AF', 'array'), ('INFO_AMR_AF', 'array'), ('INFO_DP', 'int'), ('INFO_END', 'int'), ('INFO_EUR_AF', 'array'), ('INFO_EX_TARGET', 'boolean'), ('INFO_SVTYPE', 'string'), ('genotypes', 'array<struct<sampleId:string,conditionalQuality:int,CNQ:double,filters:array,posteriorProbabilities:array,phased:boolean,calls:array,phredLikelihoods:array,CNL:array,CNP:array,CN:int>>')]

and split_cnv.dtypes gives

[('contigName', 'string'), ('start', 'bigint'), ('end', 'bigint'), ('names', 'array'), ('referenceAllele', 'string'), ('alternateAlleles', 'array'), ('qual', 'double'), ('splitFromMultiAllelic', 'boolean'), ('INFO_AC', 'array'), ('INFO_NS', 'int'), ('INFO_AFR_AF', 'array'), ('INFO_VT', 'array'), ('INFO_AN', 'int'), ('INFO_MULTI_ALLELIC', 'boolean'), ('INFO_SAS_AF', 'array'), ('INFO_AA', 'string'), ('INFO_AF', 'array'), ('INFO_EAS_AF', 'array'), ('INFO_AMR_AF', 'array'), ('INFO_DP', 'int'), ('INFO_END', 'int'), ('INFO_EUR_AF', 'array'), ('INFO_EX_TARGET', 'boolean'), ('INFO_SVTYPE', 'string'), ('INFO_OLD_MULTIALLELIC', 'string'), ('genotypes', 'array<struct<sampleId:string,conditionalQuality:int,CNQ:double,filters:array,posteriorProbabilities:array,phased:boolean,calls:array,phredLikelihoods:array,CNL:array,CNP:array,CN:int>>')]

The ('filters', 'array') column has disappeared.
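Comparing two long dtypes lists by eye is error-prone, so a small helper can do it instead. This is only a sketch: dropped_columns and added_columns are hypothetical names, and the lists below are abbreviated stand-ins for cnv.dtypes and split_cnv.dtypes, which have the same (name, type) tuple shape.

```python
def dropped_columns(before, after):
    """Column names present in `before` but missing from `after`, in original order."""
    after_names = {name for name, _ in after}
    return [name for name, _ in before if name not in after_names]


def added_columns(before, after):
    """Column names present in `after` but missing from `before`."""
    return dropped_columns(after, before)


# Abbreviated stand-ins for cnv.dtypes and split_cnv.dtypes.
before = [("qual", "double"), ("filters", "array"), ("splitFromMultiAllelic", "boolean")]
after = [("qual", "double"), ("splitFromMultiAllelic", "boolean"),
         ("INFO_OLD_MULTIALLELIC", "string")]

dropped = dropped_columns(before, after)
added = added_columns(before, after)
```

On the real DataFrames the call would be dropped_columns(cnv.dtypes, split_cnv.dtypes).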

Dec 08 '20 05:12 olesya13
