glow
Cannot write INFO fields with LongType to VCF
Example:
import pyspark.sql.functions as f
import pyspark.sql.types as t

(
    spark.read.parquet(INPUT_PATH)
    .select(
        f.col("chrom").alias("contigName"),
        f.col("start"),
        f.col("end"),
        f.col("ref").alias("referenceAllele"),
        f.array(f.col("alt")).alias("alternateAlleles"),
        f.col("INFO_SVTYPE"),
        # This LongType INFO column is what triggers the failure below.
        f.col("INFO_END").astype(t.LongType()),
    )
    .write
    .format("vcf")
    .save(OUTPUT_PATH, mode="overwrite")
)
Fails with:
23/02/01 16:25:19 ERROR Executor: Exception in task 14.0 in stage 30.0 (TID 2513)
scala.MatchError: LongType (of class org.apache.spark.sql.types.LongType$)
at io.projectglow.vcf.VCFSchemaInferrer$.vcfDataType(VCFSchemaInferrer.scala:181)
at io.projectglow.vcf.VCFSchemaInferrer$.$anonfun$headerLinesFromSchema$2(VCFSchemaInferrer.scala:118)
at scala.collection.immutable.List.map(List.scala:297)
at io.projectglow.vcf.VCFSchemaInferrer$.headerLinesFromSchema(VCFSchemaInferrer.scala:116)
at io.projectglow.vcf.VCFHeaderUtils$.parseHeaderLinesAndSamples(VCFHeaderUtils.scala:74)
at io.projectglow.vcf.VCFOutputWriterFactory.newInstance(VCFFileFormat.scala:504)
at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.newOutputWriter(FileFormatDataWriter.scala:161)
at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.<init>(FileFormatDataWriter.scala:146)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:290)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$write$16(FileFormatWriter.scala:229)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:131)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1462)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
Hi @Hoeze
I had the same issue: the VCF writer does not support LongType() for INFO fields.
The workaround is to cast LongType() INFO fields to IntegerType() before writing,
e.g.
from pyspark.sql.types import IntegerType
import pyspark.sql.functions as fx

# Cast the 64-bit INFO column to a 32-bit integer so the VCF writer accepts it.
vcf_df = vcf_df.withColumn("INFO_test", fx.col("INFO_test").cast(IntegerType()))
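If several INFO columns are affected, the same cast can be applied to all of them in one pass. The following is only a sketch, not part of glow: the helper names `long_info_columns` and `cast_long_info_to_int` are made up for illustration, and it assumes INFO columns carry the `INFO_` prefix and that Spark reports LongType as `bigint` in `df.dtypes`. It only looks at top-level columns, so an `array<bigint>` INFO field would need separate handling.

```python
def long_info_columns(dtypes, prefix="INFO_"):
    """Return the names of INFO columns whose Spark type is LongType.

    `dtypes` is a list of (column name, type string) pairs, in the shape
    returned by `DataFrame.dtypes`; LongType shows up there as "bigint".
    """
    return [
        name
        for name, dtype in dtypes
        if name.startswith(prefix) and dtype == "bigint"
    ]


def cast_long_info_to_int(df, prefix="INFO_"):
    """Cast every LongType INFO column of a Spark DataFrame to a 32-bit int.

    Hypothetical helper for illustration; values outside the 32-bit signed
    range will not survive the cast intact.
    """
    # Imported here so the pure helper above can be used without pyspark.
    import pyspark.sql.functions as fx

    for name in long_info_columns(df.dtypes, prefix):
        df = df.withColumn(name, fx.col(name).cast("int"))
    return df
```

Calling `vcf_df = cast_long_info_to_int(vcf_df)` right before `.write.format("vcf")` should then avoid the MatchError for those columns, provided the values actually fit into 32 bits.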
I'll see if I can fix this.