
Cannot write INFO fields with LongType to VCF

Open Hoeze opened this issue 2 years ago • 2 comments

Example:

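import pyspark.sql.functions as f
import pyspark.sql.types as t

# INPUT_PATH and OUTPUT_PATH are placeholders for the Parquet input and VCF output locations.
(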
    spark.read.parquet(INPUT_PATH)
    .select(
        f.col("chrom").alias("contigName"),
        f.col("start"),
        f.col("end"),
        f.col("ref").alias("referenceAllele"),
        f.array(f.col("alt")).alias("alternateAlleles"),
        f.col("INFO_SVTYPE"),
        f.col("INFO_END").astype(t.LongType()),
    )
    .write
    .format("vcf")
    .save(OUTPUT_PATH, mode="overwrite")
)

Fails with:

23/02/01 16:25:19 ERROR Executor: Exception in task 14.0 in stage 30.0 (TID 2513)
scala.MatchError: LongType (of class org.apache.spark.sql.types.LongType$)
	at io.projectglow.vcf.VCFSchemaInferrer$.vcfDataType(VCFSchemaInferrer.scala:181)
	at io.projectglow.vcf.VCFSchemaInferrer$.$anonfun$headerLinesFromSchema$2(VCFSchemaInferrer.scala:118)
	at scala.collection.immutable.List.map(List.scala:297)
	at io.projectglow.vcf.VCFSchemaInferrer$.headerLinesFromSchema(VCFSchemaInferrer.scala:116)
	at io.projectglow.vcf.VCFHeaderUtils$.parseHeaderLinesAndSamples(VCFHeaderUtils.scala:74)
	at io.projectglow.vcf.VCFOutputWriterFactory.newInstance(VCFFileFormat.scala:504)
	at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.newOutputWriter(FileFormatDataWriter.scala:161)
	at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.<init>(FileFormatDataWriter.scala:146)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:290)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$write$16(FileFormatWriter.scala:229)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:131)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1462)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:829)

Hoeze, Feb 01 '23 15:02

Hi @Hoeze

I had the same issue: the VCF writer does not support LongType() for INFO fields. The workaround is to cast any LongType() INFO fields to IntegerType().

e.g.

from pyspark.sql.types import *
import pyspark.sql.functions as fx

vcf_df = vcf_df.withColumn("INFO_test", fx.col("INFO_test").cast(IntegerType()))
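
If there are several affected columns, the same cast can be applied in a loop. This is only a rough sketch: the helper names `long_info_columns` and `cast_long_info_to_int` are made up for illustration and are not part of Glow, and it assumes INFO columns follow the `INFO_` prefix used in the example above. It relies on `df.dtypes` reporting LongType as "bigint". Note that values outside the 32-bit signed range cannot be represented after the cast, so check fields like INFO_END before relying on it.

```python
def long_info_columns(dtypes):
    """Return names of INFO_* columns whose Spark type is LongType.

    `dtypes` is the list of (column name, type string) pairs from `df.dtypes`,
    where LongType appears as "bigint".
    """
    return [
        name
        for name, dtype in dtypes
        if name.startswith("INFO_") and dtype == "bigint"
    ]


def cast_long_info_to_int(df):
    """Cast every LongType INFO_* column of a Spark DataFrame to IntegerType.

    Hypothetical helper, not a Glow API. Non-INFO columns such as `start`
    and `end` are left untouched.
    """
    # Imported here so the pure-Python helper above works without Spark installed.
    from pyspark.sql import functions as fx

    for name in long_info_columns(df.dtypes):
        df = df.withColumn(name, fx.col(name).cast("int"))
    return df
```

Usage would be something like `vcf_df = cast_long_info_to_int(vcf_df)` right before `.write.format("vcf")`.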

williambrandler, Feb 24 '23 21:02

I'll see if I can fix this.

henrydavidge, Feb 01 '24 04:02