BigBgen Conversion Produces Empty Probabilities Field

Open RonShakutai opened this issue 1 year ago • 1 comment

Description: I am encountering an issue when trying to convert a VCF file to BGEN format using the Glow library on Databricks in Azure.

Context: I am using the following code to read a VCF file, limit it to 500 rows, and then save it in BigBgen format. However, after loading the BGEN file, the probabilities field in the genotypes column is empty.

Issue: The probabilities field in the genotypes column of the generated BGEN file is empty when the file is loaded back into Spark. This issue occurs even when using a smaller subset of the VCF data.

Expected Behavior: The probabilities field should contain non-empty values corresponding to the genotype probabilities.

Environment: Databricks on Azure

Documentation Reference: I followed the BGEN conversion process outlined in the Glow documentation, but the outcome does not match the expected results.

Could you please help me understand why this might be happening and suggest a potential solution?

Attachment: Notebook.zip

Code:

import glow
from pyspark.sql.functions import expr

# Register Glow with the Spark session (`spark` is predefined in Databricks notebooks)
spark = glow.register(spark)

# Paths
vcf_path = "/databricks-datasets/genomics/1kg-vcfs/ALL.chr22.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz"
bgen_path = "/databricks-datasets/genomics/1kg-bgens/1kg_chr22.bgen"
output_path = '/tmp/bgenCreation'
bgen_file_path = f"{output_path}/oneKG22.bgen"

# Read VCF and write as BigBgen
vcf_df = spark.read.format("vcf").load(vcf_path)
limited_df = vcf_df.limit(500)
limited_df.write.format("bigbgen").mode("overwrite").save(bgen_file_path)

# Load BGEN file
loaded_bgen = spark.read.format("bgen").load(bgen_file_path)

# Displaying the loaded BGEN file shows empty probabilities field in genotypes
loaded_bgen.limit(20).display()

# Filtered DataFrame (if relevant)
filtered_df = loaded_bgen.filter(
    expr("exists(genotypes, g -> array_contains(g.calls, 1))")
)
display(filtered_df.limit(20))

# Loading the provided BGEN for comparison
loaded_bgen_exist = spark.read.format("bgen").load(bgen_path)
loaded_bgen_exist.limit(20).display()

Aug 27 '24 12:08 RonShakutai

Hello, I believe that the VCF you're reading doesn't have the GP field from which we get the BGEN probabilities. You will need to convert the genotype calls to probabilities.
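A minimal sketch of what that conversion could look like, written in plain Python so the logic is easy to check. It assumes unphased diploid calls and the VCF genotype ordering (for a call j/k with j <= k, the index is k*(k+1)/2 + j), and it puts all probability mass on the called genotype. The function name `calls_to_probabilities` and the choice to return an empty list for missing calls are illustrative assumptions, not part of Glow.

```python
from typing import List, Sequence


def calls_to_probabilities(calls: Sequence[int], num_alleles: int = 2) -> List[float]:
    """Turn one unphased diploid hard call into one-hot genotype probabilities.

    Hypothetical helper: probabilities are ordered as in the VCF
    specification (0/0, 0/1, 1/1, 0/2, 1/2, 2/2, ...).
    """
    if len(calls) != 2:
        raise ValueError("this sketch only handles diploid calls")
    if any(c < 0 for c in calls):
        # Assumption: a missing call (-1) yields no probabilities.
        return []
    if any(c >= num_alleles for c in calls):
        raise ValueError("allele index out of range for num_alleles")

    j, k = sorted(calls)
    num_genotypes = num_alleles * (num_alleles + 1) // 2
    probabilities = [0.0] * num_genotypes
    probabilities[k * (k + 1) // 2 + j] = 1.0
    return probabilities


print(calls_to_probabilities([0, 0]))  # [1.0, 0.0, 0.0]
print(calls_to_probabilities([1, 0]))  # [0.0, 1.0, 0.0]
print(calls_to_probabilities([1, 1]))  # [0.0, 0.0, 1.0]
```

In Spark, the same logic would have to be applied to every element of the genotypes array before writing, for example with a UDF or a SQL `transform`, storing the result in whichever genotype field Glow populates from GP (I believe this is `posteriorProbabilities`, but please check the schema). Note that the 1000 Genomes calls are phased, and phased probabilities are laid out per haplotype rather than per genotype, so the indexing above would need to be adapted.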

Aug 29 '24 01:08 henrydavidge