nested NullType in genotypes produced in Hail interoperability function
Converting a Hail matrix table to a Glow DataFrame works fine, but writing the result out or applying downstream Glow functions to it fails:
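import glow
import hail as hl
from glow.hail import functions
from pyspark.sql.functions import col
# sc and display() come from the notebook environment
hl.init(sc, idempotent=True, quiet=True)
vcf_path = '/mnt/hail/data-001/1kg_sample.vcf.bgz'
vcf_mt = hl.import_vcf(vcf_path)
# simulated matrix table used for the reproduction
mt = hl.balding_nichols_model(n_populations=3,
                              n_samples=10,
                              n_variants=10)
# convert the Hail matrix table to a Glow-compatible Spark DataFrame
df = functions.from_matrix_table(mt, include_sample_ids=True)
display(df)
df.write.format("delta").save("dbfs:/tmp/test.delta")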
FatalError: AnalysisException: Found nested NullType in column names which is of ArrayType. Delta doesn't support writing NullType in complex types.
df2 = df.withColumn('values', glow.mean_substitute(glow.genotype_states(col('genotypes'))))
FatalError: AnalysisException: unresolved operator 'Project [contigName#2670, start#2671L, end#2672L, names#2673, referenceAllele#2674, alternateAlleles#2675, genotypes#2676, meansubstitute(genotypestates(genotypes#2676, None), -1) AS values#2702];
Note: I do not see any NullType or missing values in the dataframe. What is going on?
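The NullType in the error message is a property of the schema (the element type of the "names" array), not of the displayed rows, which may be why it is not visible in the data. A minimal sketch of how one could locate such columns, assuming the schema is available in Spark's JSON form (for example via df.schema.jsonValue()). The helper name find_nested_null_types and the sample schema are hypothetical and hand-written for illustration; both "null" and "void" are checked because the type name differs between Spark versions.

```python
def find_nested_null_types(schema_json):
    """Return paths of columns that contain a NullType inside a complex type.

    Hypothetical helper, not part of Glow, Hail, or Spark.
    """
    null_names = {"null", "void"}  # NullType is spelled differently across Spark versions
    hits = []

    def walk(dtype, path, nested):
        if isinstance(dtype, str):
            # Primitive type: report it only if it sits inside a complex type.
            if dtype in null_names and nested:
                hits.append(path)
            return
        kind = dtype.get("type")
        if kind == "struct":
            for field in dtype["fields"]:
                child = f"{path}.{field['name']}" if path else field["name"]
                # Fields of the root struct are top-level columns, not nested ones.
                walk(field["type"], child, nested or bool(path))
        elif kind == "array":
            walk(dtype["elementType"], path, True)
        elif kind == "map":
            walk(dtype["keyType"], path, True)
            walk(dtype["valueType"], path, True)

    walk(schema_json, "", False)
    return hits


# Hand-written schema resembling the one in this thread (an assumption, not real output).
example_schema = {
    "type": "struct",
    "fields": [
        {"name": "contigName", "type": "string", "nullable": True, "metadata": {}},
        {"name": "names",
         "type": {"type": "array", "elementType": "void", "containsNull": True},
         "nullable": True, "metadata": {}},
        {"name": "genotypes",
         "type": {"type": "array", "containsNull": True,
                  "elementType": {"type": "struct", "fields": [
                      {"name": "sampleId", "type": "string",
                       "nullable": True, "metadata": {}},
                      {"name": "calls",
                       "type": {"type": "array", "elementType": "integer",
                                "containsNull": True},
                       "nullable": True, "metadata": {}},
                  ]}},
         "nullable": True, "metadata": {}},
    ],
}

print(find_nested_null_types(example_schema))  # prints ['names']
```

On a real DataFrame the call would be find_nested_null_types(df.schema.jsonValue()).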
Calling .drop("names") removes the NullType error, but the "unresolved operator" error when calling mean_substitute remains.
Hey Brian, yeah, dropping "names" also worked for me. Then write to Delta, read it back in, and it works.
It looks like there are some unexpected interactions between Hail and Glow that interfere with the registered functions. As a workaround, it may be best to avoid using Hail and Glow together. As @williambrandler demonstrated, it seems to be safer to immediately checkpoint after extraction.
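For anyone landing here, the checkpoint workaround discussed in this thread can be sketched roughly as follows. This is only an illustration under assumptions: the helper name checkpoint_without_names is made up, df is the DataFrame returned by from_matrix_table, and spark is the active SparkSession. It relies only on the standard DataFrame reader and writer calls already used above.

```python
def checkpoint_without_names(df, spark, path):
    """Drop the "names" column, write to Delta, and read the table back.

    Hypothetical helper, not part of Glow or Hail.
    """
    # "names" is the array column whose NullType elements Delta refuses to write.
    cleaned = df.drop("names")
    cleaned.write.format("delta").save(path)
    # The returned DataFrame is backed by the Delta table rather than the Hail export.
    return spark.read.format("delta").load(path)


# Usage in the notebook from this thread (needs Spark, Glow, and Hail, so it is
# left commented out here):
# df = functions.from_matrix_table(mt, include_sample_ids=True)
# df = checkpoint_without_names(df, spark, "dbfs:/tmp/test.delta")
# df2 = df.withColumn("values",
#                     glow.mean_substitute(glow.genotype_states(col("genotypes"))))
```

Whether this also avoids the "unresolved operator" error on a given cluster is not guaranteed; it simply mirrors the sequence reported to work above.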
This will be fixed by https://github.com/projectglow/glow/pull/377 in Glow v1.1.0.
Glow v1.1.0 has been released, so I am closing this thread.
Note: if you still get
AnalysisException: Undefined function: 'nullif'. This function is neither a registered temporary function nor a permanent function registered in the database 'default'.; line 1 pos 0
then please run the export from Hail to Glow on a separate cluster.