Plink demo

Open dberma15 opened this issue 3 years ago • 4 comments

Hi,

Is there a plink demo? I'm looking at the sample notebook provided on the documentation page, but I'm not seeing anything for loading in and displaying a plink file.

Thanks.

dberma15 avatar Dec 06 '21 20:12 dberma15

Hey @dberma15, plink binary PED files can be read with Glow. What is your use case for plink files, and is the data in any other format (VCF / BGEN)?

Cheers

williambrandler avatar Dec 07 '21 02:12 williambrandler

Hi @williambrandler, right now I'm just trying to get a demo working. I've found four .bed files on Databricks:

/databricks-datasets/genomics/grch37/snpEff/examples/intervals.bed
/databricks-datasets/genomics/grch37/snpEff/examples/my_annotations.bed
/databricks-datasets/genomics/grch38/snpEff/examples/intervals.bed
/databricks-datasets/genomics/grch38/snpEff/examples/my_annotations.bed

Each time I try to run the following code:

df = spark.read.format("plink").load(path + ".bed")
display(df.limit(10))

I get the following error:

FileReadException: Error while reading file dbfs:/databricks-datasets/genomics/grch37/snpEff/examples/intervals.bed. It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved. If Delta cache is stale or the underlying files have been removed, you can invalidate Delta cache manually by restarting the cluster.
Caused by: FileNotFoundException: No such file or directory: s3a://databricks-datasets-oregon/genomics/grch37/snpEff/examples/intervals.fam

Meanwhile, I do not get an error if I run:

vcf_path = "/databricks-datasets/genomics/1kg-vcfs/ALL.chr22.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz"
df = spark.read.format("vcf").load(vcf_path)
display(df.limit(10))

dberma15 avatar Dec 08 '21 13:12 dberma15

Ah, I believe those SnpEff input files are in Browser Extensible Data (BED) format, not plink binary PED (BED) format, which awkwardly has the same suffix.

You can read Browser Extensible Data (BED) format as a tab-delimited CSV file. The plink binary PED (BED) format expects associated .fam and .bim files next to the .bed file, which is why the reader went looking for intervals.fam. To learn more, see the plink docs.
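
To tell the two .bed flavours apart before handing a file to a reader, a rough sketch like the one below may help. It relies on the fact that a plink binary .bed file starts with the two magic bytes 0x6c 0x1b, and it checks for the .bim and .fam files alongside it. The function name `classify_bed` and its return labels are made up for illustration; they are not part of Glow or plink, and the sketch assumes the file is reachable through a local filesystem path.

```python
from pathlib import Path

# The first two bytes of a plink binary .bed file are always 0x6c 0x1b.
PLINK_BED_MAGIC = b"\x6c\x1b"


def classify_bed(path):
    """Classify a local .bed file (hypothetical helper, not a Glow API).

    Returns one of:
      "browser-bed"             no plink magic bytes, likely a text BED file
      "plink-missing-sidecars"  plink magic bytes, but .bim or .fam is absent
      "plink"                   plink magic bytes and both sidecar files exist
    """
    bed = Path(path)
    with bed.open("rb") as handle:
        is_plink = handle.read(2) == PLINK_BED_MAGIC
    if not is_plink:
        return "browser-bed"

    # plink readers expect <prefix>.bim and <prefix>.fam beside <prefix>.bed.
    sidecars = [bed.with_suffix(ext) for ext in (".bim", ".fam")]
    if all(sidecar.exists() for sidecar in sidecars):
        return "plink"
    return "plink-missing-sidecars"
```

If a file comes back as "browser-bed", a tab-separated read such as `spark.read.csv(path, sep="\t")` is one way to load it. If it comes back as "plink-missing-sidecars", the plink reader would probably fail with a missing .fam or .bim error similar to the one above.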

williambrandler avatar Dec 08 '21 21:12 williambrandler

@williambrandler that would explain it. I'll try to find some plink files to try this out with and let you know how it goes.

dberma15 avatar Dec 09 '21 14:12 dberma15