glow issues

Improve handling of environment variables in the pipe transformer

5

From https://github.com/projectglow/glow/blob/master/docs/source/tertiary/pipe-transformer.rst: > Options beginning with env_ are interpreted as environment variables. Like other options, the environment variable name is converted to lower snake case. For example, providing the option...

Hoeze
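
The lower snake case conversion described in the issue above can be sketched in plain Python. This is only an illustration, not Glow's implementation: the helper names `to_lower_snake_case` and `env_from_options` and the option `env_myVar` are hypothetical, and Glow's exact conversion rules may differ.

```python
import re


def to_lower_snake_case(name: str) -> str:
    """Convert a camelCase name to lower snake case (hypothetical helper)."""
    # Insert "_" at every lowercase/digit -> uppercase boundary, then lowercase.
    return re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name).lower()


def env_from_options(options: dict) -> dict:
    """Collect options prefixed with "env_" as environment variables.

    Assumption: the prefix is stripped before the name is converted.
    """
    prefix = "env_"
    return {
        to_lower_snake_case(key[len(prefix):]): value
        for key, value in options.items()
        if key.startswith(prefix)
    }


# Hypothetical options: only the "env_" entry becomes an environment variable.
env = env_from_options({"env_myVar": "VALUE", "cmd": "cat"})
```

Under these assumptions the variable name loses its original casing, which is likely the kind of surprise the issue is about for tools that expect upper-case environment variable names.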

VCF files with spaces in the file name cannot be read

3

This issue is similar to the issues reported in SPARK-21996 and SPARK-23148. Would [this](https://github.com/projectglow/glow/blob/v0.6.0/core/src/main/scala/io/projectglow/vcf/VCFFileFormat.scala#L219) line need to be modified to `val hPath = new Path(new URI(path))`? (probably also other places...

arunbhat
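
The general class of problem behind this issue, a percent-encoded URI string being treated as a raw file path, can be shown with the Python standard library. This is an analogy only: the issue concerns Scala and Hadoop's `Path`, the file name below is made up, and whether this is the exact failure mode in Glow is an assumption.

```python
from urllib.parse import unquote, urlparse


def uri_to_local_path(uri: str) -> str:
    """Decode a file URI into a local path (hypothetical helper)."""
    parsed = urlparse(uri)
    # The path component is still percent-encoded; decode it explicitly.
    return unquote(parsed.path)


# Made-up URI for a file whose name contains a space.
encoded_uri = "file:///data/my%20sample.vcf"

# Treating the URI's path component as a literal path keeps the "%20".
raw_path = urlparse(encoded_uri).path

# Decoding it recovers the real file name.
decoded_path = uri_to_local_path(encoded_uri)
```

Here `raw_path` points at a file that does not exist, while `decoded_path` contains the space as intended.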

Support whitespaces for variant datasources

1

Signed-off-by: William Brandler ## What changes are proposed in this pull request? The VCF reader does not support special characters such as whitespace, but the JSON and CSV datasource readers do. Right...

williambrandler

Simplify writing of sharded VCFs

3

From the [docs](https://glow.readthedocs.io/en/latest/etl/variant-data.html#vcf): > For the sharded VCF writer, the sample IDs are inferred from the first row of each partition and must be the same for each row. If...

Hoeze

Improve docs: Document usage of schema when reading/writing VCF

2

Super useful for cleaning up malformed VCFs. Example:

```python
import pyspark.sql.types as t  # assumed import; the excerpt uses the alias `t` without showing it

default_vcf_schema = t.StructType([
    t.StructField("contigName", t.StringType()),
    t.StructField("start", t.LongType()),
    t.StructField("end", t.LongType()),
    t.StructField("names", t.ArrayType(t.StringType())),
    t.StructField("referenceAllele", t.StringType()),
    t.StructField("alternateAlleles", t.ArrayType(t.StringType())),
    t.StructField("qual", t.DoubleType()),
    t.StructField("filters", t.ArrayType(t.StringType())),
    t.StructField("splitFromMultiAllelic", t.BooleanType()),
    # ... the remaining fields are truncated in the source excerpt
])
```

Hoeze

quarantine functionality in pipe transformer writes out full dataset

5

The new quarantine functionality in the pipe transformer will successfully run if there are corrupted records, however, As an example, I created an input dataframe of 961 rows. The expected...

williambrandler

Plink demo

4

Hi, is there a PLINK demo? I'm looking at the sample notebook provided on the documentation page, but I'm not seeing anything for loading and displaying a PLINK file....

dberma15

Set explicit guidelines when running large jobs

Spark jobs with lots of partitions can crash if the driver is too small, for example with the error, `Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Total size...

williambrandler

Glow implementation using Scala

1

The Glow documentation generally demonstrates Glow functionality using Python. The same holds for the talk given by Mr Amir Kermany and Mr Kiavash Kianfa, which was...

veeteehimself

ValueError: Some of types cannot be determined after inferring

This issue occurs if the phenotype data isn’t indexed by the sample ID. `phenotypes = pd.read_csv(quantitative_phenotypes_path, dtype={'sample_id': str}, index_col='sample_id')` Ideally, an exception would be thrown if no samples are found.

williambrandler
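
The check requested in the issue above, failing early when no samples match, could look roughly like the sketch below. The function name `shared_sample_ids` and the sample IDs are hypothetical, and this is not part of Glow. With pandas, the first argument would typically be `phenotypes.index`.

```python
def shared_sample_ids(phenotype_ids, genotype_ids):
    """Return sample IDs present in both inputs, in genotype order.

    Raises ValueError when nothing overlaps, for example when the
    phenotype table still carries a default integer index.
    """
    phenotype_set = {str(s) for s in phenotype_ids}
    shared = [str(s) for s in genotype_ids if str(s) in phenotype_set]
    if not shared:
        raise ValueError(
            "No phenotype rows match the genotype sample IDs; "
            "is the phenotype table indexed by sample ID?"
        )
    return shared


# Made-up IDs: the phenotype table is indexed by sample ID, so IDs overlap.
matched = shared_sample_ids(["HG001", "HG002"], ["HG002", "HG003"])

# A default integer index (0, 1, 2) shares nothing with the sample IDs.
try:
    shared_sample_ids(range(3), ["HG001", "HG002"])
    raised = False
except ValueError:
    raised = True
```

Raising a descriptive error at this point would surface the indexing mistake directly instead of a later type-inference failure.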
