VariantSpark issues

Error running mvn clean install

2

Hello, I am trying to install VariantSpark on a CentOS 7 box with JDK 1.8, Scala 2.3.1, and Spark 2.1.1. When I run mvn clean install, the following test fails...

vishaln79

AIR - unbiased gini-based importance score - Algorithm

4

This procedure provides a Gini-based variable importance method that corrects the bias arising from differing numbers of categories (minor-allele-frequency bias in GWAS) and also shows some promising results regarding correlation issues. The...

amnonbleich

variant name in the output file

1

The "Biallelic" option in the current version allows for two different representations of variants in the output file: CHR_POS and CHR_POS_REF_ALT. I was wondering if this option is extended...

ArashBayatDev

Optimised tree growing method

1

I recommend the following improvements to the VariantSpark Random Forest importance analysis. 1. Compute and write the importance scores to a file after every 1000 trees are built. 2. Automatically identify when enough...

ArashBayatDev

pom.xml has the same dependency twice

1

The dependency org.json4s:json4s-ext_${scala.binary.version}:3.2.11 is declared twice, which affects the Maven build.

Yatish0833

Importance is slightly biased towards last variables.

1

The procedure for selecting split variables in case of equal reduction in impurity is slightly biased towards variables with larger indexes. In the previous non-reproducible approach it was caused by...

piotrszul

Tree building performance improvements

Some ideas to consider for improved performance: * splits coming from a single variable are likely to be very sparse, so it may not make sense to return...

piotrszul

enhancement

Random forest runs slower on sparse input

1

This is noticeable by comparing runtime on sparse vs dense synthetic regression datasets. The sparse ones run much slower although intuitively they should run faster.

piotrszul

Allow the Spark-based tests to use different Spark contexts

1

Make it somehow possible to group tests based on the Spark context they need. Currently only one context is possible for all tests, while three different contexts are needed -...

piotrszul

techdebt

Batch size

When using the VariantSpark interface for Hail, a large batch size can crash the process. For example, for the following setup a batch size of 250 result...

ArashBayatDev
