piotrszul comments

Results: 25 comments of piotrszul

Update the Cloud Marketplace example Notebooks to include logistic regression covariates

@BauerLab as far as I can tell, they are already included in the example notebook; see: https://bitbucket.csiro.au/users/hos076/repos/variantspark-aws/browse/data/monitor-ami/notebook/VariantSpark_example.ipynb (or: https://variantspark-marketplace-resources.s3.amazonaws.com/static/public/example_notebook.html) ``` covariates = [mt.pheno.isFemale, mt.pcs[0], mt.pcs[1]] result = hl.logistic_regression_rows(test ='wald',...

num Case/Control in JSON model file

Hi @ArashBayatDev, I have added a new attribute `classCounts` to the JSON tree nodes, which is an array with the count of samples from each of the classes. Also I...
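
A minimal sketch of how the per-node `classCounts` arrays might be consumed. Only the `classCounts` attribute comes from the comment above; the `left`/`right` child keys and the sample numbers are hypothetical and chosen purely for illustration, so the real model file layout may differ.

```python
import json

# Hypothetical model fragment: only `classCounts` is taken from the comment;
# the `left`/`right` keys and all numbers are illustrative assumptions.
model_json = """
{"classCounts": [6, 4],
 "left": {"classCounts": [5, 1]},
 "right": {"classCounts": [1, 3]}}
"""


def leaf_class_counts(node):
    """Sum the per-class sample counts over all leaves below `node`."""
    children = [node[key] for key in ("left", "right") if key in node]
    if not children:
        # A leaf: its own counts are the answer.
        return list(node["classCounts"])
    totals = [0] * len(node["classCounts"])
    for child in children:
        for class_index, count in enumerate(leaf_class_counts(child)):
            totals[class_index] += count
    return totals


root = json.loads(model_json)
# In a consistent tree the leaf totals match the counts stored at the root.
print(leaf_class_counts(root), root["classCounts"])
```

With two classes, index 0 and index 1 of each array would then hold the case and control counts (in whichever order the model defines).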

Error running mvn clean install

Hi @vishaln79, the problem is caused by the version of libstdc++ packaged with CentOS 7. It apparently ships with libstdc++-4.8.5, which supports CXXABI_1.3.7, while VariantSpark (and in particular the Hail library)...

AIR - unbiased gini-based importance score - Algorithm

I have implemented a basic version of AIR (with the `-ic` option). In addition, the `-icsr` option can be used to set the random seed for label permutation so that importance...
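
To illustrate why a seed for label permutation matters, here is a small sketch showing that a seeded shuffle yields the same permuted labels on every run. It is not the VariantSpark implementation: the function `permute_labels` and the sample labels are hypothetical, and only the idea of a seeded label permutation comes from the comment above.

```python
import random


def permute_labels(labels, seed):
    """Return a shuffled copy of `labels`; the same seed gives the same order."""
    rng = random.Random(seed)  # private generator, global state is untouched
    shuffled = list(labels)
    rng.shuffle(shuffled)
    return shuffled


labels = [0, 0, 0, 1, 1, 0, 1, 1, 0, 1]  # hypothetical case/control labels
first_run = permute_labels(labels, seed=13)
second_run = permute_labels(labels, seed=13)

# Same seed: identical permutation, hence repeatable downstream results.
print(first_run == second_run)
```

The permutation keeps the class balance intact (the multiset of labels is unchanged) and only breaks the association between samples and labels.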

AIR - unbiased gini-based importance score - Algorithm

Hi Amnon, good to hear from you and thanks for your comments :) - the main feature of AIR is that all the variables are permuted in the same way...

AIR - unbiased gini-based importance score - Algorithm

@amnonbleich I am very curious if you have any thoughts on how well AIR deals with correlation. As I understand it, it removes bias due to different types of variables or...

variant name in the output file

I think the flag (bi-allelic variants) was the result of my evolving (mis)understanding of how variants are represented in VCF files and more precisely, what constitutes a unique key, that...

Importance is slightly biased towards last variables.

Here is some interesting info on the randomness of various hashing algorithms: https://softwareengineering.stackexchange.com/questions/49550/which-hashing-algorithm-is-best-for-uniqueness-and-speed

Random forest runs slower on sparse input

This can be observed, for example, on the sparse synthetic datasets, e.g. `src/test/data/synth/synth_2000_500_fact_10_0.995-wide.csv`. The reason seems to be that very sparse data result in very deep and unbalanced trees...
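
A rough back-of-the-envelope sketch of why unbalanced splits lead to deep trees. It rests on a simplifying assumption that is not stated in the comment: each split on a sparse variable separates only a small fraction of the samples (the few non-zero ones) from the rest. The function name, the sample count of 2000, and the 0.5% fraction are illustrative choices, not values taken from VariantSpark.

```python
import math


def depth_along_larger_branch(n_samples, minority_fraction):
    """Count splits along the larger branch until one sample remains.

    Assumes every split peels off `minority_fraction` of the remaining
    samples (at least one sample), which is a deliberate simplification.
    """
    depth = 0
    remaining = n_samples
    while remaining > 1:
        peeled = max(1, math.floor(remaining * minority_fraction))
        remaining -= peeled
        depth += 1
    return depth


balanced_depth = depth_along_larger_branch(2000, 0.5)   # even splits
sparse_depth = depth_along_larger_branch(2000, 0.005)   # 0.5% peeled per split

print(balanced_depth, sparse_depth)
```

Under this assumption the balanced case needs a number of splits that grows logarithmically with the sample count, while the sparse case needs hundreds, which is consistent with the slowdown described above.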

Allow the spark based test to use different spark contexts

Here is a good resource on how to do it with maven: https://www.baeldung.com/maven-integration-test