Topcuoglu_ML_mBio_2020 icon indicating copy to clipboard operation
Topcuoglu_ML_mBio_2020 copied to clipboard

make permutation importance test optional, functionality to use any outcome variables, optional hyperparameter input

Open zenalapp opened this issue 6 years ago • 12 comments

Fixes #6 Fixes #7 Fixes #8 Fixes #10

zenalapp avatar Nov 11 '19 18:11 zenalapp

get_results(dataset, models, split_number, outcome="dx") errors out when outcome is defined by user. pipeline(dataset, models, split_number, outcome="dx") works. So it must be an issue when the argument is being passed from get_results to pipeline function.

BTopcuoglu avatar Nov 11 '19 22:11 BTopcuoglu

Define outcome variable and permutation logical as arguments that can be passed in the command line. We can now run from command line with:

Rscript code/learning/main.R 1 "L2_Logistic_Regression" "dx" 0

BTopcuoglu avatar Nov 12 '19 16:11 BTopcuoglu

Previous changes regarding setting outcome and perm on the command line work when the user defines the outcome (e.g. "dx") as an argument. However, if they leave that argument empty (passed as NA e.g. Rscript code/learning/main.R 1 "L2_Logistic_Regression" 0), the rest of the pipeline breaks (the first column does not get selected).

  1. I changed the order of the arguments in get_aucs function and then instead of NULL, I uses NA:

get_results <- function(dataset, models, split_number, perm=T, outcome=NA, hyperparameters=NULL)

  1. Changed pipeline function to have NA as well.

pipeline <- function(dataset, model, split_number, outcome=NA, hyperparameters=NULL, perm=T)

  1. Edited outcome=NULL argument from tuning_grid function because outcome infor should be decided in the previous functions already:

tuning_grid <- function(train_data, model, outcome, hyperparameters=NULL)

BTopcuoglu avatar Nov 12 '19 16:11 BTopcuoglu

With these changes, it looks like we Fixed #6 and #7. Next step is checking if permutation works as we want in #8.

BTopcuoglu avatar Nov 12 '19 17:11 BTopcuoglu

permutation_importance function doesn't work. Error:

Error in -sym(first_outcome) : invalid argument to unary operator
Calls: get_results ... <Anonymous> -> vars_select_eval -> map_if -> map -> .f

Caught the bug in line 87 and changed from first_outcome to outcome:

  non_correlated_otus <- full %>%
    select(-correlated_otus) %>%
    select(-sym(outcome)) %>%
    colnames()

BTopcuoglu avatar Nov 12 '19 17:11 BTopcuoglu

Nooo I was worried this wouldn't work but I wasn't doing permutation importance so I didn't catch it. I can look into another option if you don't know of one

zenalapp avatar Nov 12 '19 17:11 zenalapp

Nooo I was worried this wouldn't work but I wasn't doing permutation importance so I didn't catch it. I can look into another option if you don't know of one

I'm running it now with the change I've made (it was passing first_outcome which is not a column that can be selected but now uses sym("dx") which should work theoretically, I'll keep you posted.

BTopcuoglu avatar Nov 12 '19 17:11 BTopcuoglu

But don't we want to have dx not hard-coded?

zenalapp avatar Nov 12 '19 17:11 zenalapp

But don't we want to have dx not hard-coded?

No I know, I have it as:

  non_correlated_otus <- full %>%
    select(-correlated_otus) %>%
    select(-sym(outcome)) %>%
    colnames()

Instead of what it was before:

non_correlated_otus <- full %>%
    select(-correlated_otus) %>%
    select(-sym(first_outcome)) %>%
    colnames()

BTopcuoglu avatar Nov 12 '19 17:11 BTopcuoglu

Our previous attempt was unsuccessful - must be a bug with tidyverse. I made a new change:

  non_correlated_otus <- full %>%
    select(-correlated_otus)
  
  non_correlated_otus[,outcome] <- NULL
  
  non_correlated_otus <- non_correlated_otus %>%
    colnames()

Not the most beautiful code snippet I've written but it'll do:)

BTopcuoglu avatar Nov 12 '19 20:11 BTopcuoglu

#8 We now made permutation importance optional but the data structure to run permutation is still hardcoded. We need to come back to that and fix it.

I'll now check if user-defined hyperparameters work #10 .

BTopcuoglu avatar Nov 13 '19 18:11 BTopcuoglu

The NULL options for hyperparameters are currently specific to my CRC classification problem (except random forest, where we implement Pat's code: mtry <- floor(seq(1, n_features, length=6)) ). So those need to be adjusted and expanded in the future, but overall, I'm able to set up user-defined hyperparameters as a list and it works. Fixed #10.

BTopcuoglu avatar Nov 13 '19 19:11 BTopcuoglu