Decisions required to reach a minimum viable product

dhimmel opened this issue · 6 comments

We're nearing the point where we'll need to implement a machine learning module to execute user queries. We're looking to create a minimum viable product. We can expand functionality later, but for now let's focus on the simplest and most succinct implementation. There are several decisions to make:

  1. Classifier: which classifiers should we support? If we want to support only a single classifier for now, which one?
  2. Predictions: do we want to return probabilities, scores, or class predictions?
  3. Threshold: do we want to report performance measures that depend on a single classification threshold? Or do we want to report measures that span thresholds?
  4. Testing: Do we want to use a testing partition in addition to cross-validation? If so, do we refit a model on all observations?
  5. Features: should we include covariates in addition to expression features (see #21)?
  6. Feature selection: Do we want to perform any feature selection?
  7. Feature extraction: do we want to perform feature extraction, such as PCA (see #43)?

So let's work out these choices, with a focus on simplicity.

dhimmel · Sep 12 '16

Here are my thoughts:

  1. Classifier: sklearn.linear_model.SGDClassifier with a grid search to find the optimal l1_ratio and alpha. See 2.TCGA-MLexample.ipynb for an example, and the sketch after this list.
  2. Predictions: let's return all three, using the object names probability, score, and class under a predictions key. The frontend should handle cases where probability is absent.
  3. Threshold: Both.
  4. Testing: Let's hold out 10% for testing.
  5. Features: deferring this decision pending the maturity of #21.
  6. Feature selection: let's do MAD (median absolute deviation) feature selection down to 8000 genes, based on @yl565's findings in https://github.com/cognoma/machine-learning/issues/22#issuecomment-238113032. This should help speed up fitting the elastic net without too much performance loss.
  7. Feature extraction: deferring this decision pending the maturity of #43.
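
To make this concrete, here's a minimal sketch of how decisions 1, 2, 4, and 6 could fit together with scikit-learn. The toy data, the mad_scores helper, the grid values, and the variable names are all illustrative placeholders, not the module's final API:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy stand-ins for the real TCGA data, just so the sketch runs end to end.
rng = np.random.RandomState(0)
X = rng.normal(size=(200, 10000))  # 200 samples x 10000 genes
y = rng.randint(0, 2, size=200)    # binary mutation status

def mad_scores(X, y=None):
    """Median absolute deviation of each gene, for SelectKBest ranking."""
    median = np.median(X, axis=0)
    return np.median(np.abs(X - median), axis=0)

pipeline = Pipeline([
    ('select', SelectKBest(mad_scores, k=8000)),  # decision 6: MAD down to 8000 genes
    ('scale', StandardScaler()),
    ('classify', SGDClassifier(loss='log',  # renamed 'log_loss' in newer scikit-learn
                               penalty='elasticnet', random_state=0)),
])

# Decision 1: grid search over the elastic net hyperparameters.
param_grid = {
    'classify__alpha': [10 ** x for x in range(-4, 1)],
    'classify__l1_ratio': [0.0, 0.15, 0.5, 1.0],
}
cv_search = GridSearchCV(pipeline, param_grid, scoring='roc_auc')

# Decision 4: hold out 10% of observations for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, stratify=y, random_state=0)
cv_search.fit(X_train, y_train)

# Decision 2: all three prediction types under a predictions key.
predictions = {
    'probability': cv_search.predict_proba(X_test)[:, 1],
    'score': cv_search.decision_function(X_test),
    'class': cv_search.predict(X_test),
}
```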

@gwaygenomics, @yl565, @stephenshank: do you agree?

dhimmel · Sep 14 '16

Can you clarify what you mean by number 3?

> Or do we want to report measures that span thresholds?

Like AUROC?

gwaybio · Sep 14 '16

By "span thresholds" I'm referring to any measure computed from predicted probabilities/scores, such as AUROC or AUPRC. By "single classification threshold", I'm referring to any measure computed from predicted classes, such as precision, recall, accuracy, or F1 score.

dhimmel · Sep 14 '16

Got it. Then yes, this all looks good to me.

gwaybio · Sep 14 '16

+1

yl565 · Sep 14 '16

Sounds good!

htcai · Sep 18 '16