
Best choice of options for classification of a small and imbalanced dataset

Open mattvan83 opened this issue 4 years ago • 5 comments

Hi Pradeep,

For a small and imbalanced dataset, do you recommend using -t 0.8 or -t 0.9?

Is it possible to deactivate feature selection in the implemented pipeline? If not, what is the advantage of always using feature selection when the dataset has only a few features?

Best, Matthieu

mattvan83 commented Oct 28 '19 08:10

-k all is equivalent to no feature selection.

There is no way to tell which training percentage (80% or 90%) is best! Depending on the sample size, you want to ensure there is enough training data (which helps improve performance), while also keeping the test set a reasonable size. If the test set is too small, the violin plots will have large variance. So pick accordingly.
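To make the tradeoff concrete, here is a minimal sketch (not part of neuropredict) using scikit-learn's StratifiedShuffleSplit on a made-up imbalanced dataset; the sample sizes and features are placeholders:

```python
# A minimal sketch, assuming a made-up small imbalanced dataset, showing how
# the training fraction (-t) changes the test-set size and class counts.
import numpy as np
from collections import Counter
from sklearn.model_selection import StratifiedShuffleSplit

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))        # 100 samples, 10 features (placeholder)
y = np.array([0] * 80 + [1] * 20)     # imbalanced labels: 80 vs 20

for train_size in (0.8, 0.9):
    sss = StratifiedShuffleSplit(n_splits=1, train_size=train_size,
                                 random_state=0)
    train_idx, test_idx = next(sss.split(X, y))
    print(f"-t {train_size}: train={len(train_idx)}, test={len(test_idx)}, "
          f"test class counts={dict(Counter(y[test_idx].tolist()))}")
```

With -t 0.9 the test set here has only 10 subjects, and only a couple from the minority class, so the per-repetition performance estimates can take only a handful of values, which is exactly what widens the violin plots.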

raamana commented Oct 28 '19 13:10

Hi Pradeep,

I tried both of them, and indeed the violin plots have a large variance compared to the violin plots shown in the neuropredict documentation (I have 75 CN and 15 AD).

Below with -t 0.9: balanced_accuracy.pdf

and below with -t 0.8: balanced_accuracy.pdf

  1. Based on these violin plots, isn't the 80% training split better (less variation)?
  2. How could I determine the best set of features? By just comparing the medians of the three violin plots in my figures above? Or are there other metrics to look at?
  3. Where could I find the mean balanced accuracy, sensitivity, and specificity?
  4. In these binary classification cases, aren't ROC curves plotted?

mattvan83 commented Oct 28 '19 13:10

  1. no clear answer there - I'd report both (one in the main text, the other in the supplementary material?)

  2. you can run significance tests on the data saved in the CSV files - look in the exported_results folder (a sketch follows after this list)

  3. they are not exported by default - I will add them to the exported results soon.

  4. Not all predictive models have a natural ROC associated with them, hence it is not produced by default. I'll implement it soon (a second sketch below shows one way to compute it yourself in the meantime). Current results should be enough to include in your paper?
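On item 2, a hedged sketch of one such comparison: the CSV paths and layout below are assumptions (check the exported_results folder for the actual file names), and the test is a Wilcoxon signed-rank test pairing the per-repetition balanced accuracies of two feature sets:

```python
# A hedged sketch, not neuropredict's own API: the CSV paths and their layout
# (one balanced-accuracy value per CV repetition) are assumptions; inspect the
# exported_results folder for the real files.
import numpy as np
from scipy.stats import wilcoxon

acc_a = np.loadtxt("exported_results/balanced_accuracy_featureset_A.csv",
                   delimiter=",")
acc_b = np.loadtxt("exported_results/balanced_accuracy_featureset_B.csv",
                   delimiter=",")

# Paired test, assuming both feature sets were evaluated on the same CV splits;
# use scipy.stats.mannwhitneyu instead if the repetitions are not paired.
stat, p_value = wilcoxon(acc_a, acc_b)
print(f"Wilcoxon statistic = {stat:.3f}, p = {p_value:.4f}")
```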
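On item 4, if you need a ROC curve in the meantime, one option is to re-fit a probabilistic classifier yourself on a single split and use scikit-learn; the random forest and the placeholder data below are assumptions, not something neuropredict does for you:

```python
# A minimal sketch with placeholder data and an arbitrary probabilistic
# classifier; this is the standard scikit-learn route to a ROC curve, not a
# neuropredict feature.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))        # placeholder features
y = np.array([0] * 80 + [1] * 20)     # placeholder imbalanced labels

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y,
                                          train_size=0.8, random_state=0)
clf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)[:, 1]  # probability of the positive class

fpr, tpr, thresholds = roc_curve(y_te, proba)
print(f"AUC = {roc_auc_score(y_te, proba):.3f}")
```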

raamana commented Oct 28 '19 16:10

  1. Don't we need to favor the violin plot with the lower accuracy, i.e., the 80% one?
  2. What kind of significance tests should I run, and on which .csv files?
  3. OK, thanks. Is there any way, for the moment, to deduce sensitivity and specificity from the results currently produced?
  4. Yes.

mattvan83 commented Oct 29 '19 09:10

Use the confusion matrices and the misclassification rate plots to deduce alternative performance metrics such as sensitivity and specificity; a sketch follows below.
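For instance, a small sketch of that deduction, with placeholder counts (substitute the values read off your own confusion-matrix plots):

```python
# Placeholder counts; read the real ones off neuropredict's confusion-matrix
# plots. Layout assumed here: rows = true class, columns = predicted class.
import numpy as np

cm = np.array([[70, 5],    # [TN, FP] for the CN (negative) class
               [6,  9]])   # [FN, TP] for the AD (positive) class
tn, fp, fn, tp = cm.ravel()

sensitivity = tp / (tp + fn)    # true-positive rate (AD recall)
specificity = tn / (tn + fp)    # true-negative rate (CN recall)
balanced_acc = (sensitivity + specificity) / 2

print(f"sensitivity = {sensitivity:.3f}, specificity = {specificity:.3f}, "
      f"balanced accuracy = {balanced_acc:.3f}")
```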

raamana commented Oct 29 '19 11:10