neuropredict
Best options choice for classification of small and unbalanced dataset
Hi Pradeep,
For a small and unbalanced dataset, do you recommend using `-t 0.8` or `-t 0.9`?
Is it possible to deactivate feature selection in the implemented pipeline? If not, what is the advantage of always using feature selection when dealing with a dataset with few features?
Best, Matthieu
`-k all` is equivalent to no feature selection.

There is no way to tell which training percentage (80% or 90%) is best. Depending on the sample size, you want to ensure there is enough training data (which helps improve performance), while also keeping the test set reasonably sized. If the test set is too small, the violin plots will show large variance. So pick accordingly.
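To see this trade-off concretely, here is a small sketch (using scikit-learn directly, not neuropredict's own pipeline) of how the test-set sizes work out for a sample like the one discussed below, 75 controls and 15 patients, under the two training fractions:

```python
# Sketch (not part of neuropredict): test-set sizes for a small,
# imbalanced sample of 75 CN vs 15 AD under different train fractions.
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

y = np.array([0] * 75 + [1] * 15)   # 75 CN (class 0), 15 AD (class 1)
X = np.zeros((len(y), 1))           # dummy features, split depends only on y

for train_frac in (0.8, 0.9):
    splitter = StratifiedShuffleSplit(n_splits=1, train_size=train_frac,
                                      random_state=0)
    train_idx, test_idx = next(splitter.split(X, y))
    n_minority_test = int((y[test_idx] == 1).sum())
    print(f"train_size={train_frac}: {len(test_idx)} test samples, "
          f"of which {n_minority_test} from the minority class")
```

With `-t 0.9` the stratified test set holds only 9 subjects and one or two AD cases, so each repetition's balanced accuracy moves in very coarse steps, which is exactly what inflates the violin-plot variance.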
Hi Pradeep,
I tried both, and indeed the violin plots have large variance compared to those shown in the neuropredict documentation (I have 75 CN and 15 AD).
Below with `-t 0.9`: balanced_accuracy.pdf
and below with `-t 0.8`: balanced_accuracy.pdf
- Based on these violin plots, isn't the 80% training better (less variance)?
- How could I determine the best set of features? Just by comparing the medians of the 3 violin plots in my figures above, or are there other metrics to look at?
- Where could I find the mean balanced accuracy, sensitivity and specificity?
- In these binary classification cases, aren't ROC curves plotted?
- No clear answer there. I'd report both (one in the main text, the other in the supplementary?).
- You can run significance tests on the data saved in the CSV files; look in the exported_results folder.
- They are not exported by default; I will add them to the exported results soon.
- Not all predictive models have a natural ROC associated with them, hence it is not produced by default. I'll implement it soon. The current results should be enough to include in your paper?
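For the significance tests mentioned above, a minimal sketch could look like the following. The file name and column layout are assumptions for illustration; check the exported_results folder for the actual files neuropredict writes:

```python
# Sketch, not neuropredict's own API: comparing the balanced-accuracy
# distributions of two feature sets with a paired non-parametric test.
import numpy as np
from scipy.stats import wilcoxon

# Suppose each column holds balanced accuracy across CV repetitions for
# one feature set, e.g. (hypothetical file name):
# acc = np.loadtxt('exported_results/balanced_accuracy.csv', delimiter=',')
rng = np.random.default_rng(0)
acc = rng.normal([0.70, 0.74], 0.05, size=(50, 2))  # stand-in for the CSV

stat, p_value = wilcoxon(acc[:, 0], acc[:, 1])
print(f"Wilcoxon signed-rank: statistic={stat:.1f}, p={p_value:.4f}")
```

One caveat: repeated CV splits reuse the same subjects, so the samples are not fully independent and such p-values tend to be optimistic; treat them as a rough guide rather than a definitive test.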
- Shouldn't we prefer the violin plot with less variance, i.e. the 80% one?
- What kind of significance tests, and on which CSV files could I run them?
- OK, thanks. In the meantime, is there any way I can deduce sensitivity and specificity from the currently produced results?
- Yes. Use the confusion matrices and the misclassification-rate plots to deduce alternative performance metrics.
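As a sketch of that derivation, sensitivity, specificity and balanced accuracy all fall out of a single 2x2 confusion matrix (the counts below are illustrative, not from the actual results):

```python
# Sketch: deriving sensitivity, specificity and balanced accuracy from a
# 2x2 confusion matrix (rows = true class, columns = predicted class).
import numpy as np

# Illustrative matrix for CN (row 0) vs AD (row 1).
cm = np.array([[70, 5],
               [4, 11]])

tn, fp, fn, tp = cm.ravel()
sensitivity = tp / (tp + fn)          # true positive rate: AD detected
specificity = tn / (tn + fp)          # true negative rate: CN detected
balanced_accuracy = (sensitivity + specificity) / 2
print(f"sensitivity={sensitivity:.3f}, specificity={specificity:.3f}, "
      f"balanced accuracy={balanced_accuracy:.3f}")
```

With per-repetition confusion matrices, computing these for each repetition and averaging gives the mean sensitivity and specificity asked about above.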